From duke at openjdk.org Sat Feb 1 00:48:27 2025 From: duke at openjdk.org (Johannes Graham) Date: Sat, 1 Feb 2025 00:48:27 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v22] In-Reply-To: References: Message-ID: > C2 does not eliminate XOR nodes with constant arguments. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This patch demonstrates a potential fix to the problem, but there might well be better ways to do it. Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: try fewer tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/d7bca8a3..112bcca8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=21 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=20-21 Stats: 22 lines in 1 file changed: 8 ins; 5 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From duke at openjdk.org Sat Feb 1 01:32:55 2025 From: duke at openjdk.org (Johannes Graham) Date: Sat, 1 Feb 2025 01:32:55 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v23] In-Reply-To: References: Message-ID: > C2 does not eliminate XOR nodes with constant arguments. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This patch demonstrates a potential fix to the problem, but there might well be better ways to do it. 
Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: re-add tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/112bcca8..2c28f3e9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=22 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=21-22 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From duke at openjdk.org Sat Feb 1 02:03:45 2025 From: duke at openjdk.org (Johannes Graham) Date: Sat, 1 Feb 2025 02:03:45 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v24] In-Reply-To: References: Message-ID: > C2 does not eliminate XOR nodes with constant arguments. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This patch demonstrates a potential fix to the problem, but there might well be better ways to do it. 
Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: add sanity asserts to tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/2c28f3e9..9effc3d8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=23 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=22-23 Stats: 10 lines in 1 file changed: 5 ins; 2 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From duke at openjdk.org Sat Feb 1 02:09:48 2025 From: duke at openjdk.org (Johannes Graham) Date: Sat, 1 Feb 2025 02:09:48 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v19] In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 10:31:48 GMT, Quan Anh Mai wrote: >> Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: >> >> move template def to header > > src/hotspot/share/opto/addnode.hpp line 257: > >> 255: >> 256: template >> 257: static S calc_xor_max(const S hi_0, const S hi_1) { > > If this is exposed in the header file, then I think it would be better to have it be a static function of a class to avoid convoluting the global namespace. Suggestion: Can you have `XorINode::calc_max` which will call a static `calc_xor_max` inside the cpp file. I've reorganized it. > test/hotspot/gtest/opto/test_xor_node.cpp line 50: > >> 48: >> 49: template >> 50: void test_exhaustive_values(S hi_0, S hi_1){ > > You may want to test exhaustively for the `hi` values here, too. 
E.g:
> 
>     void test_exhaustive(S limit) {
>       for (S hi0 = 0; hi0 <= limit; hi0++) {
>         for (S hi1 = 0; hi1 <= limit; hi1++) {
>           S max = calc_max(hi0, hi1);
>           for (S v0 = 0; v0 <= hi0; v0++) {
>             for (S v1 = 0; v1 <= hi1; v1++) {
>               S v = v0 | v1;
>               EXPECT_LE(v, max);
>             }
>           }
>         }
>       }
>     }

Was my intention that the test should have been doing that. I've cleaned it up to clarify it.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1938160654
PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1938160433

From duke at openjdk.org Sat Feb 1 06:25:47 2025
From: duke at openjdk.org (duke)
Date: Sat, 1 Feb 2025 06:25:47 GMT
Subject: Withdrawn: 8320998: RISC-V: C2 RoundDoubleModeV
In-Reply-To:
References:
Message-ID:

On Tue, 24 Sep 2024 16:01:47 GMT, Dingli Zhang wrote:

> Hi all,
> 
> This patch will add RoundDoubleModeV intrinsics for riscv64. The vector implementation is similar to the scalar version. Please take a look and have some reviews. Thanks a lot!
> 
> Just like https://github.com/openjdk/jdk/pull/17745, current test shows that, it bring performance gain when vlenb >= 32 (which is on k1), but bring regression when vlenb == 16 (which is on k230). So I only enable the intrinsic when vlenb >= 32.
> 
> Please compare the data below, thanks!
> 
> ## Test
> ### Test on k1
> test/hotspot/jtreg/compiler/c2/cr6340864/TestDoubleVect.java
> test/hotspot/jtreg/compiler/floatingpoint/TestRound.java
> test/jdk/java/lang/Math/RoundTests.java
> test/micro/org/openjdk/bench/java/math/FpRoundingBenchmark.java
> ### Test on qemu(enable RVV1.0)
> test/jdk/jdk/incubator/vector/*
> 
> ## Performance - with Intrinsic
> ### on k1
> Benchmark on k1 (+intrinsic)
> 
> Benchmark                       (TESTSIZE)   Mode  Cnt   Score   Error   Units
> FpRoundingBenchmark.test_ceil         2048  thrpt   15  58.973 ± 0.460  ops/ms
> FpRoundingBenchmark.test_floor        2048  thrpt   15  59.873 ± 0.054  ops/ms
> FpRoundingBenchmark.test_rint         2048  thrpt   15  59.460 ± 0.552  ops/ms
> 
> 
> Benchmark on k1 (-intrinsic)
> 
> Benchmark                       (TESTSIZE)   Mode  Cnt   Score   Error   Units
> FpRoundingBenchmark.test_ceil         2048  thrpt   15  51.335 ± 0.068  ops/ms
> FpRoundingBenchmark.test_floor        2048  thrpt   15  51.356 ± 0.062  ops/ms
> FpRoundingBenchmark.test_rint         2048  thrpt   15  51.387 ± 0.059  ops/ms
> 
> ### on k230
> Benchmark on k230 (+intrinsic, enable intrinsic even when vlenb == 16)
> 
> Benchmark                       (TESTSIZE)   Mode  Cnt   Score   Error   Units
> FpRoundingBenchmark.test_ceil         2048  thrpt   15  28.263 ± 0.837  ops/ms
> FpRoundingBenchmark.test_floor        2048  thrpt   15  28.130 ± 0.789  ops/ms
> FpRoundingBenchmark.test_rint         2048  thrpt   15  28.241 ± 0.868  ops/ms
> 
> 
> Benchmark on k230 (-intrinsic, enable intrinsic even when vlenb == 16)
> 
> Benchmark                       (TESTSIZE)   Mode  Cnt   Score   Error   Units
> FpRoundingBenchmark.test_ceil         2048  thrpt   15  44.391 ± 1.249  ops/ms
> FpRoundingBenchmark.test_floor        2048  thrpt   15  44.423 ± 1.187  ops/ms
> FpRoundingBenchmark.test_rint         2048  thrpt   15  44.441 ± 1.218  ops/ms
> 
> 
> ## Performance - without Intrinsic
> ### on k1, intrinsic disabled due to -Us...

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/jdk/pull/21164

From qamai at openjdk.org Sat Feb 1 09:19:48 2025
From: qamai at openjdk.org (Quan Anh Mai)
Date: Sat, 1 Feb 2025 09:19:48 GMT
Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v24]
In-Reply-To:
References:
Message-ID:

On Sat, 1 Feb 2025 02:03:45 GMT, Johannes Graham wrote:

>> C2 does not eliminate XOR nodes with constant arguments. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic.
>> 
>> This patch demonstrates a potential fix to the problem, but there might well be better ways to do it.
> > Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: > > add sanity asserts to tests Very nice, I think the patch looks good, please do another round of style refinement. In particular, make sure that there is no white space after `(` or before `)`, and after `if` or `for` we prefer having a whitespace before the `(`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23089#issuecomment-2628870825 From duke at openjdk.org Sat Feb 1 10:57:36 2025 From: duke at openjdk.org (Matthias Ernst) Date: Sat, 1 Feb 2025 10:57:36 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v15] In-Reply-To: References: Message-ID: > Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. > > Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: > > > (base + (index + 1) << 8) & 255 > => MulNode > (base + (index << 8 + 256)) & 255 > => AddNode > ((base + index << 8) + 256) & 255 > > > Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: > > > ((base + index << 8) + 256) & 255 > => MulNode (this PR) > (base + index << 8) & 255 > => MulNode (PR #6697) > base & 255 (loop invariant) > > > Implementation notes: > * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. > * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ > * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). 
Let me know if would like to see separate cases for these.~ Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: dropped bug ref. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22856/files - new: https://git.openjdk.org/jdk/pull/22856/files/397cf15f..0e9de51d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=13-14 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/22856.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22856/head:pull/22856 PR: https://git.openjdk.org/jdk/pull/22856 From duke at openjdk.org Sat Feb 1 12:41:02 2025 From: duke at openjdk.org (altrisi) Date: Sat, 1 Feb 2025 12:41:02 GMT Subject: RFR: 8333893: Optimization for StringBuilder append boolean & null [v20] In-Reply-To: References: Message-ID: On Fri, 18 Oct 2024 21:56:53 GMT, Shaojin Wen wrote: >> After PR https://github.com/openjdk/jdk/pull/16245, C2 optimizes stores into primitive arrays by combining values ??into larger stores. >> >> This PR rewrites the code of appendNull and append(boolean) methods so that these two methods can be optimized by C2. > > Shaojin Wen has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 26 additional commits since the last revision: > > - Merge remote-tracking branch 'origin/optim_str_builder_append_202406' into optim_str_builder_append_202406 > - fix build error > - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406 > - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406 > - Merge remote-tracking branch 'origin/optim_str_builder_append_202406' into optim_str_builder_append_202406 > - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406 > - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406 > - revert test > - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406 > - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406 > - ... and 16 more: https://git.openjdk.org/jdk/compare/26d36e92...457735c9 I have concerns with the changes here, because of what it can cause when an (incorrect) program is sharing a StringBuilder between multiple threads. Even without any reordering, a thread could start an `append(boolean|null)`, pass the `ensureCapacityInternal` call, then another thread up the count by enough to no longer have enough capacity. Then the first thread could proceed to read the new (no longer enough!) count, and write without any bounds checks outside of the array. Easy to reproduce with a debugger and some breakpoints. The previous code would have thrown an exception at an explicit or implicit bounds check, but with the changes here that'd no longer happen.
For example

|(initial state)|
|-----------------|
| value.length = 16 |
| count = 11 |

| Thread 1 | Thread 2 |
|---------------------------|--------------------|
| append(true) | |
| - ensureCap(11+4) | |
| -- nothing to do | |
| | append("str") |
| | - ensureCap(11+3) |
| | -- nothing to do |
| | - ... |
| | - this.count <- 14 |
| - cnt <- this.count (14!) | |
| - val <- this.value | |
| - val[cnt] <-u 't' | |
| - val[cnt+1] <-u 'r' | |
| - val[cnt+2 (16!)] <-u 'u' | |
| - ... | |
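The interleaving in the table can be replayed deterministically on a single thread. The following is only a sketch under stated assumptions: `SharedBuilderReplay`, its `value` and `count` fields, and `ensureCap` are hypothetical stand-ins and not the JDK's `AbstractStringBuilder`. It uses ordinary bounds-checked array stores, so it shows the behaviour attributed above to the previous code (an exception at the third store), not the unchecked write itself.

```java
import java.util.Arrays;

public class SharedBuilderReplay {
    // Hypothetical stand-ins for the builder's backing array and length.
    static byte[] value;
    static int count;

    // Grows the backing array only when the requested size does not fit.
    static void ensureCap(int minimum) {
        if (minimum > value.length) {
            value = Arrays.copyOf(value, Math.max(minimum, value.length * 2));
        }
    }

    // Replays the two-thread schedule from the table on one thread.
    // Returns the index at which a checked store fails, or -1 if none does.
    static int replay() {
        value = new byte[16];
        count = 11;

        // Thread 1: append(true) checks capacity; 11 + 4 <= 16, nothing to do.
        ensureCap(count + 4);

        // Thread 2: append("str") runs to completion; 11 + 3 <= 16.
        ensureCap(count + 3);
        byte[] str = {'s', 't', 'r'};
        System.arraycopy(str, 0, value, count, str.length);
        count += str.length; // count is now 14

        // Thread 1 resumes and reads the updated count without re-checking capacity.
        int cnt = count;
        byte[] val = value;
        byte[] chars = {'t', 'r', 'u', 'e'};
        for (int i = 0; i < chars.length; i++) {
            try {
                val[cnt + i] = chars[i]; // ordinary, bounds-checked store
            } catch (ArrayIndexOutOfBoundsException e) {
                return cnt + i;
            }
        }
        count = cnt + chars.length;
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("checked store failed at index " + replay());
    }
}
```

With checked stores the replay stops at index 16, one past the end of the 16-element array. The concern raised above is that an unchecked store at that point would have nothing to stop it.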
-------------

PR Comment: https://git.openjdk.org/jdk/pull/19626#issuecomment-2628937397

From swen at openjdk.org Sat Feb 1 13:45:06 2025
From: swen at openjdk.org (Shaojin Wen)
Date: Sat, 1 Feb 2025 13:45:06 GMT
Subject: RFR: 8333893: Optimization for StringBuilder append boolean & null [v20]
In-Reply-To:
References:
Message-ID:

On Fri, 18 Oct 2024 21:56:53 GMT, Shaojin Wen wrote:

>> After PR https://github.com/openjdk/jdk/pull/16245, C2 optimizes stores into primitive arrays by combining values into larger stores.
>> 
>> This PR rewrites the code of appendNull and append(boolean) methods so that these two methods can be optimized by C2.
> 
> Shaojin Wen has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 26 additional commits since the last revision:
> 
>  - Merge remote-tracking branch 'origin/optim_str_builder_append_202406' into optim_str_builder_append_202406
>  - fix build error
>  - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
>  - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
>  - Merge remote-tracking branch 'origin/optim_str_builder_append_202406' into optim_str_builder_append_202406
>  - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
>  - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
>  - revert test
>  - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
>  - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
>  - ... and 16 more: https://git.openjdk.org/jdk/compare/95d177e1...457735c9

AbstractStringBuilder's value/coder/count are not volatile. If you use StringBuilder, you may not get the correct result. If you want thread safety, you should use StringBuffer.
-------------

PR Comment: https://git.openjdk.org/jdk/pull/19626#issuecomment-2628958534

From duke at openjdk.org Sat Feb 1 15:20:07 2025
From: duke at openjdk.org (altrisi)
Date: Sat, 1 Feb 2025 15:20:07 GMT
Subject: RFR: 8333893: Optimization for StringBuilder append boolean & null [v20]
In-Reply-To:
References:
Message-ID:

On Fri, 18 Oct 2024 21:56:53 GMT, Shaojin Wen wrote:

>> After PR https://github.com/openjdk/jdk/pull/16245, C2 optimizes stores into primitive arrays by combining values into larger stores.
>> 
>> This PR rewrites the code of appendNull and append(boolean) methods so that these two methods can be optimized by C2.
> 
> Shaojin Wen has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 26 additional commits since the last revision:
> 
>  - Merge remote-tracking branch 'origin/optim_str_builder_append_202406' into optim_str_builder_append_202406
>  - fix build error
>  - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
>  - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
>  - Merge remote-tracking branch 'origin/optim_str_builder_append_202406' into optim_str_builder_append_202406
>  - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
>  - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
>  - revert test
>  - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
>  - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
>  - ... and 16 more: https://git.openjdk.org/jdk/compare/54dfd56d...457735c9

Sure, I already mentioned such program would be incorrect, but one thing is an incorrect result in the StringBuilder and another one is writing outside the bounds of the array into other parts (possibly objects) of the heap, or even outside, causing a VM crash.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/19626#issuecomment-2628991914

From duke at openjdk.org Sat Feb 1 17:09:11 2025
From: duke at openjdk.org (Johannes Graham)
Date: Sat, 1 Feb 2025 17:09:11 GMT
Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v25]
In-Reply-To:
References:
Message-ID: <5TMrqMXNNsCanL4BuXdG2pidWEpv--OD5a9DEg98cn0=.915a1ead-fc26-4268-bda5-8ef5ef2bae3f@github.com>

> C2 does not eliminate XOR nodes with constant arguments. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic.
> 
> This patch demonstrates a potential fix to the problem, but there might well be better ways to do it.
Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: formatting ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/9effc3d8..ab804b88 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=24 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=23-24 Stats: 15 lines in 1 file changed: 0 ins; 3 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From duke at openjdk.org Sat Feb 1 19:27:01 2025 From: duke at openjdk.org (Johannes Graham) Date: Sat, 1 Feb 2025 19:27:01 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v26] In-Reply-To: References: Message-ID: <6sWBRolcCZXOe1pXDSfyBUvtfEuzV1MdMXUVpji42_4=.6c5f7da0-ee1d-485f-a99d-d8c520002dbd@github.com> > C2 does not eliminate XOR nodes with constant arguments. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This patch demonstrates a potential fix to the problem, but there might well be better ways to do it. 
Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: add IR tests for long, simplify tests for int ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/ab804b88..cf779497 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=25 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=24-25 Stats: 86 lines in 2 files changed: 73 ins; 4 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From duke at openjdk.org Sat Feb 1 19:33:59 2025 From: duke at openjdk.org (Johannes Graham) Date: Sat, 1 Feb 2025 19:33:59 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v26] In-Reply-To: <6sWBRolcCZXOe1pXDSfyBUvtfEuzV1MdMXUVpji42_4=.6c5f7da0-ee1d-485f-a99d-d8c520002dbd@github.com> References: <6sWBRolcCZXOe1pXDSfyBUvtfEuzV1MdMXUVpji42_4=.6c5f7da0-ee1d-485f-a99d-d8c520002dbd@github.com> Message-ID: On Sat, 1 Feb 2025 19:27:01 GMT, Johannes Graham wrote: >> C2 does not eliminate XOR nodes with constant arguments. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. >> >> This patch demonstrates a potential fix to the problem, but there might well be better ways to do it. > > Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: > > add IR tests for long, simplify tests for int Thanks. I've done another round of format fixing. I've also simplified the IR tests so they don't try to cover as much as gtest does, and added equivalent tests for long. I have temporarily left the more elaborate tests commented out in XorINodeIdealizationTests. I will remove them if nobody thinks they are worth keeping. 
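For readers following the thread, the property the gtest and IR tests discussed here are checking can be sketched in plain Java. This is an illustration under assumptions, not the patch's C++ code: `XorBound`, `xorUpperBound`, and `holdsExhaustively` are hypothetical names, and the bound shown (all ones up to the highest bit set in either limit) is one valid choice that may differ from what the PR computes. It assumes both limits are non-negative.

```java
public class XorBound {
    // Upper bound for x ^ y when 0 <= x <= hi0 and 0 <= y <= hi1.
    // Precondition (assumed): hi0 >= 0 and hi1 >= 0.
    static int xorUpperBound(int hi0, int hi1) {
        int m = hi0 | hi1;
        if (m == 0) {
            return 0; // both operands can only be zero
        }
        // All ones up to and including the highest bit set in either limit.
        return -1 >>> Integer.numberOfLeadingZeros(m);
    }

    // Checks the bound against every value pair for every pair of limits up to 'limit'.
    static boolean holdsExhaustively(int limit) {
        for (int hi0 = 0; hi0 <= limit; hi0++) {
            for (int hi1 = 0; hi1 <= limit; hi1++) {
                int max = xorUpperBound(hi0, hi1);
                for (int v0 = 0; v0 <= hi0; v0++) {
                    for (int v1 = 0; v1 <= hi1; v1++) {
                        if ((v0 ^ v1) > max) {
                            return false;
                        }
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(xorUpperBound(5, 3));   // limits 0b101 and 0b011
        System.out.println(holdsExhaustively(63));
    }
}
```

The exhaustive loop has the same shape as the reviewer's suggestion earlier in the thread: iterate over both limits, then over every value within them.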
------------- PR Comment: https://git.openjdk.org/jdk/pull/23089#issuecomment-2629081832 From jkarthikeyan at openjdk.org Sat Feb 1 22:31:55 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Sat, 1 Feb 2025 22:31:55 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 11:22:47 GMT, Jatin Bhateja wrote: > Math.copySign is only intrinsified on x86 targets supporting the AVX512 feature. > Intel E-core Xeons support only the AVX2 feature set and still compile Java implementation which is composed of logical operations. > > Since there is a 3-cycle penalty for copying incoming float/double values to GPRs before being operated upon by logical operation there is an opportunity to optimize this using an efficient instruction sequence. > > Patch uses ANDPS and ANDPD logical instruction to generate efficient instruction sequences to absorb domain copy over penalty. Also, performs minor tuning for existing AVX512 instruction sequence based on VPTERNLOG instruction. 
> > Following are the performance numbers of the following existing microbenchmark > https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/vm/compiler/Signum.java > > Patch passes following validation test > [test/jdk/java/lang/Math/IeeeRecommendedTests.java > ](https://github.com/openjdk/jdk/blob/master/test/jdk/java/lang/Math/IeeeRecommendedTests.java) > > > Granite Rapids-AP (P-core Xeon) > Baseline AVX512: > Benchmark Mode Cnt Score Error Units > Signum._5_copySignFloatTest thrpt 2 1296.141 ops/ns > Signum._7_copySignDoubleTest thrpt 2 838.954 ops/ns > > Withopt : > Benchmark Mode Cnt Score Error Units > Signum._5_copySignFloatTest thrpt 2 940.240 ops/ns > Signum._7_copySignDoubleTest thrpt 2 967.370 ops/ns > > Baseline AVX2: > Benchmark Mode Cnt Score Error Units > Signum._5_copySignFloatTest thrpt 2 63.673 ops/ns > Signum._7_copySignDoubleTest thrpt 2 26.898 ops/ns > > Withopt : > Benchmark Mode Cnt Score Error Units > Signum._5_copySignFloatTest thrpt 2 785.801 ops/ns > Signum._7_copySignDoubleTest thrpt 2 558.710 ops/ns > > Sierra Forest (E-core Xeon) > Baseline: > Benchmark (seed) Mode Cnt Score Error Units > o.o.b.vm.compiler.Signum._5_copySignFloatTest N/A thrpt 2 40.528 ops/ns > o.o.b.vm.compiler.Signum._7_copySignDoubleTest N/A thrpt 2 25.101 ops/ns > > Withopt: > Benchmark (seed) Mode Cnt Score Error Units > o.o.b.vm.compiler.Signum._5_copySignFloatTest N/A thrpt 2 676.101 ops/ns > o.o.b.vm.compiler.Signum._7_copySignDoubleTest N/A thrpt 2 ... I think this is a good improvement! Having more intrinsics available for AVX2 targets is nice. I've left some comments below. src/hotspot/cpu/x86/x86.ad line 1613: > 1611: case Op_CopySignD: > 1612: case Op_CopySignF: > 1613: if (UseAVX < 1 || !is_LP64) { Should it be limited to just AVX2, or can the new rules work on AVX1 as well? Since they only use instructions that are available to AVX1. 
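For reference, the "logical operations" that the quoted description attributes to the Java fallback amount to masking the sign bit of one operand and the remaining bits of the other. The sketch below is written from that description, not copied from the JDK sources, so `CopySignBits` and its helpers are hypothetical names and the real `Math.copySign` may differ in detail. The `floatToRawIntBits` and `intBitsToFloat` round trips correspond to the float-to-integer domain transfers that the description identifies as the cost being avoided.

```java
public class CopySignBits {
    // Returns 'magnitude' with the sign bit taken from 'sign' (float variant).
    static float copySign(float magnitude, float sign) {
        int signBit = Float.floatToRawIntBits(sign) & 0x8000_0000;
        int magnitudeBits = Float.floatToRawIntBits(magnitude) & 0x7fff_ffff;
        return Float.intBitsToFloat(signBit | magnitudeBits);
    }

    // Same idea for double, using 64-bit masks.
    static double copySign(double magnitude, double sign) {
        long signBit = Double.doubleToRawLongBits(sign) & 0x8000_0000_0000_0000L;
        long magnitudeBits = Double.doubleToRawLongBits(magnitude) & 0x7fff_ffff_ffff_ffffL;
        return Double.longBitsToDouble(signBit | magnitudeBits);
    }

    public static void main(String[] args) {
        System.out.println(copySign(1.5f, -0.0f));
        System.out.println(copySign(-2.0, 3.0));
    }
}
```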
src/hotspot/cpu/x86/x86.ad line 6769: > 6767: > 6768: instruct copySignF_reg_avx(regF dst, regF src, regF xtmp1, regF xtmp2) %{ > 6769: predicate(!VM_Version::supports_avx512vl()); Suggestion: predicate(UseAVX > 0 && !VM_Version::supports_avx512vl()); Just to be a bit more explicit (and same for the one below). ------------- PR Review: https://git.openjdk.org/jdk/pull/23386#pullrequestreview-2588420458 PR Review Comment: https://git.openjdk.org/jdk/pull/23386#discussion_r1938356114 PR Review Comment: https://git.openjdk.org/jdk/pull/23386#discussion_r1938356134 From syan at openjdk.org Sun Feb 2 02:48:59 2025 From: syan at openjdk.org (SendaoYan) Date: Sun, 2 Feb 2025 02:48:59 GMT Subject: Integrated: 8349142: [JMH] compiler.MergeLoadBench.getCharBV fails In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 15:26:28 GMT, SendaoYan wrote: > Hi all, > > There are two JMH tests fails after [JDK-8344168](https://bugs.openjdk.org/browse/JDK-8344168) merged. This PR fix the JMH tests fails similar to https://github.com/oracle/graal/pull/10602, just remove the unnecessary Unsafe.ARRAY_BYTE_BASE_OFFSET at the second argument of VarHandle.get. > Change has been verified locally, test-fix only, no risk. This pull request has now been integrated. 
Changeset: 2cce5eeb Author: SendaoYan URL: https://git.openjdk.org/jdk/commit/2cce5eeb092b68b4e4ce6a8289a8aa567f47c973 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod 8349142: [JMH] compiler.MergeLoadBench.getCharBV fails Reviewed-by: liach ------------- PR: https://git.openjdk.org/jdk/pull/23393 From syan at openjdk.org Sun Feb 2 02:48:59 2025 From: syan at openjdk.org (SendaoYan) Date: Sun, 2 Feb 2025 02:48:59 GMT Subject: RFR: 8349142: [JMH] compiler.MergeLoadBench.getCharBV fails [v2] In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 16:42:30 GMT, Chen Liang wrote: >> SendaoYan has updated the pull request incrementally with one additional commit since the last revision: >> >> remove Unsafe.ARRAY_BYTE_BASE_OFFSET from second argument of VarHandle.get > > Looks good, thanks for fixing these misuses of Unsafe.ARRAY_BYTE_BASE_OFFSET. Looks like a copy-paste error when the benchmark was created, but this wasn't statically detected until the VarHandle type checks failed due to the long offsets. Thanks for the review. @liach ------------- PR Comment: https://git.openjdk.org/jdk/pull/23393#issuecomment-2629214396 From aph at openjdk.org Sun Feb 2 17:40:50 2025 From: aph at openjdk.org (Andrew Haley) Date: Sun, 2 Feb 2025 17:40:50 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: On Fri, 22 Nov 2024 00:32:10 GMT, Dean Long wrote: > replace C1 patching with deoptimization, like on DEOPTIMIZE_WHEN_PATCHING aarch64. It might be worth looking at that. My experiments at the time the AArch64 port was done indicated that only 10% of deoptimization on AArch64 was caused by patching events. Most of the 90% remaining was tiered events. 
It is possible to generate C1 code for AArch64 that is patchable, but the frequent additional indirections in generated code are worse for performance than the occasional deoptimization. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2629485139 From duke at openjdk.org Sun Feb 2 18:31:54 2025 From: duke at openjdk.org (Matthias Ernst) Date: Sun, 2 Feb 2025 18:31:54 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v16] In-Reply-To: References: Message-ID: > Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. > > Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: > > > (base + (index + 1) << 8) & 255 > => MulNode > (base + (index << 8 + 256)) & 255 > => AddNode > ((base + index << 8) + 256) & 255 > > > Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: > > > ((base + index << 8) + 256) & 255 > => MulNode (this PR) > (base + index << 8) & 255 > => MulNode (PR #6697) > base & 255 (loop invariant) > > > Implementation notes: > * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. > * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ > * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ Matthias Ernst has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 32 additional commits since the last revision: - Merge branch 'openjdk:master' into mernst/JDK-8346664 - dropped bug ref. - indent - consistently label failing cases due to Align requirements. - Apply suggestions from code review Co-authored-by: Emanuel Peter - disable `|` . comments. - "should never vectorize" only holds for long[] input. - completely disable IR rule - You can have two @IR rules with different applyIfs. - expect vectorization of i|7 instead of i&7 i&7 variant expect "= 0". - ... and 22 more: https://git.openjdk.org/jdk/compare/52a5997d...98da0e0a ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22856/files - new: https://git.openjdk.org/jdk/pull/22856/files/0e9de51d..98da0e0a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=15 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=14-15 Stats: 17118 lines in 260 files changed: 7406 ins; 6237 del; 3475 mod Patch: https://git.openjdk.org/jdk/pull/22856.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22856/head:pull/22856 PR: https://git.openjdk.org/jdk/pull/22856 From duke at openjdk.org Sun Feb 2 21:36:03 2025 From: duke at openjdk.org (Matthias Ernst) Date: Sun, 2 Feb 2025 21:36:03 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v17] In-Reply-To: References: Message-ID: > Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. > > Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: > > > (base + (index + 1) << 8) & 255 > => MulNode > (base + (index << 8 + 256)) & 255 > => AddNode > ((base + index << 8) + 256) & 255 > > > Currently, `256` is not being recognized as a shifted value. 
This PR enables further reduction: > > > ((base + index << 8) + 256) & 255 > => MulNode (this PR) > (base + index << 8) & 255 > => MulNode (PR #6697) > base & 255 (loop invariant) > > > Implementation notes: > * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. > * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ > * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: jlong, not long ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22856/files - new: https://git.openjdk.org/jdk/pull/22856/files/98da0e0a..58375582 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=15-16 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/22856.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22856/head:pull/22856 PR: https://git.openjdk.org/jdk/pull/22856 From amitkumar at openjdk.org Mon Feb 3 03:08:31 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Mon, 3 Feb 2025 03:08:31 GMT Subject: RFR: 8349193: compiler/intrinsics/TestContinuationPinningAndEA.java missing @requires vm.continuations Message-ID: As title says, test is missing require vm.continuations, as continuations are not yet supported on s390x, I saw this test failure in my recent builds. 
------------- Commit messages: - adds requires vm.continuations Changes: https://git.openjdk.org/jdk/pull/23412/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23412&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349193 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23412.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23412/head:pull/23412 PR: https://git.openjdk.org/jdk/pull/23412 From jkarthikeyan at openjdk.org Mon Feb 3 04:45:40 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 3 Feb 2025 04:45:40 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts Message-ID: Hi all, This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine:

                                              Baseline                    Patch
Benchmark                  (SIZE)  Mode  Cnt    Score    Error  Units    Score   Error  Units  Improvement
VectorSubword.intToByte      1024  avgt   12  200.049 ± 19.787  ns/op   56.228 ± 3.535  ns/op  (3.56x)
VectorSubword.intToShort     1024  avgt   12  179.826 ±  1.539  ns/op   43.332 ± 1.166  ns/op  (4.15x)
VectorSubword.shortToByte    1024  avgt   12  245.580 ±  6.150  ns/op   29.757 ± 1.055  ns/op  (8.25x)

I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated!
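As a rough illustration of the loops the benchmark names refer to, here is a minimal Java sketch of narrowing subword casts. The benchmark source is not shown in this message, so the class name `SubwordCastSketch` and the method bodies are assumptions; only the method names mirror `VectorSubword.intToByte`, `intToShort`, and `shortToByte`.

```java
import java.util.Arrays;

// Hypothetical sketch: loops that mix two element types and need a
// narrowing cast per element. Not the benchmark code from the PR.
public class SubwordCastSketch {
    // int -> byte: each element keeps only its low 8 bits.
    static void intToByte(int[] src, byte[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
    }

    // int -> short: each element keeps only its low 16 bits.
    static void intToShort(int[] src, short[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (short) src[i];
        }
    }

    // short -> byte: each element keeps only its low 8 bits.
    static void shortToByte(short[] src, byte[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
    }

    public static void main(String[] args) {
        int[] ints = {1, 257, -1, 128};
        byte[] bytes = new byte[ints.length];
        intToByte(ints, bytes);
        System.out.println(Arrays.toString(bytes));
    }
}
```

Whether C2 actually vectorizes such a loop depends on the backend reporting support for the cast, as the message describes; the sketch only shows the scalar semantics that a vectorized version would have to preserve.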
------------- Commit messages: - Subword vectorization Changes: https://git.openjdk.org/jdk/pull/23413/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23413&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8342095 Stats: 353 lines in 13 files changed: 331 ins; 3 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/23413.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23413/head:pull/23413 PR: https://git.openjdk.org/jdk/pull/23413 From jkarthikeyan at openjdk.org Mon Feb 3 05:13:49 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 3 Feb 2025 05:13:49 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 04:40:23 GMT, Jasmine Karthikeyan wrote: > Hi all, > This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine:
>
>                                               Baseline                    Patch
> Benchmark                  (SIZE)  Mode  Cnt    Score    Error  Units    Score   Error  Units  Improvement
> VectorSubword.intToByte      1024  avgt   12  200.049 ± 19.787  ns/op   56.228 ± 3.535  ns/op  (3.56x)
> VectorSubword.intToShort     1024  avgt   12  179.826 ±  1.539  ns/op   43.332 ± 1.166  ns/op  (4.15x)
> VectorSubword.shortToByte    1024  avgt   12  245.580 ±  6.150  ns/op   29.757 ± 1.055  ns/op  (8.25x)
>
> I've also added some IR tests and they pass on my linux x64 machine.
Thoughts and reviews would be appreciated! test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java line 27: > 25: /* > 26: * @test > 27: * @bug 8183390 8340010 8342095 Review note: This test was missing a `@bug` annotation, so I went through the history and added the ones that substantially modified the test. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1938788814 From epeter at openjdk.org Mon Feb 3 06:51:52 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 3 Feb 2025 06:51:52 GMT Subject: RFR: 8348659: AArch64: IR rule failure with compiler/loopopts/superword/TestSplitPacks.java In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 16:25:11 GMT, Bhavana Kilambi wrote: >> "test5a" in this file fails on Graviton3 (32B, SVE) as the compiler fails to match IR rules for vector size 2. This is because the minimum vector size for aarch64 machines is 8B and it does not support generation of vectors of 2 short values. >> >> Modified the IR rules to have two separate rules - one for sse4.1 and another for sve. >> >> The test now passes on Graviton3. > > Hi @eme64 , can you please review this patch as well? Thanks :) @Bhavana-Kilambi The patch looks good to me. I've launched some testing, just in case. Please ping me in 24h for an update ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23385#issuecomment-2630106873 From thartmann at openjdk.org Mon Feb 3 06:53:46 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 3 Feb 2025 06:53:46 GMT Subject: RFR: 8333697: C2: Hit MemLimit in PhaseCFG::global_code_motion In-Reply-To: References: Message-ID: On Mon, 13 Jan 2025 14:49:10 GMT, Roland Westrelin wrote: > I investigated the failure from the `Test.java` that's attached to the > bug. 
The failure with this test is only reproducible up to 8334060 > (Implementation of Late Barrier Expansion for G1) so experiments I > describe here are from the source code for the commit right before it. > > Peak malloc memory usage reported by NMT is: 1.3GB > > `PhaseCFG::global_code_motion()`, when `OptoRegScheduling` is true, > creates a `PhaseIFG` that's, when initialized, allocates `_adjs`: a > `maxlrg` array of `IndexSet`s that can contain up to `maxlrg`. > > `maxlrg` in this case is 122839. An `IndexSet` is an array of pointers > to a 256 bit bitset: one `IndexSet` array needs: > > > 122839 / 256 * 8 = 3832 > > > and there are of 122839: > > > 3832 * 122839 = ~470 MB > > > It turns out the `PhaseIFG` object when used from > `PhaseCFG::global_code_motion()` doesn't even use the `_adjs` > array. So a patch like: > > > diff --git a/src/hotspot/share/opto/chaitin.hpp b/src/hotspot/share/opto/chaitin.hpp > index cf02deb6019..4e5333bf181 100644 > --- a/src/hotspot/share/opto/chaitin.hpp > +++ b/src/hotspot/share/opto/chaitin.hpp > @@ -258,7 +258,7 @@ class PhaseIFG : public Phase { > VectorSet *_yanked; > > PhaseIFG( Arena *arena ); > - void init( uint maxlrg ); > + void init( uint maxlrg, bool no_adjs = false ); > > // Add edge between a and b. Returns true if actually added. 
> int add_edge( uint a, uint b ); > diff --git a/src/hotspot/share/opto/gcm.cpp b/src/hotspot/share/opto/gcm.cpp > index ebdefe597ff..fefd75a88c5 100644 > --- a/src/hotspot/share/opto/gcm.cpp > +++ b/src/hotspot/share/opto/gcm.cpp > @@ -1704,7 +1704,9 @@ void PhaseCFG::global_code_motion() { > rm_live.reset_to_mark(); // Reclaim working storage > IndexSet::reset_memory(C, &live_arena); > uint node_size = regalloc._lrg_map.max_lrg_id(); > - ifg.init(node_size); // Empty IFG > + ifg.init(node_size, true); // Empty IFG > regalloc.set_ifg(ifg); > regalloc.set_live(live); > regalloc.gather_lrg_masks(false); // Collect LRG masks > diff --git a/src/hotspot/share/opto/ifg.cpp b/src/hotspot/share/opto/ifg.cpp > index d12698121b9..e42121c2254 100644 > --- a/src/hotspot/share/opto/ifg.cpp > +++ b/src/hotspot/share/opto/ifg.cpp > @@ -42,18 +42,24 @@ > PhaseIFG::PhaseIFG( Arena *arena ) : Phase(Interference_Graph), _arena(arena) { > } > > -void PhaseIFG::init( uint maxlrg ) { > +void PhaseIFG::init( uint maxlrg, bool no_adjs ) { > _maxlrg = maxlrg; > _yanked = new (_arena) VectorSet(_arena); > _is_square = false; > // Make uninitialized adjacency lists > - ... My test script complains patching file src/hotspot/share/opto/loopnode.cpp Reversed (or previously applied) patch detected! Could you please merge with master? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23075#issuecomment-2630109726 From epeter at openjdk.org Mon Feb 3 07:07:51 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 3 Feb 2025 07:07:51 GMT Subject: RFR: 8348572: C2 compilation asserts due to unexpected irreducible loop [v2] In-Reply-To: References: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> <-c7xXeuSN-6QD-k6MA1-7Cv17ztENnt7Q0U6PprRrf0=.afd866ca-f6d8-4f09-8c66-ada7bd38c67f@github.com> Message-ID: On Fri, 31 Jan 2025 19:41:29 GMT, Vladimir Kozlov wrote: > The issue is we should avoid creating new irreducible loops. Absolutely. 
This is not a fix at all. But since we can catch that the state is wrong here, we should bail out instead of continuing on in production. That is at least a little improvement. The bug-fix would come in a second step, and may be much more complicated as it would have to reconsider what to do about `split_if`. Do we postpone until after loop-opts for example? Before considering such an involved fix, I would rather want to do this "defensive" patch first. Does that make sense? This also allows us to integrate [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570), which is currently blocked by the failing assert (which is now disabled, but bailout instead, can be enabled with the flag). > What if l is already marked as irreducible l->_irreducible = 1 by following code? I don't see check above for such case. If `l->_irreducible = 1`, then the corresponding node should have been marked as `MaybeIrreducibleEntry`, and so should also `m` be marked with `MaybeIrreducibleEntry`. If that is the case, we are fine, these `Region` were created at parsing and we currently consider that fine. But if one of the `Region` of the now irreducible loop was not marked accordingly already during parsing, then a new irreducible loop appeared during compilation - and that's not good. > And again come here but secondary_entry can be irreducible (it is Region for example or already marked) and we skip bailout. Why it is okay? If we marked a `Region` with `MaybeIrreducibleEntry`, then we treat it differently in some optimizations. For example, when the region loses a control input, we have to check if the loop is now dead, with a global connectivity search. For `LoopNode` losing the control input means we already know that the loop is dead, since that was the only loop entry. But for irreducible loops, losing one entry means we do not know if there is still a secondary entry or not. For context see [JDK-8280126](https://bugs.openjdk.org/browse/JDK-8280126). 
Does that answer your question? If not, we may have to talk about it offline ;) PS: The whole irreducible loop handling is still broken actually, see [JDK-8308675](https://bugs.openjdk.org/browse/JDK-8308675). We plan to eventually also disallow creation of irreducible loops at parsing. But that's a big project, and we were hoping for a student project. But we still also should not introduce new irreducible loops during parsing, and that's what we are dealing with here in this bug. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23363#discussion_r1938879466 From epeter at openjdk.org Mon Feb 3 07:29:56 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 3 Feb 2025 07:29:56 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v14] In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 08:01:30 GMT, Matthias Ernst wrote: >> Matthias Ernst has updated the pull request incrementally with two additional commits since the last revision: >> >> - indent >> - consistently label failing cases due to Align requirements. > > Thank you! All comment changes are in, and all (modulo Win still in progress) presubmit tests have passed in https://github.com/mernst-github/jdk/actions/runs/13067775176 . @mernst-github Thanks for the updates! The last tests seem to have passed on all platforms except windows x64, with some strange wrong-result failures. I think I'll re-run the tests to see if it persists; I'm not yet sure if it's because of your patch.
------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2630164582 From duke at openjdk.org Mon Feb 3 07:32:57 2025 From: duke at openjdk.org (Matthias Ernst) Date: Mon, 3 Feb 2025 07:32:57 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v17] In-Reply-To: References: Message-ID: On Sun, 2 Feb 2025 21:36:03 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. >> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: >> >> >> ((base + index << 8) + 256) & 255 >> => MulNode (this PR) >> (base + index << 8) & 255 >> => MulNode (PR #6697) >> base & 255 (loop invariant) >> >> >> Implementation notes: >> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. >> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ >> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > jlong, not long Yes, win was failing due to a mixup between long (32bit) and jlong. The last commit fixed the win presubmit for me. 
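The reduction quoted above rests on a simple arithmetic fact: adding any multiple of 256 to a value does not change its low 8 bits. A minimal Java sketch of that identity follows. It is not the C2 transformation itself; the class and method names are illustrative assumptions, and the shift is parenthesized explicitly as `(index + 1) << 8`, which is how the quoted expressions are meant to be read.

```java
// Hypothetical sketch: checks that the masked sum with a shifted,
// constant-offset index equals the masked base alone.
public class MaskedOffsetSketch {
    static final long MASK = 255;
    static final int SHIFT = 8;

    // (base + ((index + 1) << 8)) & 255, the shape that arises from "i + 1" indexing.
    static long maskedWithOffset(long base, long index) {
        return (base + ((index + 1) << SHIFT)) & MASK;
    }

    // base & 255, the loop-invariant form the optimization aims for.
    static long maskedBaseOnly(long base) {
        return base & MASK;
    }

    public static void main(String[] args) {
        long[] bases = {0L, 7L, 255L, 4096L + 3L, -1L};
        for (long base : bases) {
            for (long index = 0; index < 4; index++) {
                if (maskedWithOffset(base, index) != maskedBaseOnly(base)) {
                    throw new AssertionError("mismatch for base=" + base + ", index=" + index);
                }
            }
        }
        System.out.println("identity holds for sampled inputs");
    }
}
```

The identity also holds when the addition wraps around, because the low 8 bits of a two's-complement sum depend only on the low 8 bits of the operands.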
------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2630169058 From epeter at openjdk.org Mon Feb 3 07:37:55 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 3 Feb 2025 07:37:55 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v17] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 07:30:07 GMT, Matthias Ernst wrote: >> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: >> >> jlong, not long > > Yes, win was failing due to a mixup between long (32bit) and jlong. The last commit fixed the win presubmit for me. @mernst-github Ah, classic. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2630176594 From chagedorn at openjdk.org Mon Feb 3 09:03:47 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 3 Feb 2025 09:03:47 GMT Subject: RFR: 8349193: compiler/intrinsics/TestContinuationPinningAndEA.java missing @requires vm.continuations In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 03:04:03 GMT, Amit Kumar wrote: > As title says, test is missing require vm.continuations, as continuations are not yet supported on s390x, I saw this test failure in my recent builds. Looks good and trivial. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23412#pullrequestreview-2589284145 From thartmann at openjdk.org Mon Feb 3 09:28:54 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 3 Feb 2025 09:28:54 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: On Fri, 24 Jan 2025 20:37:32 GMT, Boris Ulasevich wrote: >> This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod.
The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache. >> >> The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density. >> >> Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1-2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark. >> >> The numbers. Immutable data constitutes **~30%** of the nmethod. Mutable data constitutes **~8%** of the nmethod. Example (statistics collected on the CodeCacheStress benchmark): >> - nmethod_count:134000, total_compilation_time: 510460ms >> - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms, >> - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB >> >> Functional testing: jtreg on arm/aarch/x86. >> Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks. >> >> Alternative solution (see comments): In the future, relocations can be moved to _immutable_data.
> > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > Force the use of movk in combination with adrp and ldr instructions to address scenarios > where os::malloc allocates buffers beyond the typical ±4GB range accessible with adrp
I see more failures in testing: compiler/jsr292/CallSiteDepContextTest.java fails on Windows with -XX:+UseZGC:
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (workspace\open\src\hotspot\share\code\dependencies.cpp:2038), pid=109676, tid=68072
# assert(call_site->is_a(vmClasses::CallSite_klass())) failed: sanity
Current thread (0x000002b4343ac600): JavaThread "MainThread" [_thread_in_vm, id=68072, stack(0x000000f121300000,0x000000f121400000) (1024K)]
Stack: [0x000000f121300000,0x000000f121400000], sp=0x000000f1213fe9f0, free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [jvm.dll+0x601c63] Dependencies::check_call_site_target_value+0x3e3 (dependencies.cpp:2038)
V [jvm.dll+0x60182f] Dependencies::DepStream::check_call_site_dependency+0x6f (dependencies.cpp:2145)
V [jvm.dll+0xd0898d] nmethod::check_dependency_on+0x7d (nmethod.cpp:2853)
V [jvm.dll+0x609a19] DependencyContext::mark_dependent_nmethods+0xb9 (dependencyContext.cpp:74)
V [jvm.dll+0xcd2ff2] MethodHandles::mark_dependent_nmethods+0xa2 (methodHandles.cpp:953)
V [jvm.dll+0xccbf2e] MHN_setCallSiteTargetNormal+0x1ee (methodHandles.cpp:1211)
C 0x000002b02377ebac (no source info available)
Same test also fails with:
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/workspace/open/src/hotspot/share/oops/compressedKlass.inline.hpp:88), pid=95938, tid=35843
# assert(nk >= _lowest_valid_narrow_klass_id && nk <= _highest_valid_narrow_klass_id) failed: narrowKlass ID out of range (3131947710)
--------------- T H R E A D ---------------
Current thread (0x0000000142865e10): JavaThread "MainThread"
[_thread_in_vm, id=35843, stack(0x0000000170c4c000,0x0000000170e4f000) (2060K)] Stack: [0x0000000170c4c000,0x0000000170e4f000], sp=0x0000000170e4dd00, free space=2055k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.dylib+0x1161294] VMError::report(outputStream*, bool)+0x1aac (compressedKlass.inline.hpp:88) V [libjvm.dylib+0x1164880] VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x548 V [libjvm.dylib+0x576b0c] print_error_for_unit_test(char const*, char const*, char*)+0x0 V [libjvm.dylib+0x70a898] CompressedKlassPointers::check_encodable(void const*)+0x0 V [libjvm.dylib+0x585d24] oopDesc::klass() const+0x104 V [libjvm.dylib+0x5afbec] Dependencies::check_call_site_target_value(oop, oop, CallSiteDepChange*)+0xcc V [libjvm.dylib+0x5b09e4] Dependencies::DepStream::check_call_site_dependency(CallSiteDepChange*)+0x84 V [libjvm.dylib+0xde663c] nmethod::check_dependency_on(DepChange&)+0x50 V [libjvm.dylib+0x5b291c] DependencyContext::mark_dependent_nmethods(DeoptimizationScope*, DepChange&)+0xa4 V [libjvm.dylib+0xda4dfc] MethodHandles::mark_dependent_nmethods(DeoptimizationScope*, Handle, Handle)+0xec V [libjvm.dylib+0xda7ea4] MHN_setCallSiteTargetNormal+0x304 j java.lang.invoke.MethodHandleNatives.setCallSiteTargetNormal(Ljava/lang/invoke/CallSite;Ljava/lang/invoke/MethodHandle;)V+0 java.base at 25-internal j java.lang.invoke.CallSite.setTargetNormal(Ljava/lang/invoke/MethodHandle;)V+7 java.base at 25-internal j java.lang.invoke.MutableCallSite.setTarget(Ljava/lang/invoke/MethodHandle;)V+2 java.base at 25-internal j compiler.jsr292.CallSiteDepContextTest.testGC(ZZ)V+267 j compiler.jsr292.CallSiteDepContextTest.main([Ljava/lang/String;)V+11 j java.lang.invoke.LambdaForm$DMH+0x00000c0000001c00.invokeStatic(Ljava/lang/Object;Ljava/lang/Object;)V+10 java.base at 25-internal j 
java.lang.invoke.LambdaForm$MH+0x00000c0000003400.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+33 java.base at 25-internal j java.lang.invoke.Invokers$Holder.invokeExact_MT(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+20 java.base at 25-internal j jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+55 java.base at 25-internal j jdk.internal.reflect.DirectMethodHandleAccessor.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+23 java.base at 25-internal j java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+102 java.base at 25-internal j com.sun.javatest.regtest.agent.MainWrapper$MainTask.run()V+134 j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base at 25-internal j java.lang.Thread.run()V+19 java.base at 25-internal v ~StubRoutines::call_stub 0x00000001160d0180 V [libjvm.dylib+0x88beec] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x448 V [libjvm.dylib+0x88ac08] JavaCalls::call_virtual(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, JavaThread*)+0x1c4 V [libjvm.dylib+0x88ada4] JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x6c V [libjvm.dylib+0xa1289c] thread_entry(JavaThread*, JavaThread*)+0x12c V [libjvm.dylib+0x8c2088] JavaThread::thread_main_inner()+0x1a8 V [libjvm.dylib+0x10ae844] Thread::call_run()+0xf4 V [libjvm.dylib+0xe467f0] thread_native_entry(Thread*)+0x138 C [libsystem_pthread.dylib+0x6f94] _pthread_start+0x88 Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) j java.lang.invoke.MethodHandleNatives.setCallSiteTargetNormal(Ljava/lang/invoke/CallSite;Ljava/lang/invoke/MethodHandle;)V+0 java.base at 25-internal j java.lang.invoke.CallSite.setTargetNormal(Ljava/lang/invoke/MethodHandle;)V+7 java.base at 25-internal j 
java.lang.invoke.MutableCallSite.setTarget(Ljava/lang/invoke/MethodHandle;)V+2 java.base at 25-internal j compiler.jsr292.CallSiteDepContextTest.testGC(ZZ)V+267 j compiler.jsr292.CallSiteDepContextTest.main([Ljava/lang/String;)V+11 j java.lang.invoke.LambdaForm$DMH+0x00000c0000001c00.invokeStatic(Ljava/lang/Object;Ljava/lang/Object;)V+10 java.base at 25-internal j java.lang.invoke.LambdaForm$MH+0x00000c0000003400.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+33 java.base at 25-internal j java.lang.invoke.Invokers$Holder.invokeExact_MT(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+20 java.base at 25-internal j jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+55 java.base at 25-internal j jdk.internal.reflect.DirectMethodHandleAccessor.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+23 java.base at 25-internal j java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+102 java.base at 25-internal j com.sun.javatest.regtest.agent.MainWrapper$MainTask.run()V+134 j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base at 25-internal j java.lang.Thread.run()V+19 java.base at 25-internal v ~StubRoutines::call_stub 0x00000001160d0180 Lock stack of current Java thread (top to bottom): ------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2630399291 From epeter at openjdk.org Mon Feb 3 09:44:53 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 3 Feb 2025 09:44:53 GMT Subject: RFR: 8346774: Use Predicate classes instead of Node classes In-Reply-To: References: Message-ID: On Wed, 22 Jan 2025 13:10:37 GMT, Christian Hagedorn wrote: > This small cleanup PR replaces a lot of usages of `Node` pointers, to pass around either the head (i.e. `IfNode`) or the tail (i.e. 
a success projection) of predicates, with actual `Predicate` classes. This simplifies the usages, readability and the logical flow, and enables more simplifications in the future, especially once we replace Template Assertion Predicates with a dedicated node. > > I've also included some minor refactorings like adding `const` or fixing typos. > > There are no semantic changes involved. The return value optimization should take care to avoid a lot of copies when returning new objects from methods. > > Thanks, > Christian Looks reasonable, I have a few minor suggestions. src/hotspot/share/opto/predicates.cpp line 187: > 185: } > 186: > 187: // Clone this Template Assertion Predicate and replace the input of the OpaqueLoopInitNode with 'new_opaque_input'. Looks like you now also create a `new OpaqueLoopInitNode`, so that is slightly inaccurate, right? src/hotspot/share/opto/predicates.cpp line 813: > 811: assertion_expression, > 812: template_assertion_predicate->assertion_predicate_type()); > 813: return InitializedAssertionPredicate(success_proj); Suggestion: IfTrueNode* success_proj = create_control_nodes(new_control, template_assertion_predicate->Opcode(), assertion_expression, template_assertion_predicate->assertion_predicate_type()); return InitializedAssertionPredicate(success_proj); ------------- PR Review: https://git.openjdk.org/jdk/pull/23234#pullrequestreview-2589358721 PR Review Comment: https://git.openjdk.org/jdk/pull/23234#discussion_r1939059062 PR Review Comment: https://git.openjdk.org/jdk/pull/23234#discussion_r1939065861 From epeter at openjdk.org Mon Feb 3 09:44:54 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 3 Feb 2025 09:44:54 GMT Subject: RFR: 8346774: Use Predicate classes instead of Node classes In-Reply-To: References: Message-ID: <0U11j2kpTLHH10-WfekmUSG4gXr7q5sJ2Ap82SZbv5Y=.ad3d30b9-a3e5-4a7c-8e84-7ff328eaba0b@github.com> On Mon, 3 Feb 2025 09:35:06 GMT, Emanuel Peter wrote: >> This small cleanup PR replaces a lot of usages 
of `Node` pointers, to pass around either the head (i.e. `IfNode`) or the tail (i.e. a success projection) of predicates, with actual `Predicate` classes. This simplifies the usages, readability and the logical flow, and enables more simplifications in the future, especially once we replace Template Assertion Predicates with a dedicated node. >> >> I've also included some minor refactorings like adding `const` or fixing typos. >> >> There are no semantic changes involved. The return value optimization should take care to avoid a lot of copies when returning new objects from methods. >> >> Thanks, >> Christian > > src/hotspot/share/opto/predicates.cpp line 813: > >> 811: assertion_expression, >> 812: template_assertion_predicate->assertion_predicate_type()); >> 813: return InitializedAssertionPredicate(success_proj); > > Suggestion: > > IfTrueNode* success_proj = create_control_nodes(new_control, > template_assertion_predicate->Opcode(), > assertion_expression, > template_assertion_predicate->assertion_predicate_type()); > return InitializedAssertionPredicate(success_proj); Indentation, and: If you split the args over lines, I would at least split all of them consistently. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23234#discussion_r1939067078 From thartmann at openjdk.org Mon Feb 3 09:53:53 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 3 Feb 2025 09:53:53 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: On Fri, 24 Jan 2025 20:37:32 GMT, Boris Ulasevich wrote: >> This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache. 
>> >> The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density. >> >> Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1-2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark. >> >> The numbers. Immutable data constitutes **~30%** of the nmethod. Mutable data constitutes **~8%** of the nmethod. Example (statistics collected on the CodeCacheStress benchmark): >> - nmethod_count:134000, total_compilation_time: 510460ms >> - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms, >> - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB >> >> Functional testing: jtreg on arm/aarch/x86. >> Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks. >> >> Alternative solution (see comments): In the future, relocations can be moved to _immutable_data. > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > Force the use of movk in combination with adrp and ldr instructions to address scenarios > where os::malloc allocates buffers beyond the typical ±4GB range accessible with adrp @stefank just pointed out that this could be a regression from [JDK-8347564](https://bugs.openjdk.org/browse/JDK-8347564) that just went in. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2630455813 From chagedorn at openjdk.org Mon Feb 3 10:34:47 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 3 Feb 2025 10:34:47 GMT Subject: RFR: 8346774: Use Predicate classes instead of Node classes [v2] In-Reply-To: References: Message-ID: > This small cleanup PR replaces a lot of usages of `Node` pointers, to pass around either the head (i.e. `IfNode`) or the tail (i.e.
a success projection) of predicates, with actual `Predicate` classes. This simplifies the usages, readability and the logical flow, and enables more simplifications in the future, especially once we replace Template Assertion Predicates with a dedicated node. > > I've also included some minor refactorings like adding `const` or fixing typos. > > There are no semantic changes involved. The return value optimization should take care to avoid a lot of copies when returning new objects from methods. > > Thanks, > Christian Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Review Emanuel - Merge branch 'master' into JDK-8346774 - more cleanups - more cleanups - 8346774: Use Predicate classes instead of Node classes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23234/files - new: https://git.openjdk.org/jdk/pull/23234/files/a44280ba..531763da Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23234&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23234&range=00-01 Stats: 39199 lines in 2918 files changed: 17712 ins; 11939 del; 9548 mod Patch: https://git.openjdk.org/jdk/pull/23234.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23234/head:pull/23234 PR: https://git.openjdk.org/jdk/pull/23234 From chagedorn at openjdk.org Mon Feb 3 10:34:48 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 3 Feb 2025 10:34:48 GMT Subject: RFR: 8346774: Use Predicate classes instead of Node classes [v2] In-Reply-To: <0U11j2kpTLHH10-WfekmUSG4gXr7q5sJ2Ap82SZbv5Y=.ad3d30b9-a3e5-4a7c-8e84-7ff328eaba0b@github.com> References: <0U11j2kpTLHH10-WfekmUSG4gXr7q5sJ2Ap82SZbv5Y=.ad3d30b9-a3e5-4a7c-8e84-7ff328eaba0b@github.com> Message-ID: <9nhku_9kXziy6MjCf3X-j_Dgzxe3gFekXOpOgshDoXI=.609317ec-7ff6-42d5-8cbf-2350278325ed@github.com> 
On Mon, 3 Feb 2025 09:35:57 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/predicates.cpp line 813: >> >>> 811: assertion_expression, >>> 812: template_assertion_predicate->assertion_predicate_type()); >>> 813: return InitializedAssertionPredicate(success_proj); >> >> Suggestion: >> >> IfTrueNode* success_proj = create_control_nodes(new_control, >> template_assertion_predicate->Opcode(), >> assertion_expression, >> template_assertion_predicate->assertion_predicate_type()); >> return InitializedAssertionPredicate(success_proj); > > Indentation, and: If you split the args over lines, I would at least split all of them consistently. I usually just make sure that long lines split but don't enforce a single arg per line. But I don't mind adapting to that here :-) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23234#discussion_r1939148365 From aph at openjdk.org Mon Feb 3 10:58:48 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 3 Feb 2025 10:58:48 GMT Subject: RFR: 8349106: Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64 In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 16:48:09 GMT, Jamil Nimeh wrote: > This enhancement makes a change to the ChaCha20 block function intrinsic on aarch64, moving away from the block parallel implementation and to the quarter-round parallel implementation that was done on x86_64. Assembly language profiling yielded an 11% improvement in throughput. When put together as an intrinsic and hooked into the JCE ChaCha20 cipher, the gains are more modest, somewhere in the 2-4% range depending on job size, but still an improvement. This looks very nice, and I'm tempted to just approve it as it is. My only concern is that the algorithm changes aren't really explained, but I guess what you have done here is the _128-Bit Vectorization_ in `https://eprint.iacr.org/2013/759.pdf`. Is that right? 
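For reference, both the block-parallel and the quarter-round-parallel layouts are built on the same scalar ChaCha20 quarter round. Below is a minimal sketch of that primitive in Java, following RFC 8439 section 2.1; it is not the aarch64 intrinsic, and the class and method names are hypothetical, chosen only for illustration.

```java
// Hypothetical illustration of the scalar ChaCha20 quarter round (RFC 8439, section 2.1).
// This is NOT the HotSpot intrinsic; names are made up for this sketch.
final class ChaChaQuarterRoundSketch {

    // Applies one quarter round in place to the state words at indices a, b, c, d.
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    // One double round on a 16-word state: four column rounds, then four diagonal rounds.
    // The four calls in each group touch disjoint words, so they are independent of each other.
    static void doubleRound(int[] s) {
        quarterRound(s, 0, 4, 8, 12);
        quarterRound(s, 1, 5, 9, 13);
        quarterRound(s, 2, 6, 10, 14);
        quarterRound(s, 3, 7, 11, 15);
        quarterRound(s, 0, 5, 10, 15);
        quarterRound(s, 1, 6, 11, 12);
        quarterRound(s, 2, 7, 8, 13);
        quarterRound(s, 3, 4, 9, 14);
    }

    public static void main(String[] args) {
        // Input words taken from the quarter-round test vector in RFC 8439, section 2.1.1.
        int[] s = {0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567};
        quarterRound(s, 0, 1, 2, 3);
        for (int w : s) {
            System.out.printf("%08x%n", w);
        }
    }
}
```

As far as I understand the two layouts, a quarter-round-parallel implementation keeps the four rows of one block in four vector registers and runs the four independent quarter rounds of a group in the SIMD lanes, while a block-parallel implementation puts the same state word of several blocks in the lanes.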
------------- PR Comment: https://git.openjdk.org/jdk/pull/23397#issuecomment-2630610061 From epeter at openjdk.org Mon Feb 3 11:33:49 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 3 Feb 2025 11:33:49 GMT Subject: RFR: 8346774: Use Predicate classes instead of Node classes [v2] In-Reply-To: References: Message-ID: <1tN6weAbqX2ycbIFKRhPKx_TqnouGIFCfYXNxWq6YnY=.b6c99a14-320b-4e33-b3bc-8fd6f8a6adcd@github.com> On Mon, 3 Feb 2025 10:34:47 GMT, Christian Hagedorn wrote: >> This small cleanup PR replaces a lot of usages of `Node` pointers, to pass around either the head (i.e. `IfNode`) or the tail (i.e. a success projection) of predicates, with actual `Predicate` classes. This simplifies the usages, readability and the logical flow, and enables more simplifications in the future, especially once we replace Template Assertion Predicates with a dedicated node. >> >> I've also included some minor refactorings like adding `const` or fixing typos. >> >> There are no semantic changes involved. The return value optimization should take care to avoid a lot of copies when returning new objects from methods. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Review Emanuel > - Merge branch 'master' into JDK-8346774 > - more cleanups > - more cleanups > - 8346774: Use Predicate classes instead of Node classes Looks good now! ------------- Marked as reviewed by epeter (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23234#pullrequestreview-2589633891 From chagedorn at openjdk.org Mon Feb 3 11:41:55 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 3 Feb 2025 11:41:55 GMT Subject: RFR: 8346774: Use Predicate classes instead of Node classes [v2] In-Reply-To: References: Message-ID: <8WOon2jY2Wy-0Coistl31AYuVcLnM8NiFwU7gexKLXA=.e051acf1-56ea-46d3-9792-897fbb72cb53@github.com> On Mon, 3 Feb 2025 10:34:47 GMT, Christian Hagedorn wrote: >> This small cleanup PR replaces a lot of usages of `Node` pointers, to pass around either the head (i.e. `IfNode`) or the tail (i.e. a success projection) of predicates, with actual `Predicate` classes. This simplifies the usages, readability and the logical flow, and enables more simplifications in the future, especially once we replace Template Assertion Predicates with a dedicated node. >> >> I've also included some minor refactorings like adding `const` or fixing typos. >> >> There are no semantic changes involved. The return value optimization should take care to avoid a lot of copies when returning new objects from methods. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Review Emanuel > - Merge branch 'master' into JDK-8346774 > - more cleanups > - more cleanups > - 8346774: Use Predicate classes instead of Node classes Thanks Emanuel for your review! I'm running some testing again with latest master and will then integrate it. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23234#issuecomment-2630703920 From aph at openjdk.org Mon Feb 3 11:44:56 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 3 Feb 2025 11:44:56 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: On Fri, 24 Jan 2025 20:37:32 GMT, Boris Ulasevich wrote: >> This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache. >> >> The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density. >> >> Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1-2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark. >> >> The numbers. Immutable data constitutes **~30%** of the nmethod. Mutable data constitutes **~8%** of the nmethod. Example (statistics collected on the CodeCacheStress benchmark): >> - nmethod_count:134000, total_compilation_time: 510460ms >> - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms, >> - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB >> >> Functional testing: jtreg on arm/aarch/x86. >> Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks. >> >> Alternative solution (see comments): In the future, relocations can be moved to _immutable_data. 
> > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > Force the use of movk in combination with adrp and ldr instructions to address scenarios > where os::malloc allocates buffers beyond the typical ?4GB range accessible with adrp src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1422: > 1420: bool force_movk = true; // movk is important if the target can be more than 4GB away > 1421: adrp(dest, const_addr, offset, force_movk); > 1422: ldr(dest, Address(dest, offset)); I wonder if this really is the best way to do it. It's not clear to me that there is any advantage of using `adrp` in this case rather than a simple `mov(scratch, const_adr); ldr(dest, Address(scratch);`. The `mov` would produce `movz; movk; movk` which almost certainly execute in a single cycle, then a load without an offset, which is a single micro-op rather than two micro-ops for load+offset. All we've gained for this complication is a small reduction in code density rather than a performance improvement. I'd go with simplicity. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1939240122 From mli at openjdk.org Mon Feb 3 12:32:48 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 3 Feb 2025 12:32:48 GMT Subject: RFR: 8347489: RISC-V: Misaligned memory access with COH [v7] In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 03:04:05 GMT, Fei Yang wrote: >> Hi, please consider this change. >> >> We have different base_offset for T_BYTE/T_CHAR (4-byte instead of 8-byte aligned) with COH. This causes misaligned memory accesses for several instrinsics like String.Compare or String.Equals. The reason is that we assume 8-byte alignment and process one 8-byte word starting at the first array element for each iteration in the main loop. As a result, we have performance regressions on platforms with slow misaligned memory accesses like Unmatched and Premier P550 SBCs. 
>> >> PS: Same issue is there even without COH. base_offset for T_BYTE/T_CHAR is 20 (thus 4-byte aligned) when `UseCompressedClassPointers` is disabled in this case. >> >> Correctness test on linux-riscv64: >> - [x] tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseCompactObjectHeaders") (release) >> - [x] tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:-UseCompactObjectHeaders") (release) >> - [x] hotspot:tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseCompactObjectHeaders") (fastdebug) >> - [x] hotspot:tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:-UseCompactObjectHeaders") (fastdebug) >> >> Performance test on Premier P550 (-XX:+AlwaysPreTouch -Xms8g -Xmx8g): >> >> SPECjbb2005: >> >> 1. Without Patch >> 1.1 -XX:+UseParallelGC -XX:-UseCompactObjectHeaders: 32666 >> 1.2 -XX:+UseParallelGC -XX:+UseCompactObjectHeaders: 27610 >> 1.3 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: 30911 >> 1.4 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: 26008 >> >> 2. With Patch >> 2.1 -XX:+UseParallelGC -XX:-UseCompactObjectHeaders: 32820 >> 2.2 -XX:+UseParallelGC -XX:+UseCompactObjectHeaders: 34179 >> 2.3 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: 30620 >> 2.4 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: 31936 >> >> >> SPECjbb2015: >> >> 1. Without Patch >> 1.1 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: max-jOPS = 1444, critical-jOPS = 431 >> 1.2 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: max-jOPS = 1092, critical-jOPS = 335 >> >> 2. With Patch >> 2.1 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: max-jOPS = 1452, critical-jOPS = 419 >> 2.2 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: max-jOPS = 1438, critical-jOPS = 477 > > Fei Yang has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 10 commits: > > - Merge branch 'master' into JDK-8347489 > - Review comment > - Review comment > - Merge branch 'master' into JDK-8347489 > - Merge branch 'master' into JDK-8347489 > - Comment > - Fix assertions > - Add assertions > - Comment > - 8347489: RISC-V: Misaligned memory access with COH Looks good, thanks for updating. Just one minor comment. src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1463: > 1461: { > 1462: if (str1_isL == str2_isL) { // LL or UU > 1463: #ifdef ASSERT Can we add the same comment about 8-byte alignment below at line 1520? I know it's redundant, but as the code is getting more complicated, it might be good to add it to improve readability. ------------- Marked as reviewed by mli (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23053#pullrequestreview-2589753525 PR Review Comment: https://git.openjdk.org/jdk/pull/23053#discussion_r1939298480 From stefank at openjdk.org Mon Feb 3 12:38:48 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Mon, 3 Feb 2025 12:38:48 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: On Mon, 3 Feb 2025 09:50:58 GMT, Tobias Hartmann wrote: > @stefank just pointed out that this could be a regression from [JDK-8347564](https://bugs.openjdk.org/browse/JDK-8347564) that just went in. I thought that it could have been a regression, but after looking closer I don't think it is. However, I do see a problem: the code removes the GC barriers when loading oops. I'll add an inline comment. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2630828942 From stefank at openjdk.org Mon Feb 3 12:46:17 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Mon, 3 Feb 2025 12:46:17 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: On Fri, 24 Jan 2025 20:37:32 GMT, Boris Ulasevich wrote: >> This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache. >> >> The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density. >> >> Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1-2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark. >> >> The numbers. Immutable data constitutes **~30%** of the nmethod. Mutable data constitutes **~8%** of the nmethod. Example (statistics collected on the CodeCacheStress benchmark): >> - nmethod_count:134000, total_compilation_time: 510460ms >> - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms, >> - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB >> >> Functional testing: jtreg on arm/aarch/x86. >> Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks. >> >> Alternative solution (see comments): In the future, relocations can be moved to _immutable_data. 
> > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > Force the use of movk in combination with adrp and ldr instructions to address scenarios > where os::malloc allocates buffers beyond the typical ?4GB range accessible with adrp Changes requested by stefank (Reviewer). src/hotspot/share/code/nmethod.cpp line 2162: > 2160: return nullptr; > 2161: } > 2162: return RawAccess<>::oop_load(oop_addr_at(index)); This change is removing the GC barriers and is likely the cause of the ZGC crash that Tobias listed. However, the fix is not as simple as to just reinstate the NMethodAccess call. The ZGC code uses the `oop*` to find the associated `nmethod` in the code cache. We need another way to fetch the nmethod now. So, I'm experimenting with a small change to switch out the Access API call to a direct GC barrier set call and then I pass down the `this` pointer from this function. With that you should be able skip this change. With that said, what was the motivation for changing this? ------------- PR Review: https://git.openjdk.org/jdk/pull/21276#pullrequestreview-2589786147 PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1939317498 From aph at openjdk.org Mon Feb 3 13:56:58 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 3 Feb 2025 13:56:58 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: On Fri, 24 Jan 2025 20:37:32 GMT, Boris Ulasevich wrote: >> This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache. >> >> The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. 
The primary motivations are improving security and enhancing code density. >> >> Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1-2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark. >> >> The numbers. Immutable data constitutes **~30%** of the nmethod. Mutable data constitutes **~8%** of the nmethod. Example (statistics collected on the CodeCacheStress benchmark): >> - nmethod_count:134000, total_compilation_time: 510460ms >> - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms, >> - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB >> >> Functional testing: jtreg on arm/aarch/x86. >> Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks. >> >> Alternative solution (see comments): In the future, relocations can be moved to _immutable_data. > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > Force the use of movk in combination with adrp and ldr instructions to address scenarios > where os::malloc allocates buffers beyond the typical ±4GB range accessible with adrp src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 328: > 326: > 327: // Maybe we have a third instruction: adrp + movk + ldr with offset (e.g. ldr_patchable) > 328: uint32_t insn3 = insn_at(insn_addr, 2); As discussed below, it's better not to do this. `adrp()` was never intended to be used in this way, and it's not so efficient to use adrp + movk + ldr with offset rather than movz; movk; movk; ldr (no offset). 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1939415174 From adinn at openjdk.org Mon Feb 3 14:19:51 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 3 Feb 2025 14:19:51 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: On Mon, 3 Feb 2025 11:42:14 GMT, Andrew Haley wrote: >> Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: >> >> Force the use of movk in combination with adrp and ldr instructions to address scenarios >> where os::malloc allocates buffers beyond the typical ±4GB range accessible with adrp > > src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1422: > >> 1420: bool force_movk = true; // movk is important if the target can be more than 4GB away >> 1421: adrp(dest, const_addr, offset, force_movk); >> 1422: ldr(dest, Address(dest, offset)); > > I wonder if this really is the best way to do it. It's not clear to me that there is any advantage of using `adrp` in this case rather than a simple `mov(scratch, const_addr); ldr(dest, Address(scratch));`. The `mov` would produce `movz; movk; movk` which almost certainly execute in a single cycle, then a load without an offset, which is a single micro-op rather than two micro-ops for load+offset. All we've gained for this complication is a small reduction in code density rather than a performance improvement. I'd go with simplicity. Yes, I agree. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1939450267 From jbhateja at openjdk.org Mon Feb 3 14:21:31 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 3 Feb 2025 14:21:31 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets [v2] In-Reply-To: References: Message-ID: > Math.copySign is only intrinsified on x86 targets supporting the AVX512 feature. > Intel E-core Xeons support only the AVX2 feature set and still compile Java implementation which is composed of logical operations. > > Since there is a 3-cycle penalty for copying incoming float/double values to GPRs before being operated upon by logical operation there is an opportunity to optimize this using an efficient instruction sequence. > > Patch uses ANDPS and ANDPD logical instruction to generate efficient instruction sequences to absorb domain copy over penalty. Also, performs minor tuning for existing AVX512 instruction sequence based on VPTERNLOG instruction. 
> > Following are the performance numbers of the following existing microbenchmark > https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/vm/compiler/Signum.java > > Patch passes following validation test > [test/jdk/java/lang/Math/IeeeRecommendedTests.java > ](https://github.com/openjdk/jdk/blob/master/test/jdk/java/lang/Math/IeeeRecommendedTests.java) > > > Granite Rapids-AP (P-core Xeon) > Baseline AVX512: > Benchmark Mode Cnt Score Error Units > Signum._5_copySignFloatTest thrpt 2 1296.141 ops/ns > Signum._7_copySignDoubleTest thrpt 2 838.954 ops/ns > > Withopt : > Benchmark Mode Cnt Score Error Units > Signum._5_copySignFloatTest thrpt 2 940.240 ops/ns > Signum._7_copySignDoubleTest thrpt 2 967.370 ops/ns > > Baseline AVX2: > Benchmark Mode Cnt Score Error Units > Signum._5_copySignFloatTest thrpt 2 63.673 ops/ns > Signum._7_copySignDoubleTest thrpt 2 26.898 ops/ns > > Withopt : > Benchmark Mode Cnt Score Error Units > Signum._5_copySignFloatTest thrpt 2 785.801 ops/ns > Signum._7_copySignDoubleTest thrpt 2 558.710 ops/ns > > Sierra Forest (E-core Xeon) > Baseline: > Benchmark (seed) Mode Cnt Score Error Units > o.o.b.vm.compiler.Signum._5_copySignFloatTest N/A thrpt 2 40.528 ops/ns > o.o.b.vm.compiler.Signum._7_copySignDoubleTest N/A thrpt 2 25.101 ops/ns > > Withopt: > Benchmark (seed) Mode Cnt Score Error Units > o.o.b.vm.compiler.Signum._5_copySignFloatTest N/A thrpt 2 676.101 ops/ns > o.o.b.vm.compiler.Signum._7_copySignDoubleTest N/A thrpt 2 ... 
Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Adding IR framework verification test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23386/files - new: https://git.openjdk.org/jdk/pull/23386/files/d620eb65..2181850d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23386&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23386&range=00-01 Stats: 166 lines in 3 files changed: 154 ins; 0 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/23386.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23386/head:pull/23386 PR: https://git.openjdk.org/jdk/pull/23386 From jbhateja at openjdk.org Mon Feb 3 14:21:32 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 3 Feb 2025 14:21:32 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 14:17:26 GMT, Jatin Bhateja wrote: >> Math.copySign is only intrinsified on x86 targets supporting the AVX512 feature. >> Intel E-core Xeons support only the AVX2 feature set and still compile Java implementation which is composed of logical operations. >> >> Since there is a 3-cycle penalty for copying incoming float/double values to GPRs before being operated upon by logical operation there is an opportunity to optimize this using an efficient instruction sequence. >> >> Patch uses ANDPS and ANDPD logical instruction to generate efficient instruction sequences to absorb domain copy over penalty. Also, performs minor tuning for existing AVX512 instruction sequence based on VPTERNLOG instruction. 
>> >> Following are the performance numbers of the following existing microbenchmark >> https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/vm/compiler/Signum.java >> >> Patch passes following validation test >> [test/jdk/java/lang/Math/IeeeRecommendedTests.java >> ](https://github.com/openjdk/jdk/blob/master/test/jdk/java/lang/Math/IeeeRecommendedTests.java) >> >> >> Granite Rapids-AP (P-core Xeon) >> Baseline AVX512: >> Benchmark Mode Cnt Score Error Units >> Signum._5_copySignFloatTest thrpt 2 1296.141 ops/ns >> Signum._7_copySignDoubleTest thrpt 2 838.954 ops/ns >> >> Withopt : >> Benchmark Mode Cnt Score Error Units >> Signum._5_copySignFloatTest thrpt 2 940.240 ops/ns >> Signum._7_copySignDoubleTest thrpt 2 967.370 ops/ns >> >> Baseline AVX2: >> Benchmark Mode Cnt Score Error Units >> Signum._5_copySignFloatTest thrpt 2 63.673 ops/ns >> Signum._7_copySignDoubleTest thrpt 2 26.898 ops/ns >> >> Withopt : >> Benchmark Mode Cnt Score Error Units >> Signum._5_copySignFloatTest thrpt 2 785.801 ops/ns >> Signum._7_copySignDoubleTest thrpt 2 558.710 ops/ns >> >> Sierra Forest (E-core Xeon) >> Baseline: >> Benchmark (seed) Mode Cnt Score Error Units >> o.o.b.vm.compiler.Signum._5_copySignFloatTest N/A thrpt 2 40.528 ops/ns >> o.o.b.vm.compiler.Signum._7_copySignDoubleTest N/A thrpt 2 25.101 ops/ns >> >> Withopt: >> Benchmark (seed) Mode Cnt Score Error Units >> o.o.b.vm.compiler.Signum._5_copySignFloatTest N/A thrpt 2 676.... > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Adding IR framework verification test src/hotspot/cpu/x86/x86.ad line 6745: > 6743: format %{ "CopySignF $dst, $src\t! 
using $xtmp1 as TEMP" %} > 6744: ins_encode %{ > 6745: __ vpcmpeqd($xtmp1$$XMMRegister, $xtmp1$$XMMRegister, $xtmp1$$XMMRegister, Assembler::AVX_128bit); If any of the vector operands is from a higher register bank (16-31) then we need an EVEX encoding and in such a case, the result of the comparison is always an opmask register. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23386#discussion_r1939266632 From jbhateja at openjdk.org Mon Feb 3 14:21:32 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 3 Feb 2025 14:21:32 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets [v2] In-Reply-To: References: Message-ID: On Sat, 1 Feb 2025 22:19:57 GMT, Jasmine Karthikeyan wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Adding IR framework verification test > > src/hotspot/cpu/x86/x86.ad line 6769: > >> 6767: >> 6768: instruct copySignF_reg_avx(regF dst, regF src, regF xtmp1, regF xtmp2) %{ >> 6769: predicate(!VM_Version::supports_avx512vl()); > > Suggestion: > > predicate(UseAVX > 0 && !VM_Version::supports_avx512vl()); > > Just to be a bit more explicit (and same for the one below). It's already handled by the match_rule_supported constraint. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23386#discussion_r1939268134 From qamai at openjdk.org Mon Feb 3 14:39:51 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 3 Feb 2025 14:39:51 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 14:21:31 GMT, Jatin Bhateja wrote: >> Math.copySign is only intrinsified on x86 targets supporting the AVX512 feature. >> Intel E-core Xeons support only the AVX2 feature set and still compile the Java implementation which is composed of logical operations. 
>> >> Since there is a 3-cycle penalty for copying incoming float/double values to GPRs before being operated upon by logical operation there is an opportunity to optimize this using an efficient instruction sequence. >> >> Patch uses ANDPS and ANDPD logical instruction to generate efficient instruction sequences to absorb domain copy over penalty. Also, performs minor tuning for existing AVX512 instruction sequence based on VPTERNLOG instruction. >> >> Following are the performance numbers of the following existing microbenchmark >> https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/vm/compiler/Signum.java >> >> Patch passes following validation test >> [test/jdk/java/lang/Math/IeeeRecommendedTests.java >> ](https://github.com/openjdk/jdk/blob/master/test/jdk/java/lang/Math/IeeeRecommendedTests.java) >> >> >> Granite Rapids-AP (P-core Xeon) >> Baseline AVX512: >> Benchmark Mode Cnt Score Error Units >> Signum._5_copySignFloatTest thrpt 2 1296.141 ops/ns >> Signum._7_copySignDoubleTest thrpt 2 838.954 ops/ns >> >> Withopt : >> Benchmark Mode Cnt Score Error Units >> Signum._5_copySignFloatTest thrpt 2 940.240 ops/ns >> Signum._7_copySignDoubleTest thrpt 2 967.370 ops/ns >> >> Baseline AVX2: >> Benchmark Mode Cnt Score Error Units >> Signum._5_copySignFloatTest thrpt 2 63.673 ops/ns >> Signum._7_copySignDoubleTest thrpt 2 26.898 ops/ns >> >> Withopt : >> Benchmark Mode Cnt Score Error Units >> Signum._5_copySignFloatTest thrpt 2 785.801 ops/ns >> Signum._7_copySignDoubleTest thrpt 2 558.710 ops/ns >> >> Sierra Forest (E-core Xeon) >> Baseline: >> Benchmark (seed) Mode Cnt Score Error Units >> o.o.b.vm.compiler.Signum._5_copySignFloatTest N/A thrpt 2 40.528 ops/ns >> o.o.b.vm.compiler.Signum._7_copySignDoubleTest N/A thrpt 2 25.101 ops/ns >> >> Withopt: >> Benchmark (seed) Mode Cnt Score Error Units >> o.o.b.vm.compiler.Signum._5_copySignFloatTest N/A thrpt 2 676.... 
> > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Adding IR framework verification test Could you instead do this by trying to transform `AndI(MoveF2I(x), MoveF2I(y))` into `AndF(x, y)` instead? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23386#issuecomment-2631182729 From roland at openjdk.org Mon Feb 3 14:43:30 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 3 Feb 2025 14:43:30 GMT Subject: RFR: 8333697: C2: Hit MemLimit in PhaseCFG::global_code_motion [v2] In-Reply-To: References: Message-ID: > I investigated the failure from the `Test.java` that's attached to the > bug. The failure with this test is only reproducible up to 8334060 > (Implementation of Late Barrier Expansion for G1) so experiments I > describe here are from the source code for the commit right before it. > > Peak malloc memory usage reported by NMT is: 1.3GB > > `PhaseCFG::global_code_motion()`, when `OptoRegScheduling` is true, > creates a `PhaseIFG` that's, when initialized, allocates `_adjs`: a > `maxlrg` array of `IndexSet`s that can contain up to `maxlrg`. > > `maxlrg` in this case is 122839. An `IndexSet` is an array of pointers > to a 256 bit bitset: one `IndexSet` array needs: > > > 122839 / 256 * 8 = 3832 > > > and there are of 122839: > > > 3832 * 122839 = ~470 MB > > > It turns out the `PhaseIFG` object when used from > `PhaseCFG::global_code_motion()` doesn't even use the `_adjs` > array. So a patch like: > > > diff --git a/src/hotspot/share/opto/chaitin.hpp b/src/hotspot/share/opto/chaitin.hpp > index cf02deb6019..4e5333bf181 100644 > --- a/src/hotspot/share/opto/chaitin.hpp > +++ b/src/hotspot/share/opto/chaitin.hpp > @@ -258,7 +258,7 @@ class PhaseIFG : public Phase { > VectorSet *_yanked; > > PhaseIFG( Arena *arena ); > - void init( uint maxlrg ); > + void init( uint maxlrg, bool no_adjs = false ); > > // Add edge between a and b. Returns true if actually added. 
> int add_edge( uint a, uint b ); > diff --git a/src/hotspot/share/opto/gcm.cpp b/src/hotspot/share/opto/gcm.cpp > index ebdefe597ff..fefd75a88c5 100644 > --- a/src/hotspot/share/opto/gcm.cpp > +++ b/src/hotspot/share/opto/gcm.cpp > @@ -1704,7 +1704,9 @@ void PhaseCFG::global_code_motion() { > rm_live.reset_to_mark(); // Reclaim working storage > IndexSet::reset_memory(C, &live_arena); > uint node_size = regalloc._lrg_map.max_lrg_id(); > - ifg.init(node_size); // Empty IFG > + ifg.init(node_size, true); // Empty IFG > regalloc.set_ifg(ifg); > regalloc.set_live(live); > regalloc.gather_lrg_masks(false); // Collect LRG masks > diff --git a/src/hotspot/share/opto/ifg.cpp b/src/hotspot/share/opto/ifg.cpp > index d12698121b9..e42121c2254 100644 > --- a/src/hotspot/share/opto/ifg.cpp > +++ b/src/hotspot/share/opto/ifg.cpp > @@ -42,18 +42,24 @@ > PhaseIFG::PhaseIFG( Arena *arena ) : Phase(Interference_Graph), _arena(arena) { > } > > -void PhaseIFG::init( uint maxlrg ) { > +void PhaseIFG::init( uint maxlrg, bool no_adjs ) { > _maxlrg = maxlrg; > _yanked = new (_arena) VectorSet(_arena); > _is_square = false; > // Make uninitialized adjacency lists > - ... Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: - review - Merge branch 'master' into JDK-8333697 - fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23075/files - new: https://git.openjdk.org/jdk/pull/23075/files/7bdb8c41..a96aa572 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23075&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23075&range=00-01 Stats: 60570 lines in 3664 files changed: 28295 ins; 16492 del; 15783 mod Patch: https://git.openjdk.org/jdk/pull/23075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23075/head:pull/23075 PR: https://git.openjdk.org/jdk/pull/23075 From roland at openjdk.org Mon Feb 3 14:56:49 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 3 Feb 2025 14:56:49 GMT Subject: RFR: 8333697: C2: Hit MemLimit in PhaseCFG::global_code_motion [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 14:43:30 GMT, Roland Westrelin wrote: >> I investigated the failure from the `Test.java` that's attached to the >> bug. The failure with this test is only reproducible up to 8334060 >> (Implementation of Late Barrier Expansion for G1) so experiments I >> describe here are from the source code for the commit right before it. >> >> Peak malloc memory usage reported by NMT is: 1.3GB >> >> `PhaseCFG::global_code_motion()`, when `OptoRegScheduling` is true, >> creates a `PhaseIFG` that, when initialized, allocates `_adjs`: an >> array of `maxlrg` `IndexSet`s, each of which can contain up to `maxlrg` elements. >> >> `maxlrg` in this case is 122839. An `IndexSet` is an array of pointers >> to 256-bit bitsets: one `IndexSet` array needs: >> >> >> 122839 / 256 * 8 = 3832 bytes >> >> >> and there are 122839 of them: >> >> >> 3832 * 122839 = ~470 MB >> >> >> It turns out the `PhaseIFG` object when used from >> `PhaseCFG::global_code_motion()` doesn't even use the `_adjs` >> array. 
So a patch like: >> >> >> diff --git a/src/hotspot/share/opto/chaitin.hpp b/src/hotspot/share/opto/chaitin.hpp >> index cf02deb6019..4e5333bf181 100644 >> --- a/src/hotspot/share/opto/chaitin.hpp >> +++ b/src/hotspot/share/opto/chaitin.hpp >> @@ -258,7 +258,7 @@ class PhaseIFG : public Phase { >> VectorSet *_yanked; >> >> PhaseIFG( Arena *arena ); >> - void init( uint maxlrg ); >> + void init( uint maxlrg, bool no_adjs = false ); >> >> // Add edge between a and b. Returns true if actually added. >> int add_edge( uint a, uint b ); >> diff --git a/src/hotspot/share/opto/gcm.cpp b/src/hotspot/share/opto/gcm.cpp >> index ebdefe597ff..fefd75a88c5 100644 >> --- a/src/hotspot/share/opto/gcm.cpp >> +++ b/src/hotspot/share/opto/gcm.cpp >> @@ -1704,7 +1704,9 @@ void PhaseCFG::global_code_motion() { >> rm_live.reset_to_mark(); // Reclaim working storage >> IndexSet::reset_memory(C, &live_arena); >> uint node_size = regalloc._lrg_map.max_lrg_id(); >> - ifg.init(node_size); // Empty IFG >> + ifg.init(node_size, true); // Empty IFG >> regalloc.set_ifg(ifg); >> regalloc.set_live(live); >> regalloc.gather_lrg_masks(false); // Collect LRG masks >> diff --git a/src/hotspot/share/opto/ifg.cpp b/src/hotspot/share/opto/ifg.cpp >> index d12698121b9..e42121c2254 100644 >> --- a/src/hotspot/share/opto/ifg.cpp >> +++ b/src/hotspot/share/opto/ifg.cpp >> @@ -42,18 +42,24 @@ >> PhaseIFG::PhaseIFG( Arena *arena ) : Phase(Interference_Graph), _arena(arena) { >> } >> >> -void PhaseIFG::init( uint maxlrg ) { >> +void PhaseIFG::init( uint maxlrg, bool no_adjs ) { >> ... > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - review > - Merge branch 'master' into JDK-8333697 > - fix Thanks for reviewing this. > Could you please merge with master? Done. 
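For anyone who wants to re-check the `_adjs` estimate quoted above, here is a minimal sketch of the arithmetic. It assumes 8-byte pointers and one pointer per 256-bit block, as stated in the message; the class and method names (`AdjsEstimate`, `bytesPerIndexSet`, `totalBytes`) are made up for illustration and are not HotSpot code.

```java
// Back-of-the-envelope check of the _adjs estimate from the message.
// Assumptions (not taken from HotSpot sources): 8-byte pointers and
// one pointer per 256-bit bitset block.
final class AdjsEstimate {
    static final long BITS_PER_BLOCK = 256;
    static final long POINTER_BYTES = 8;

    // Bytes of block pointers needed by one IndexSet that can hold
    // up to maxlrg elements (integer division, as in the message).
    static long bytesPerIndexSet(long maxlrg) {
        return maxlrg / BITS_PER_BLOCK * POINTER_BYTES;
    }

    // Bytes for the whole _adjs array: one IndexSet per live range.
    static long totalBytes(long maxlrg) {
        return bytesPerIndexSet(maxlrg) * maxlrg;
    }

    public static void main(String[] args) {
        long maxlrg = 122839;
        System.out.println(bytesPerIndexSet(maxlrg)); // 3832
        System.out.println(totalBytes(maxlrg));       // 470719048, i.e. ~470 MB
    }
}
```

The cost grows quadratically with `maxlrg`, which is why skipping the allocation when `_adjs` is unused matters for very large methods.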
------------- PR Comment: https://git.openjdk.org/jdk/pull/23075#issuecomment-2631227463 From roland at openjdk.org Mon Feb 3 14:56:50 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 3 Feb 2025 14:56:50 GMT Subject: RFR: 8333697: C2: Hit MemLimit in PhaseCFG::global_code_motion [v2] In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 09:11:17 GMT, Tobias Hartmann wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - review >> - Merge branch 'master' into JDK-8333697 >> - fix > > src/hotspot/share/opto/indexSet.hpp line 333: > >> 331: void initialize(uint max_element, Arena *arena); >> 332: >> 333: void initialize_if_needed() { > > A comment would be nice. I added a comment with the new commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23075#discussion_r1939515863 From aph at openjdk.org Mon Feb 3 15:28:53 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 3 Feb 2025 15:28:53 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: On Fri, 24 Jan 2025 20:37:32 GMT, Boris Ulasevich wrote: >> This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache. >> >> The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density. 
>> >> Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1-2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark. >> >> The numbers. Immutable data constitutes **~30%** of the nmethod. Mutable data constitutes **~8%** of the nmethod. Example (statistics collected on the CodeCacheStress benchmark): >> - nmethod_count:134000, total_compilation_time: 510460ms >> - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms, >> - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB >> >> Functional testing: jtreg on arm/aarch/x86. >> Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks. >> >> Alternative solution (see comments): In the future, relocations can be moved to _immutable_data. > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > Force the use of movk in combination with adrp and ldr instructions to address scenarios > where os::malloc allocates buffers beyond the typical ±4GB range accessible with adrp When using Shenandoah I am still seeing embedded compressed OOPs in the instruction stream. Is that correct? These form two instructions, like so: movz(dst, 0xDEAD, 16); movk(dst, 0xBEEF); Do you want compressed OOPs to be moved out of CodeCache as well as uncompressed OOPs? If so, you should change `loadConNNode` in C2. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2631324482 From jpai at openjdk.org Mon Feb 3 15:53:17 2025 From: jpai at openjdk.org (Jaikiran Pai) Date: Mon, 3 Feb 2025 15:53:17 GMT Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null Message-ID: Can I please get a review of this change which backs out the commit that was introduced for https://bugs.openjdk.org/browse/JDK-8333893? 
The comment in the PR review of that issue https://github.com/openjdk/jdk/pull/19626#issuecomment-2628937397 explains what the issue is with the change that was integrated. This backout was done using `git revert 5890d9438bbde88b89070052926a2eafe13d7b42`, but the revert wasn't clean and I had to resolve a trivial conflict in `StringLatin1.java`. tier1, tier2 and tier3 testing is currently in progress with this change. Once this is integrated into mainline, a corresponding backport will be done to `jdk24` branch with the requisite approvals. ------------- Commit messages: - Revert "8333893: Optimization for StringBuilder append boolean & null" Changes: https://git.openjdk.org/jdk/pull/23420/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23420&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349183 Stats: 133 lines in 5 files changed: 18 ins; 79 del; 36 mod Patch: https://git.openjdk.org/jdk/pull/23420.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23420/head:pull/23420 PR: https://git.openjdk.org/jdk/pull/23420 From jnimeh at openjdk.org Mon Feb 3 16:16:46 2025 From: jnimeh at openjdk.org (Jamil Nimeh) Date: Mon, 3 Feb 2025 16:16:46 GMT Subject: RFR: 8349106: Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64 In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 10:56:28 GMT, Andrew Haley wrote: >> This enhancement makes a change to the ChaCha20 block function intrinsic on aarch64, moving away from the block parallel implementation and to the quarter-round parallel implementation that was done on x86_64. Assembly language profiling yielded an 11% improvement in throughput. When put together as an intrinsic and hooked into the JCE ChaCha20 cipher, the gains are more modest, somewhere in the 2-4% range depending on job size, but still an improvement. > > This looks very nice, and I'm tempted to just approve it as it is. 
My only concern is that the algorithm changes aren't really explained, but I guess what you have done here is the _128-Bit Vectorization_ in `https://eprint.iacr.org/2013/759.pdf`. Is that right? Hi @theRealAph, thanks for taking a look at the changes. Actually, I hadn't read that paper believe it or not, though it certainly looks like what I'm doing. When I was prototyping this a few years ago in x86_64 assembly it just occurred to me that directly loading the state into 4 consecutive vectors had everything align in the columnar organization such that we could do the double-round right off the bat (which must have been intentional in the design of the cipher). Then all we had to do was do the lane rotation leftward 1/2/3 for the second/third/fourth vectors and do the quarter rounds again. And longer vectors for AVX2, AVX-512, or using more vectors like on aarch64 allowed me to do more blocks at one time. It was long enough ago that I don't recall exactly why, but when I did the quarter-round and block parallel versions in aarch64 assembly initially, it seemed like they were pretty comparable in speed. At the time, I interpreted that to mean that the gains in block-parallel by not needing to do lane rotations was offset by having to gather/scatter each 32-bit state value from different vectors at 64-byte offsets, and it more or less balanced out. When I went back to look at these two approaches, I think I was able to tweak things on the quarter-round parallel version to make loads and stores a little friendlier (one ld1 and a bunch of register-to-register movs vs. 4 ld4r ops, and 4 st1 ops vs. 16 st4 ops). In terms of explaining the algorithm changes, I could add some comment text to the header of the stub function that better explains the general idea behind what is being done. It would certainly help anyone maintaining it down the line (myself included). 
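To make the "quarter-round parallel" idea described above concrete, here is a small plain-Java sketch. It is only an illustration of the data layout, not the AArch64 stub from the PR: the four rows of the 4x4 state play the role of four vector registers, a lane-wise quarter round handles all four columns at once, and rotating the second, third and fourth rows left by 1, 2 and 3 lanes lines the diagonals up as columns. The class and method names (`ChaChaRows`, `rowDoubleRound`, and so on) are made up for this example; the scalar version follows the quarter-round order in RFC 8439.

```java
import java.util.Arrays;

// Illustration only: scalar vs. row-wise ("quarter-round parallel")
// ChaCha20 double round. All names are hypothetical, not JDK code.
final class ChaChaRows {
    // Scalar quarter round on state indices (a, b, c, d), as in RFC 8439.
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    // Reference double round: four column rounds, then four diagonal rounds.
    static void scalarDoubleRound(int[] s) {
        quarterRound(s, 0, 4, 8, 12);
        quarterRound(s, 1, 5, 9, 13);
        quarterRound(s, 2, 6, 10, 14);
        quarterRound(s, 3, 7, 11, 15);
        quarterRound(s, 0, 5, 10, 15);
        quarterRound(s, 1, 6, 11, 12);
        quarterRound(s, 2, 7, 8, 13);
        quarterRound(s, 3, 4, 9, 14);
    }

    // Lane-wise quarter round over four 4-lane rows; each loop iteration
    // stands in for one SIMD lane.
    static void rowQuarterRound(int[] a, int[] b, int[] c, int[] d) {
        for (int i = 0; i < 4; i++) {
            a[i] += b[i]; d[i] = Integer.rotateLeft(d[i] ^ a[i], 16);
            c[i] += d[i]; b[i] = Integer.rotateLeft(b[i] ^ c[i], 12);
            a[i] += b[i]; d[i] = Integer.rotateLeft(d[i] ^ a[i], 8);
            c[i] += d[i]; b[i] = Integer.rotateLeft(b[i] ^ c[i], 7);
        }
    }

    // Rotate the lanes of a row left by n positions.
    static int[] rotateLanesLeft(int[] row, int n) {
        int[] out = new int[4];
        for (int i = 0; i < 4; i++) {
            out[i] = row[(i + n) & 3];
        }
        return out;
    }

    // Double round on the row layout: columns first, then diagonals
    // after rotating rows 1..3 left by 1/2/3 lanes, then rotate back.
    static void rowDoubleRound(int[] s) {
        int[] a = Arrays.copyOfRange(s, 0, 4);
        int[] b = Arrays.copyOfRange(s, 4, 8);
        int[] c = Arrays.copyOfRange(s, 8, 12);
        int[] d = Arrays.copyOfRange(s, 12, 16);
        rowQuarterRound(a, b, c, d);
        b = rotateLanesLeft(b, 1);
        c = rotateLanesLeft(c, 2);
        d = rotateLanesLeft(d, 3);
        rowQuarterRound(a, b, c, d);
        b = rotateLanesLeft(b, 3);
        c = rotateLanesLeft(c, 2);
        d = rotateLanesLeft(d, 1);
        System.arraycopy(a, 0, s, 0, 4);
        System.arraycopy(b, 0, s, 4, 4);
        System.arraycopy(c, 0, s, 8, 4);
        System.arraycopy(d, 0, s, 12, 4);
    }
}
```

Running both versions on the same 16-word state should give identical results, which is a cheap way to convince oneself that the lane rotations select the right diagonals.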
------------- PR Comment: https://git.openjdk.org/jdk/pull/23397#issuecomment-2631451246 From jpai at openjdk.org Mon Feb 3 16:37:09 2025 From: jpai at openjdk.org (Jaikiran Pai) Date: Mon, 3 Feb 2025 16:37:09 GMT Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null [v2] In-Reply-To: References: Message-ID: > Can I please get a review of this change which backs out the commit that was introduced for https://bugs.openjdk.org/browse/JDK-8333893? > > The comment in the PR review of that issue https://github.com/openjdk/jdk/pull/19626#issuecomment-2628937397 explains what the issue is with the change that was integrated. Furthermore, one part of that original change introduced a few internal methods. These internal methods were then used in few other places within the JDK through https://bugs.openjdk.org/browse/JDK-8343650. As a result, this backout PR also reverts the change that was done in JDK-8343650. > > The backout was done as follows, using `git revert` against the 2 relevant commits: > > > git revert 74ae3c688b37e693e20eb4e17c631897c5464400 > git revert 5890d9438bbde88b89070052926a2eafe13d7b42 > > The revert of `5890d9438bbde88b89070052926a2eafe13d7b42` wasn't clean and I had to resolve a trivial conflict in `StringLatin1.java`. > > tier1, tier2 and tier3 testing is currently in progress with this change. Once this is integrated into mainline, a corresponding backport will be done to `jdk24` branch with the requisite approvals. Jaikiran Pai has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains two new commits since the last revision: - Revert "8333893: Optimization for StringBuilder append boolean & null" This reverts commit 5890d9438bbde88b89070052926a2eafe13d7b42. 
- Revert "8343650: Reuse StringLatin1::putCharsAt and StringUTF16::putCharsAt" This reverts commit 74ae3c688b37e693e20eb4e17c631897c5464400. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23420/files - new: https://git.openjdk.org/jdk/pull/23420/files/d406bc1d..23d01f0a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23420&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23420&range=00-01 Stats: 36 lines in 1 file changed: 28 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/23420.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23420/head:pull/23420 PR: https://git.openjdk.org/jdk/pull/23420 From swen at openjdk.org Mon Feb 3 17:06:47 2025 From: swen at openjdk.org (Shaojin Wen) Date: Mon, 3 Feb 2025 17:06:47 GMT Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 16:37:09 GMT, Jaikiran Pai wrote: >> Can I please get a review of this change which backs out the commit that was introduced for https://bugs.openjdk.org/browse/JDK-8333893? >> >> The comment in the PR review of that issue https://github.com/openjdk/jdk/pull/19626#issuecomment-2628937397 explains what the issue is with the change that was integrated. Furthermore, one part of that original change introduced a few internal methods. These internal methods were then used in few other places within the JDK through https://bugs.openjdk.org/browse/JDK-8343650. As a result, this backout PR also reverts the change that was done in JDK-8343650. >> >> The backout was done as follows, using `git revert` against the 2 relevant commits: >> >> >> git revert 74ae3c688b37e693e20eb4e17c631897c5464400 >> git revert 5890d9438bbde88b89070052926a2eafe13d7b42 >> >> The revert of `5890d9438bbde88b89070052926a2eafe13d7b42` wasn't clean and I had to resolve a trivial conflict in `StringLatin1.java`. >> >> tier1, tier2 and tier3 testing is currently in progress with this change. 
Once this is integrated into mainline, a corresponding backport will be done to `jdk24` branch with the requisite approvals. > > Jaikiran Pai has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains two new commits since the last revision: > > - Revert "8333893: Optimization for StringBuilder append boolean & null" > > This reverts commit 5890d9438bbde88b89070052926a2eafe13d7b42. > - Revert "8343650: Reuse StringLatin1::putCharsAt and StringUTF16::putCharsAt" > > This reverts commit 74ae3c688b37e693e20eb4e17c631897c5464400. This concurrency problem also exists in the UTF16 scenario, so why only change to Latin1 here? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23420#issuecomment-2631574736 From jpai at openjdk.org Mon Feb 3 17:12:50 2025 From: jpai at openjdk.org (Jaikiran Pai) Date: Mon, 3 Feb 2025 17:12:50 GMT Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 17:04:33 GMT, Shaojin Wen wrote: > This concurrency problem also exists in the UTF16 scenario, so why only change to Latin1 here? Do you mean there are additional commits that have been done in the JDK which introduce a similar issue related to array writes beyond their limit that need to be backed out? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23420#issuecomment-2631588160 From redestad at openjdk.org Mon Feb 3 17:39:48 2025 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 3 Feb 2025 17:39:48 GMT Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 16:37:09 GMT, Jaikiran Pai wrote: >> Can I please get a review of this change which backs out the commit that was introduced for https://bugs.openjdk.org/browse/JDK-8333893? 
>> >> The comment in the PR review of that issue https://github.com/openjdk/jdk/pull/19626#issuecomment-2628937397 explains what the issue is with the change that was integrated. Furthermore, one part of that original change introduced a few internal methods. These internal methods were then used in few other places within the JDK through https://bugs.openjdk.org/browse/JDK-8343650. As a result, this backout PR also reverts the change that was done in JDK-8343650. >> >> The backout was done as follows, using `git revert` against the 2 relevant commits: >> >> >> git revert 74ae3c688b37e693e20eb4e17c631897c5464400 >> git revert 5890d9438bbde88b89070052926a2eafe13d7b42 >> >> The revert of `5890d9438bbde88b89070052926a2eafe13d7b42` wasn't clean and I had to resolve a trivial conflict in `StringLatin1.java`. >> >> tier1, tier2 and tier3 testing is currently in progress with this change. Once this is integrated into mainline, a corresponding backport will be done to `jdk24` branch with the requisite approvals. > > Jaikiran Pai has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains two new commits since the last revision: > > - Revert "8333893: Optimization for StringBuilder append boolean & null" > > This reverts commit 5890d9438bbde88b89070052926a2eafe13d7b42. > - Revert "8343650: Reuse StringLatin1::putCharsAt and StringUTF16::putCharsAt" > > This reverts commit 74ae3c688b37e693e20eb4e17c631897c5464400. Marked as reviewed by redestad (Reviewer). 
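For readers following the backout discussion, here is a rough sketch of the kind of up-front bounds check being talked about in this thread. It is not the JDK implementation: the helper names mirror the ones mentioned in the thread, but the bodies, the byte order, and the class name `CheckedPut` are assumptions made for illustration. In the JDK the inner write may be an unchecked access, which is why the explicit check matters there; in this plain-Java version the check mainly guarantees that nothing is written when the range does not fit.

```java
// Sketch only: a checked four-char write into a UTF-16 style byte[].
// Hypothetical stand-ins for JDK-internal helpers; not the real code.
final class CheckedPut {
    // Throws if [begin, end) is not a valid char range within value.
    static void checkBoundsBeginEnd(int begin, int end, byte[] value) {
        int length = value.length >> 1; // capacity in chars
        if (begin < 0 || begin > end || end > length) {
            throw new StringIndexOutOfBoundsException(
                "begin " + begin + ", end " + end + ", length " + length);
        }
    }

    // Writes one char as two bytes (big-endian order chosen arbitrarily here).
    static void putChar(byte[] value, int index, char c) {
        value[2 * index] = (byte) (c >> 8);
        value[2 * index + 1] = (byte) c;
    }

    // Checks the whole range first, then writes four chars; returns the new index.
    static int putCharsAt(byte[] value, int i, char c1, char c2, char c3, char c4) {
        int end = i + 4;
        checkBoundsBeginEnd(i, end, value);
        putChar(value, i, c1);
        putChar(value, i + 1, c2);
        putChar(value, i + 2, c3);
        putChar(value, i + 3, c4);
        return end;
    }
}
```

Checking the full range before the first store means a racy caller that passes a stale index gets an exception instead of a partial or out-of-range write.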
------------- PR Review: https://git.openjdk.org/jdk/pull/23420#pullrequestreview-2590593462 From liach at openjdk.org Mon Feb 3 17:39:49 2025 From: liach at openjdk.org (Chen Liang) Date: Mon, 3 Feb 2025 17:39:49 GMT Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 16:37:09 GMT, Jaikiran Pai wrote: >> Can I please get a review of this change which backs out the commit that was introduced for https://bugs.openjdk.org/browse/JDK-8333893? >> >> The comment in the PR review of that issue https://github.com/openjdk/jdk/pull/19626#issuecomment-2628937397 explains what the issue is with the change that was integrated. Furthermore, one part of that original change introduced a few internal methods. These internal methods were then used in few other places within the JDK through https://bugs.openjdk.org/browse/JDK-8343650. As a result, this backout PR also reverts the change that was done in JDK-8343650. >> >> The backout was done as follows, using `git revert` against the 2 relevant commits: >> >> >> git revert 74ae3c688b37e693e20eb4e17c631897c5464400 >> git revert 5890d9438bbde88b89070052926a2eafe13d7b42 >> >> The revert of `5890d9438bbde88b89070052926a2eafe13d7b42` wasn't clean and I had to resolve a trivial conflict in `StringLatin1.java`. >> >> tier1, tier2 and tier3 testing is currently in progress with this change. Once this is integrated into mainline, a corresponding backport will be done to `jdk24` branch with the requisite approvals. > > Jaikiran Pai has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains two new commits since the last revision: > > - Revert "8333893: Optimization for StringBuilder append boolean & null" > > This reverts commit 5890d9438bbde88b89070052926a2eafe13d7b42. 
> - Revert "8343650: Reuse StringLatin1::putCharsAt and StringUTF16::putCharsAt" > > This reverts commit 74ae3c688b37e693e20eb4e17c631897c5464400. Also for append(char): it uses putCharSB which does a check and calls unchecked putChar. putCharsAt is always checked. src/java.base/share/classes/java/lang/StringUTF16.java line 1538: > 1536: public static int putCharsAt(byte[] value, int i, char c1, char c2, char c3, char c4) { > 1537: int end = i + 4; > 1538: checkBoundsBeginEnd(i, end, value); We have this explicit check for null and true, false has another bound check. This backout version should be safe. ------------- Marked as reviewed by liach (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23420#pullrequestreview-2590596912 PR Review Comment: https://git.openjdk.org/jdk/pull/23420#discussion_r1939778610 From redestad at openjdk.org Mon Feb 3 17:39:49 2025 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 3 Feb 2025 17:39:49 GMT Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 17:04:33 GMT, Shaojin Wen wrote: >> Jaikiran Pai has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains two new commits since the last revision: >> >> - Revert "8333893: Optimization for StringBuilder append boolean & null" >> >> This reverts commit 5890d9438bbde88b89070052926a2eafe13d7b42. >> - Revert "8343650: Reuse StringLatin1::putCharsAt and StringUTF16::putCharsAt" >> >> This reverts commit 74ae3c688b37e693e20eb4e17c631897c5464400. > > This concurrency problem also exists in the UTF16 scenario, so why only change to Latin1 here? 
@wenshao this re-instates the `checkBoundsBeginEnd(i, end, value);` in the UTF16 case that was removed by the issue being backed out, so we get back to a state where we have appropriate bounds checks on all array accesses. An alternative to this backout would be to add `checkBoundsBeginEnd(i, end, value);` to all the `putCharsAt` methods, though it's unclear if that would undo the performance advantage. Better then to backout and - if possible - redo with a closer examination of the performance with a safer construct. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23420#issuecomment-2631644364 From jpai at openjdk.org Mon Feb 3 17:45:57 2025 From: jpai at openjdk.org (Jaikiran Pai) Date: Mon, 3 Feb 2025 17:45:57 GMT Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null [v2] In-Reply-To: References: Message-ID: <6XAQw2AHZiOnS2lGtFhGOhqMn1ebX7zCer7DUeBQSXQ=.b990b7ad-3bae-4e0d-8b8a-842d081ab827@github.com> On Mon, 3 Feb 2025 16:37:09 GMT, Jaikiran Pai wrote: >> Can I please get a review of this change which backs out the commit that was introduced for https://bugs.openjdk.org/browse/JDK-8333893? >> >> The comment in the PR review of that issue https://github.com/openjdk/jdk/pull/19626#issuecomment-2628937397 explains what the issue is with the change that was integrated. Furthermore, one part of that original change introduced a few internal methods. These internal methods were then used in few other places within the JDK through https://bugs.openjdk.org/browse/JDK-8343650. As a result, this backout PR also reverts the change that was done in JDK-8343650. >> >> The backout was done as follows, using `git revert` against the 2 relevant commits: >> >> >> git revert 74ae3c688b37e693e20eb4e17c631897c5464400 >> git revert 5890d9438bbde88b89070052926a2eafe13d7b42 >> >> The revert of `5890d9438bbde88b89070052926a2eafe13d7b42` wasn't clean and I had to resolve a trivial conflict in `StringLatin1.java`. 
>> >> tier1, tier2 and tier3 testing is currently in progress with this change. Once this is integrated into mainline, a corresponding backport will be done to `jdk24` branch with the requisite approvals. > > Jaikiran Pai has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains two new commits since the last revision: > > - Revert "8333893: Optimization for StringBuilder append boolean & null" > > This reverts commit 5890d9438bbde88b89070052926a2eafe13d7b42. > - Revert "8343650: Reuse StringLatin1::putCharsAt and StringUTF16::putCharsAt" > > This reverts commit 74ae3c688b37e693e20eb4e17c631897c5464400. Thank you Claes and Chen for the reviews. tier1, tier2 and tier3 testing is nearing completion without any failures. Could one of you approve that it's OK to integrate this trivial backout without waiting for 24 hours (the review process allows it https://openjdk.org/guide/#trivial-changes)? I will be integrating this as soon as the tier testing completes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23420#issuecomment-2631660611 From swen at openjdk.org Mon Feb 3 17:52:46 2025 From: swen at openjdk.org (Shaojin Wen) Date: Mon, 3 Feb 2025 17:52:46 GMT Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 16:37:09 GMT, Jaikiran Pai wrote: >> Can I please get a review of this change which backs out the commit that was introduced for https://bugs.openjdk.org/browse/JDK-8333893? >> >> The comment in the PR review of that issue https://github.com/openjdk/jdk/pull/19626#issuecomment-2628937397 explains what the issue is with the change that was integrated. Furthermore, one part of that original change introduced a few internal methods. 
These internal methods were then used in few other places within the JDK through https://bugs.openjdk.org/browse/JDK-8343650. As a result, this backout PR also reverts the change that was done in JDK-8343650. >> >> The backout was done as follows, using `git revert` against the 2 relevant commits: >> >> >> git revert 74ae3c688b37e693e20eb4e17c631897c5464400 >> git revert 5890d9438bbde88b89070052926a2eafe13d7b42 >> >> The revert of `5890d9438bbde88b89070052926a2eafe13d7b42` wasn't clean and I had to resolve a trivial conflict in `StringLatin1.java`. >> >> tier1, tier2 and tier3 testing is currently in progress with this change. Once this is integrated into mainline, a corresponding backport will be done to `jdk24` branch with the requisite approvals. > > Jaikiran Pai has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains two new commits since the last revision: > > - Revert "8333893: Optimization for StringBuilder append boolean & null" > > This reverts commit 5890d9438bbde88b89070052926a2eafe13d7b42. > - Revert "8343650: Reuse StringLatin1::putCharsAt and StringUTF16::putCharsAt" > > This reverts commit 74ae3c688b37e693e20eb4e17c631897c5464400. Can you wait for me for a while? I am looking for other solutions that do not require a fallback. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23420#issuecomment-2631675846 From dlunden at openjdk.org Mon Feb 3 17:56:51 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Mon, 3 Feb 2025 17:56:51 GMT Subject: [jdk24] RFR: 8348658: [AArch64] The node limit in compiler/codegen/TestMatcherClone.java is too strict In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 16:40:21 GMT, Aleksey Shipilev wrote: >> Hi all, >> >> This pull request contains a backport of commit [ee87d187](https://github.com/openjdk/jdk/commit/ee87d187d1cab09317b4f0068bfafc68efbbfe56) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. >> >> The commit being backported was authored by Daniel Lundén on 31 Jan 2025 and was reviewed by Aleksey Shipilev and Vladimir Kozlov. >> >> Thanks! > > There is no rush to do this in JDK 24 GA, but this is also a test-only change, so it formally passes the bar for RDP2. Thanks for the review @shipilev. Right, I thought it made sense to backport to 24 since it is such a small (test) change and we are still in RDP2. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23390#issuecomment-2631683753 From jpai at openjdk.org Mon Feb 3 18:00:48 2025 From: jpai at openjdk.org (Jaikiran Pai) Date: Mon, 3 Feb 2025 18:00:48 GMT Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null [v2] In-Reply-To: References: Message-ID: <91DwI-6uK7tLyI1MCemdvDbDgaakD763G8X4V8OwgR4=.466cec61-85f5-4776-979f-ae0b43d23bce@github.com> On Mon, 3 Feb 2025 17:50:23 GMT, Shaojin Wen wrote: > Can you wait for me for a while? I am looking for other solutions that do not require a fallback. The updated fix/change (if any) doesn't have to be rushed and you can take longer to work on it with additional help and reviews from others. Once this backout is integrated into mainline, it will be backported to jdk24 (which is to be released in a few days). 
The risk of waiting for an additional/different fix is higher compared to a backout. So we intend to go ahead with the backout. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23420#issuecomment-2631690902 From swen at openjdk.org Mon Feb 3 18:05:48 2025 From: swen at openjdk.org (Shaojin Wen) Date: Mon, 3 Feb 2025 18:05:48 GMT Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 16:37:09 GMT, Jaikiran Pai wrote: >> Can I please get a review of this change which backs out the commit that was introduced for https://bugs.openjdk.org/browse/JDK-8333893? >> >> The comment in the PR review of that issue https://github.com/openjdk/jdk/pull/19626#issuecomment-2628937397 explains what the issue is with the change that was integrated. Furthermore, one part of that original change introduced a few internal methods. These internal methods were then used in few other places within the JDK through https://bugs.openjdk.org/browse/JDK-8343650. As a result, this backout PR also reverts the change that was done in JDK-8343650. >> >> The backout was done as follows, using `git revert` against the 2 relevant commits: >> >> >> git revert 74ae3c688b37e693e20eb4e17c631897c5464400 >> git revert 5890d9438bbde88b89070052926a2eafe13d7b42 >> >> The revert of `5890d9438bbde88b89070052926a2eafe13d7b42` wasn't clean and I had to resolve a trivial conflict in `StringLatin1.java`. >> >> tier1, tier2 and tier3 testing is currently in progress with this change. Once this is integrated into mainline, a corresponding backport will be done to `jdk24` branch with the requisite approvals. > > Jaikiran Pai has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. 
The pull request contains two new commits since the last revision: > > - Revert "8333893: Optimization for StringBuilder append boolean & null" > > This reverts commit 5890d9438bbde88b89070052926a2eafe13d7b42. > - Revert "8343650: Reuse StringLatin1::putCharsAt and StringUTF16::putCharsAt" > > This reverts commit 74ae3c688b37e693e20eb4e17c631897c5464400. https://github.com/openjdk/jdk/pull/23423 I submitted another PR where I reproduced the issue locally and this PR fixed the issue. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23420#issuecomment-2631702737 From duke at openjdk.org Mon Feb 3 18:09:51 2025 From: duke at openjdk.org (Abdelhak Zaaim) Date: Mon, 3 Feb 2025 18:09:51 GMT Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null [v2] In-Reply-To: References: Message-ID: <4I1GV0W5jzwRthub-EQO7Cg329lQuJArIDqfVFnPRbc=.d7c5c80f-f179-42a2-ad3e-422fbdc37e8c@github.com> On Mon, 3 Feb 2025 16:37:09 GMT, Jaikiran Pai wrote: >> Can I please get a review of this change which backs out the commit that was introduced for https://bugs.openjdk.org/browse/JDK-8333893? >> >> The comment in the PR review of that issue https://github.com/openjdk/jdk/pull/19626#issuecomment-2628937397 explains what the issue is with the change that was integrated. Furthermore, one part of that original change introduced a few internal methods. These internal methods were then used in few other places within the JDK through https://bugs.openjdk.org/browse/JDK-8343650. As a result, this backout PR also reverts the change that was done in JDK-8343650. >> >> The backout was done as follows, using `git revert` against the 2 relevant commits: >> >> >> git revert 74ae3c688b37e693e20eb4e17c631897c5464400 >> git revert 5890d9438bbde88b89070052926a2eafe13d7b42 >> >> The revert of `5890d9438bbde88b89070052926a2eafe13d7b42` wasn't clean and I had to resolve a trivial conflict in `StringLatin1.java`. 
>> >> tier1, tier2 and tier3 testing is currently in progress with this change. Once this is integrated into mainline, a corresponding backport will be done to `jdk24` branch with the requisite approvals. > > Jaikiran Pai has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains two new commits since the last revision: > > - Revert "8333893: Optimization for StringBuilder append boolean & null" > > This reverts commit 5890d9438bbde88b89070052926a2eafe13d7b42. > - Revert "8343650: Reuse StringLatin1::putCharsAt and StringUTF16::putCharsAt" > > This reverts commit 74ae3c688b37e693e20eb4e17c631897c5464400. Marked as reviewed by abdelhak-zaaim at github.com (no known OpenJDK username). ------------- PR Review: https://git.openjdk.org/jdk/pull/23420#pullrequestreview-2590656730 From kvn at openjdk.org Mon Feb 3 18:09:51 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 3 Feb 2025 18:09:51 GMT Subject: RFR: 8348572: C2 compilation asserts due to unexpected irreducible loop [v2] In-Reply-To: References: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> <-c7xXeuSN-6QD-k6MA1-7Cv17ztENnt7Q0U6PprRrf0=.afd866ca-f6d8-4f09-8c66-ada7bd38c67f@github.com> Message-ID: On Mon, 3 Feb 2025 07:05:08 GMT, Emanuel Peter wrote: >> Not completely. >> >> The issue is we should avoid creating new irreducible loops. >> What if `l` is already marked as irreducible `l->_irreducible = 1` by following code? I don't see check above for such case. >> And again come here but secondary_entry can be irreducible (it is Region for example or already marked) and we skip bailout. Why it is okay? > >> The issue is we should avoid creating new irreducible loops. > > Absolutely. This is not a fix at all. But since we can catch that the state is wrong here, we should bail out instead of continuing on in production. 
That is at least a little improvement. The bug-fix would come in a second step, and may be much more complicated as it would have to reconsider what to do about `split_if`. Do we postpone until after loop-opts for example? Before considering such an involved fix, I would rather want to do this "defensive" patch first. Does that make sense? This also allows us to integrate [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570), which is currently blocked by the failing assert (which is now disabled, but bailout instead, can be enabled with the flag). > >> What if l is already marked as irreducible l->_irreducible = 1 by following code? I don't see check above for such case. > > If `l->_irreducible = 1`, then the corresponding node should have been marked as `MaybeIrreducibleEntry`, and so should also `m` be marked with `MaybeIrreducibleEntry`. If that is the case, we are fine, these `Region` were created at parsing and we currently consider that fine. > > But if one of the `Region` of the now irreducible loop was not marked accordingly already during parsing, then a new irreducible loop appeared during compilation - and that's not good. > >> And again come here but secondary_entry can be irreducible (it is Region for example or already marked) and we skip bailout. Why it is okay? > > If we marked a `Region` with `MaybeIrreducibleEntry`, then we treat it differently in some optimizations. For example, when the region loses a control input, we have to check if the loop is now dead, with a global connectivity search. For `LoopNode` losing the control input means we already know that the loop is dead, since that was the only loop entry. But for irreducible loops, losing one entry means we do not know if there is still a secondary entry or not. For context see [JDK-8280126](https://bugs.openjdk.org/browse/JDK-8280126). > > Does that answer your question? 
If not, we may have to talk about it offline ;)
>
> PS:
> The whole irreducible loop handling is still broken actually, see [JDK-8308675](https://bugs.openjdk.org/browse/JDK-8308675). We plan to eventually also disallow creation of irreducible loops at parsing. But that's a big project, and we were hoping for a student project. But we still also should not introduce new irreducible loops during parsing, and that's what...

First, I am fine with this "band-aid" change. I understand that it simply replaces the assert with a bailout, which is fine.
But I am trying to understand what it does.

There are a few states when we come to this part of the code:
- `l` is or is not marked as irreducible
- `m` is or is not marked with MaybeIrreducibleEntry (is it set only for non-Loop nodes?)
- `m` is or is not a Loop

So we have 8 combinations. I would like to hear the reasons for which cases we should bail out in and which not.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23363#discussion_r1939815495

From jwilhelm at openjdk.org Mon Feb 3 18:14:52 2025
From: jwilhelm at openjdk.org (Jesper Wilhelmsson)
Date: Mon, 3 Feb 2025 18:14:52 GMT
Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null [v2]
In-Reply-To: <6XAQw2AHZiOnS2lGtFhGOhqMn1ebX7zCer7DUeBQSXQ=.b990b7ad-3bae-4e0d-8b8a-842d081ab827@github.com>
References: <6XAQw2AHZiOnS2lGtFhGOhqMn1ebX7zCer7DUeBQSXQ=.b990b7ad-3bae-4e0d-8b8a-842d081ab827@github.com>
Message-ID:

On Mon, 3 Feb 2025 17:43:12 GMT, Jaikiran Pai wrote:

> Thank you Claes and Chen for the reviews. tier1, tier2 and tier3 testing is nearing completion without any failures. Could one of you approve that it's OK to integrate this trivial backout without waiting for 24 hours (the review process allows it https://openjdk.org/guide/#trivial-changes)? I will be integrating this as soon as the tier testing completes.

A backout is always considered a trivial change.
See https://openjdk.org/guide/#backing-out-a-change ------------- PR Comment: https://git.openjdk.org/jdk/pull/23420#issuecomment-2631720915 From jbhateja at openjdk.org Mon Feb 3 18:14:56 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 3 Feb 2025 18:14:56 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: On Thu, 30 Jan 2025 11:03:43 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. 
>> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java > > Co-authored-by: Emanuel Peter Hi @PaulSandoz , @eme64 , All outstanding comments have been addressed, please let me know if there are other comments. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2631719276 From epeter at openjdk.org Mon Feb 3 18:18:50 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 3 Feb 2025 18:18:50 GMT Subject: [jdk24] RFR: 8348658: [AArch64] The node limit in compiler/codegen/TestMatcherClone.java is too strict In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 14:25:30 GMT, Daniel Lundén wrote: > Hi all, > > This pull request contains a backport of commit [ee87d187](https://github.com/openjdk/jdk/commit/ee87d187d1cab09317b4f0068bfafc68efbbfe56) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Daniel Lundén on 31 Jan 2025 and was reviewed by Aleksey Shipilev and Vladimir Kozlov. > > Thanks! Marked as reviewed by epeter (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23390#pullrequestreview-2590673337 From swen at openjdk.org Mon Feb 3 18:24:50 2025 From: swen at openjdk.org (Shaojin Wen) Date: Mon, 3 Feb 2025 18:24:50 GMT Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 16:37:09 GMT, Jaikiran Pai wrote: >> Can I please get a review of this change which backs out the commit that was introduced for https://bugs.openjdk.org/browse/JDK-8333893? >> >> The comment in the PR review of that issue https://github.com/openjdk/jdk/pull/19626#issuecomment-2628937397 explains what the issue is with the change that was integrated.
Furthermore, one part of that original change introduced a few internal methods. These internal methods were then used in few other places within the JDK through https://bugs.openjdk.org/browse/JDK-8343650. As a result, this backout PR also reverts the change that was done in JDK-8343650. >> >> The backout was done as follows, using `git revert` against the 2 relevant commits: >> >> >> git revert 74ae3c688b37e693e20eb4e17c631897c5464400 >> git revert 5890d9438bbde88b89070052926a2eafe13d7b42 >> >> The revert of `5890d9438bbde88b89070052926a2eafe13d7b42` wasn't clean and I had to resolve a trivial conflict in `StringLatin1.java`. >> >> tier1, tier2 and tier3 testing is currently in progress with this change. Once this is integrated into mainline, a corresponding backport will be done to `jdk24` branch with the requisite approvals. > > Jaikiran Pai has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains two new commits since the last revision: > > - Revert "8333893: Optimization for StringBuilder append boolean & null" > > This reverts commit 5890d9438bbde88b89070052926a2eafe13d7b42. > - Revert "8343650: Reuse StringLatin1::putCharsAt and StringUTF16::putCharsAt" > > This reverts commit 74ae3c688b37e693e20eb4e17c631897c5464400. 
Don't rush to roll back; the current rollback solution still has problems. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23420#issuecomment-2631734673 From dlunden at openjdk.org Mon Feb 3 18:24:52 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Mon, 3 Feb 2025 18:24:52 GMT Subject: [jdk24] RFR: 8348658: [AArch64] The node limit in compiler/codegen/TestMatcherClone.java is too strict In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 18:16:16 GMT, Emanuel Peter wrote: >> Hi all, >> >> This pull request contains a backport of commit [ee87d187](https://github.com/openjdk/jdk/commit/ee87d187d1cab09317b4f0068bfafc68efbbfe56) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. >> >> The commit being backported was authored by Daniel Lundén on 31 Jan 2025 and was reviewed by Aleksey Shipilev and Vladimir Kozlov. >> >> Thanks! > > Marked as reviewed by epeter (Reviewer). Thanks for the review @eme64. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23390#issuecomment-2631738690 From jpai at openjdk.org Mon Feb 3 18:24:51 2025 From: jpai at openjdk.org (Jaikiran Pai) Date: Mon, 3 Feb 2025 18:24:51 GMT Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 16:37:09 GMT, Jaikiran Pai wrote: >> Can I please get a review of this change which backs out the commit that was introduced for https://bugs.openjdk.org/browse/JDK-8333893? >> >> The comment in the PR review of that issue https://github.com/openjdk/jdk/pull/19626#issuecomment-2628937397 explains what the issue is with the change that was integrated. Furthermore, one part of that original change introduced a few internal methods. These internal methods were then used in few other places within the JDK through https://bugs.openjdk.org/browse/JDK-8343650. As a result, this backout PR also reverts the change that was done in JDK-8343650.
>> >> The backout was done as follows, using `git revert` against the 2 relevant commits: >> >> >> git revert 74ae3c688b37e693e20eb4e17c631897c5464400 >> git revert 5890d9438bbde88b89070052926a2eafe13d7b42 >> >> The revert of `5890d9438bbde88b89070052926a2eafe13d7b42` wasn't clean and I had to resolve a trivial conflict in `StringLatin1.java`. >> >> tier1, tier2 and tier3 testing is currently in progress with this change. Once this is integrated into mainline, a corresponding backport will be done to `jdk24` branch with the requisite approvals. > > Jaikiran Pai has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains two new commits since the last revision: > > - Revert "8333893: Optimization for StringBuilder append boolean & null" > > This reverts commit 5890d9438bbde88b89070052926a2eafe13d7b42. > - Revert "8343650: Reuse StringLatin1::putCharsAt and StringUTF16::putCharsAt" > > This reverts commit 74ae3c688b37e693e20eb4e17c631897c5464400. Thank you all for the help on this one. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23420#issuecomment-2631740096 From dlunden at openjdk.org Mon Feb 3 18:24:53 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Mon, 3 Feb 2025 18:24:53 GMT Subject: [jdk24] Integrated: 8348658: [AArch64] The node limit in compiler/codegen/TestMatcherClone.java is too strict In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 14:25:30 GMT, Daniel Lundén wrote: > Hi all, > > This pull request contains a backport of commit [ee87d187](https://github.com/openjdk/jdk/commit/ee87d187d1cab09317b4f0068bfafc68efbbfe56) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Daniel Lundén on 31 Jan 2025 and was reviewed by Aleksey Shipilev and Vladimir Kozlov. > > Thanks! This pull request has now been integrated.
Changeset: 47c15b5f Author: Daniel Lundén URL: https://git.openjdk.org/jdk/commit/47c15b5ff8734758679b6678f56475ea8e449df1 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8348658: [AArch64] The node limit in compiler/codegen/TestMatcherClone.java is too strict Reviewed-by: shade, epeter Backport-of: ee87d187d1cab09317b4f0068bfafc68efbbfe56 ------------- PR: https://git.openjdk.org/jdk/pull/23390 From jpai at openjdk.org Mon Feb 3 18:24:52 2025 From: jpai at openjdk.org (Jaikiran Pai) Date: Mon, 3 Feb 2025 18:24:52 GMT Subject: Integrated: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 15:48:00 GMT, Jaikiran Pai wrote: > Can I please get a review of this change which backs out the commit that was introduced for https://bugs.openjdk.org/browse/JDK-8333893? > > The comment in the PR review of that issue https://github.com/openjdk/jdk/pull/19626#issuecomment-2628937397 explains what the issue is with the change that was integrated. Furthermore, one part of that original change introduced a few internal methods. These internal methods were then used in few other places within the JDK through https://bugs.openjdk.org/browse/JDK-8343650. As a result, this backout PR also reverts the change that was done in JDK-8343650. > > The backout was done as follows, using `git revert` against the 2 relevant commits: > > > git revert 74ae3c688b37e693e20eb4e17c631897c5464400 > git revert 5890d9438bbde88b89070052926a2eafe13d7b42 > > The revert of `5890d9438bbde88b89070052926a2eafe13d7b42` wasn't clean and I had to resolve a trivial conflict in `StringLatin1.java`. > > tier1, tier2 and tier3 testing is currently in progress with this change. Once this is integrated into mainline, a corresponding backport will be done to `jdk24` branch with the requisite approvals. This pull request has now been integrated.
Changeset: 618c5eb2
Author: Jaikiran Pai
URL: https://git.openjdk.org/jdk/commit/618c5eb27b4c719afd577b690e6bcb21a45fcb0d
Stats: 169 lines in 6 files changed: 46 ins; 79 del; 44 mod

8349183: [BACKOUT] Optimization for StringBuilder append boolean & null
8349239: [BACKOUT] Reuse StringLatin1::putCharsAt and StringUTF16::putCharsAt

Reviewed-by: redestad, liach

-------------

PR: https://git.openjdk.org/jdk/pull/23420

From jpai at openjdk.org Mon Feb 3 18:47:27 2025
From: jpai at openjdk.org (Jaikiran Pai)
Date: Mon, 3 Feb 2025 18:47:27 GMT
Subject: [jdk24] RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null
Message-ID:

Can I please get a review of this backport of https://github.com/openjdk/jdk/pull/23420 into jdk24?

This proposes to bring in those same backouts into `jdk24` to prevent the issue noted in that PR description. jdk24 is in rampdown and this backport will require an approval. An approval request has been raised in https://bugs.openjdk.org/browse/JDK-8349183?focusedId=14746841&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14746841.

This backport into jdk24 wasn't clean due to a trivial merge conflict in the `StringLatin1.java` file. That merge conflict was manually resolved (just like it was done against mainline). The git commands used to create this backport against the jdk24 branch are:

    git cherry-pick --no-commit 618c5eb27b4c719afd577b690e6bcb21a45fcb0d
    git commit -m 'Backport 618c5eb27b4c719afd577b690e6bcb21a45fcb0d'

tier1, tier2 and tier3 testing is currently in progress with this change.
------------- Commit messages: - Backport 618c5eb27b4c719afd577b690e6bcb21a45fcb0d Changes: https://git.openjdk.org/jdk/pull/23425/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23425&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349183 Stats: 169 lines in 6 files changed: 46 ins; 79 del; 44 mod Patch: https://git.openjdk.org/jdk/pull/23425.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23425/head:pull/23425 PR: https://git.openjdk.org/jdk/pull/23425 From kvn at openjdk.org Mon Feb 3 18:58:47 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 3 Feb 2025 18:58:47 GMT Subject: [jdk24] RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 18:41:44 GMT, Jaikiran Pai wrote: > Can I please get a review of this backport of https://github.com/openjdk/jdk/pull/23420 into jdk24? > > This proposes to bring in those same backouts into `jdk24` to prevent the issue noted in that PR description. jdk24 is in rampdown and this backport will require an approval. A approval request has been raised in https://bugs.openjdk.org/browse/JDK-8349183?focusedId=14746841&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14746841. > > This backport into jdk24 wasn't clean due to a trivial merge conflict in `StringLatin1.java` file. That merge conflict was manually resolved (just like it was done against mainline). The git commands used to create this backport against jdk24 branch are: > > > > git cherry-pick --no-commit 618c5eb27b4c719afd577b690e6bcb21a45fcb0d > > git commit -m 'Backport 618c5eb27b4c719afd577b690e6bcb21a45fcb0d' > > > tier1, tier2 and tier3 testing is currently in progress with this change. @jaikiran You need to get approval for JDK 24 backport. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23425#issuecomment-2631808683 From swen at openjdk.org Mon Feb 3 19:02:56 2025 From: swen at openjdk.org (Shaojin Wen) Date: Mon, 3 Feb 2025 19:02:56 GMT Subject: RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 16:37:09 GMT, Jaikiran Pai wrote: >> Can I please get a review of this change which backs out the commit that was introduced for https://bugs.openjdk.org/browse/JDK-8333893? >> >> The comment in the PR review of that issue https://github.com/openjdk/jdk/pull/19626#issuecomment-2628937397 explains what the issue is with the change that was integrated. Furthermore, one part of that original change introduced a few internal methods. These internal methods were then used in few other places within the JDK through https://bugs.openjdk.org/browse/JDK-8343650. As a result, this backout PR also reverts the change that was done in JDK-8343650. >> >> The backout was done as follows, using `git revert` against the 2 relevant commits: >> >> >> git revert 74ae3c688b37e693e20eb4e17c631897c5464400 >> git revert 5890d9438bbde88b89070052926a2eafe13d7b42 >> >> The revert of `5890d9438bbde88b89070052926a2eafe13d7b42` wasn't clean and I had to resolve a trivial conflict in `StringLatin1.java`. >> >> tier1, tier2 and tier3 testing is currently in progress with this change. Once this is integrated into mainline, a corresponding backport will be done to `jdk24` branch with the requisite approvals. > > Jaikiran Pai has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains two new commits since the last revision: > > - Revert "8333893: Optimization for StringBuilder append boolean & null" > > This reverts commit 5890d9438bbde88b89070052926a2eafe13d7b42. 
> - Revert "8343650: Reuse StringLatin1::putCharsAt and StringUTF16::putCharsAt" > > This reverts commit 74ae3c688b37e693e20eb4e17c631897c5464400. The problem still exists; we need to complete another PR: https://github.com/openjdk/jdk/pull/23427 ------------- PR Comment: https://git.openjdk.org/jdk/pull/23420#issuecomment-2631816765 From jpai at openjdk.org Mon Feb 3 19:10:51 2025 From: jpai at openjdk.org (Jaikiran Pai) Date: Mon, 3 Feb 2025 19:10:51 GMT Subject: [jdk24] RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 18:55:56 GMT, Vladimir Kozlov wrote: > @jaikiran You need to get approval for JDK 24 backport. Agreed. An approval request has already been raised: https://bugs.openjdk.org/browse/JDK-8349183?focusedId=14746841&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14746841. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23425#issuecomment-2631838094 From epeter at openjdk.org Mon Feb 3 19:10:52 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 3 Feb 2025 19:10:52 GMT Subject: RFR: 8348572: C2 compilation asserts due to unexpected irreducible loop [v2] In-Reply-To: References: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> <-c7xXeuSN-6QD-k6MA1-7Cv17ztENnt7Q0U6PprRrf0=.afd866ca-f6d8-4f09-8c66-ada7bd38c67f@github.com> Message-ID: On Mon, 3 Feb 2025 18:06:55 GMT, Vladimir Kozlov wrote: >>> The issue is we should avoid creating new irreducible loops. >> >> Absolutely. This is not a fix at all. But since we can catch that the state is wrong here, we should bail out instead of continuing on in production. That is at least a little improvement. The bug-fix would come in a second step, and may be much more complicated as it would have to reconsider what to do about `split_if`. Do we postpone until after loop-opts for example?
Before considering such an involved fix, I would rather want to do this "defensive" patch first. Does that make sense? This also allows us to integrate [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570), which is currently blocked by the failing assert (which is now disabled, but bailout instead, can be enabled with the flag). >> >>> What if l is already marked as irreducible l->_irreducible = 1 by following code? I don't see check above for such case. >> >> If `l->_irreducible = 1`, then the corresponding node should have been marked as `MaybeIrreducibleEntry`, and so should also `m` be marked with `MaybeIrreducibleEntry`. If that is the case, we are fine, these `Region` were created at parsing and we currently consider that fine. >> >> But if one of the `Region` of the now irreducible loop was not marked accordingly already during parsing, then a new irreducible loop appeared during compilation - and that's not good. >> >>> And again come here but secondary_entry can be irreducible (it is Region for example or already marked) and we skip bailout. Why it is okay? >> >> If we marked a `Region` with `MaybeIrreducibleEntry`, then we treat it differently in some optimizations. For example, when the region loses a control input, we have to check if the loop is now dead, with a global connectivity search. For `LoopNode` losing the control input means we already know that the loop is dead, since that was the only loop entry. But for irreducible loops, losing one entry means we do not know if there is still a secondary entry or not. For context see [JDK-8280126](https://bugs.openjdk.org/browse/JDK-8280126). >> >> Does that answer your question? If not, we may have to talk about it offline ;) >> >> PS: >> The whole irreducible loop handling is still broken actually, see [JDK-8308675](https://bugs.openjdk.org/browse/JDK-8308675). We plan to eventually also disallow creation of irreducible loops at parsing. 
But that's a big project, and we were hoping for a student project. But we still also should not introduce new irreducible loop...
>
> First, I am fine with this "band-aid" change. I understand that it simply replaces the assert with a bailout, which is fine.
> But I am trying to understand what it does.
>
> There are a few states when we come to this part of the code:
> - `l` is or is not marked as irreducible
> - `m` is or is not marked with MaybeIrreducibleEntry (is it set only for non-Loop nodes?)
> - `m` is or is not a Loop
>
> So we have 8 combinations. I would like to hear the reasons for which cases we should bail out in and which not.

Yeah, the existing code is not exactly pretty or straightforward.

-------------------------------

Let me explain what we know when we get here, with `m` as `secondary_entry`:

`m`: the CFG node we are currently looking at in the DFS. We know that `m` is located inside the `IdealLoopTree* l`. But now we just found out that `l` is already `postvisited`, i.e. that we already walked all CFG nodes in `l` before, and exited back through its original loop entry node. That means the traversal has left the `l` loop structure, and has now found a second way into that loop structure `l` at `m`. Hence, we know that `m` must be a secondary entry to `l`. We now know that `l` must be irreducible, so we set `l->_irreducible = 1`.

----------------------------------

This is the `LoopStatus` definition:

    enum LoopStatus {
      // No guarantee: the region may be an irreducible loop entry, thus we have to
      // be careful when removing entry control to it.
      MaybeIrreducibleEntry,
      // Limited guarantee: this region may be (nested) inside an irreducible loop,
      // but it will never be an irreducible loop entry.
      NeverIrreducibleEntry,
      // Strong guarantee: this region is not (nested) inside an irreducible loop.
      Reducible,
    };

-------------------------------

About the combinations:

- I don't think that it matters if `l` is already marked `irreducible` or not.
  For example if `m` is marked with `NeverIrreducibleEntry`, then `l` could be irreducible, i.e. have multiple entries. But `m` is not allowed to be one of those.
- For `m` we have 3 cases:
  - `MaybeIrreducibleEntry`: everything is ok, we expect that `m` could be a secondary entry to a loop.
  - `NeverIrreducibleEntry`: it would be ok if `l` is irreducible, but that should not happen through `m` as a secondary entry. An assumption was violated - the graph state is incoherent.
  - `Reducible`: Here we do not expect `l` to be irreducible, and `m` should not be a secondary entry either. An assumption was violated - the graph state is incoherent.

That is already sufficient to justify the bailout here.

------------------------------

But let me consider if `m` is a `Loop` or just `Region` anyway. What if `m` is:

- NOT `LoopNode`, just `Region`: we just found a new edge entering `l`. This edge was not reachable from inside the loop structure `l`, and so it must be a secondary entry.
- `LoopNode`, it has an entry `in(1)` and a backedge `in(2)`:
  - Assume we just came via the backedge: that is contradictory, as the backedge should be reachable from inside the loop, and hence we should already have visited that edge before declaring `l` postvisited.
  - Assume we came via the entry: I suppose that looks ok at first.... But we only turn `Region` into `Loop` if we are sure it is not in an irreducible loop, so I think the assumption is that a `LoopNode` only has reducible CFG inside its loop structure. Look at `beautify_loops`: After we know that the `Region` only has 2 entries, and the `l` is reducible:

        } else if (!_head->is_Loop() && !_irreducible) {
          // Make a new LoopNode to replace the old loop head
          Node *l = new LoopNode( _head->in(1), _head->in(2) );

    Hence, if we found any secondary entry into a `LoopNode* m`, that would also be a contradiction.

I could be wrong about this, and again, I think it does not matter.
All that matters is that we already have a contradiction if `m` is not marked with `MaybeIrreducibleEntry`, as we saw above. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23363#discussion_r1939891275 From aph at openjdk.org Mon Feb 3 19:57:42 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 3 Feb 2025 19:57:42 GMT Subject: RFR: 8349106: Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64 In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 16:14:23 GMT, Jamil Nimeh wrote: > In terms of explaining the algorithm changes, I could add some comment text to the header of the stub function that better explains the general idea behind what is being done. It would certainly help anyone maintaining it down the line (myself included). Your call if you want to describe it yourself, or quote the paper, or both. But please do one! The paper is rather readable. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23397#issuecomment-2631929964 From duke at openjdk.org Mon Feb 3 23:33:18 2025 From: duke at openjdk.org (duke) Date: Mon, 3 Feb 2025 23:33:18 GMT Subject: Withdrawn: 8341908: CodeHeapAnalytics: Output Imperfections and unwanted vm termination In-Reply-To: References: Message-ID: On Thu, 10 Oct 2024 14:45:55 GMT, Lutz Schmidt wrote: > Output is properly aligned again now. Was messed up when method hotness was removed (part of method sweeper). > Assertions have been replaced by printing an error message and gracefully returning. Avoids vm crashes caused by diagnostic actions. > Some code restructuring, removal of redundancies. > > Reviews are highly welcomed. This pull request has been closed without being integrated.
------------- PR: https://git.openjdk.org/jdk/pull/21452 From jnimeh at openjdk.org Mon Feb 3 23:56:18 2025 From: jnimeh at openjdk.org (Jamil Nimeh) Date: Mon, 3 Feb 2025 23:56:18 GMT Subject: RFR: 8349106: Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64 [v2] In-Reply-To: References: Message-ID: > This enhancement makes a change to the ChaCha20 block function intrinsic on aarch64, moving away from the block parallel implementation and to the quarter-round parallel implementation that was done on x86_64. Assembly language profiling yielded an 11% improvement in throughput. When put together as an intrinsic and hooked into the JCE ChaCha20 cipher, the gains are more modest, somewhere in the 2-4% range depending on job size, but still an improvement. Jamil Nimeh has updated the pull request incrementally with one additional commit since the last revision: Add explanatory comment and reference for quarter round intrinsic ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23397/files - new: https://git.openjdk.org/jdk/pull/23397/files/41817c77..6ba0770b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23397&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23397&range=00-01 Stats: 25 lines in 1 file changed: 25 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23397.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23397/head:pull/23397 PR: https://git.openjdk.org/jdk/pull/23397 From amitkumar at openjdk.org Tue Feb 4 03:08:29 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Tue, 4 Feb 2025 03:08:29 GMT Subject: Integrated: 8349193: compiler/intrinsics/TestContinuationPinningAndEA.java missing @requires vm.continuations In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 03:04:03 GMT, Amit Kumar wrote: > As title says, test is missing require vm.continuations, as continuations are not yet supported on s390x, I saw this test failure in my recent builds. 
This pull request has now been integrated. Changeset: 7ea176d7 Author: Amit Kumar URL: https://git.openjdk.org/jdk/commit/7ea176d79c126c69cea5631d6542cd42bd8b11d9 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8349193: compiler/intrinsics/TestContinuationPinningAndEA.java missing @requires vm.continuations Reviewed-by: chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/23412 From amitkumar at openjdk.org Tue Feb 4 03:08:28 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Tue, 4 Feb 2025 03:08:28 GMT Subject: RFR: 8349193: compiler/intrinsics/TestContinuationPinningAndEA.java missing @requires vm.continuations In-Reply-To: References: Message-ID: <1SumnVdt2NpnNqDL7pHnU5X8HMvP8J0VUWbBT1W1HZA=.15324885-f5ff-458d-b6c9-73426a209f11@github.com> On Mon, 3 Feb 2025 09:01:38 GMT, Christian Hagedorn wrote: >> As title says, test is missing require vm.continuations, as continuations are not yet supported on s390x, I saw this test failure in my recent builds. > > Looks good and trivial. Thanks @chhagedorn for the approval. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23412#issuecomment-2632713978 From jbhateja at openjdk.org Tue Feb 4 06:18:13 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 4 Feb 2025 06:18:13 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 14:36:54 GMT, Quan Anh Mai wrote: > Could you instead do this by trying to transform `AndI(MoveF2I(x), MoveF2I(y))` into `AndF(x, y)` instead? @merykitty , this patch does not break existing IR invariants as multiple targets already emit efficient instruction sequences for it, we have just improved upon the x86-backed implementation. ![image](https://github.com/user-attachments/assets/61845793-ca3a-4ad2-8ee8-210f8a1bc60d) Introducing another new IR "AndF" will again need changes in auto-vectorization. 
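For readers following the `Math.copySign` exchange above: the `AndI(MoveF2I(x), MoveF2I(y))` shape mentioned in the quoted question is integer masking applied to the raw bits of a float. A minimal sketch of that bit-level operation is given below. It is plain C++, not HotSpot code; the name `copy_sign_bits` is hypothetical and `std::memcpy` merely stands in for the bit move that `MoveF2I` performs in the IR.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of HotSpot): returns `magnitude` with the
// sign bit of `sign`, computed purely with integer bit operations.
float copy_sign_bits(float magnitude, float sign) {
  std::uint32_t m = 0;
  std::uint32_t s = 0;
  std::memcpy(&m, &magnitude, sizeof m);  // float bits -> integer
  std::memcpy(&s, &sign, sizeof s);

  const std::uint32_t sign_mask = 0x80000000u;  // IEEE 754 binary32 sign bit
  const std::uint32_t r = (m & ~sign_mask) | (s & sign_mask);

  float out = 0.0f;
  std::memcpy(&out, &r, sizeof out);      // integer bits -> float
  return out;
}
```

Whether a backend emits this as integer or floating-point logic instructions is the point under discussion in the thread; the sketch only shows the arithmetic being performed.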
------------- PR Comment: https://git.openjdk.org/jdk/pull/23386#issuecomment-2632973584 From fyang at openjdk.org Tue Feb 4 07:06:22 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 4 Feb 2025 07:06:22 GMT Subject: RFR: 8347489: RISC-V: Misaligned memory access with COH [v8] In-Reply-To: References: Message-ID: > Hi, please consider this change. > > We have different base_offset for T_BYTE/T_CHAR (4-byte instead of 8-byte aligned) with COH. This causes misaligned memory accesses for several intrinsics like String.Compare or String.Equals. The reason is that we assume 8-byte alignment and process one 8-byte word starting at the first array element for each iteration in the main loop. As a result, we have performance regressions on platforms with slow misaligned memory accesses like Unmatched and Premier P550 SBCs. > > PS: Same issue is there even without COH. base_offset for T_BYTE/T_CHAR is 20 (thus 4-byte aligned) when `UseCompressedClassPointers` is disabled in this case. > > Correctness test on linux-riscv64: > - [x] tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseCompactObjectHeaders") (release) > - [x] tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:-UseCompactObjectHeaders") (release) > - [x] hotspot:tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseCompactObjectHeaders") (fastdebug) > - [x] hotspot:tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:-UseCompactObjectHeaders") (fastdebug) > > Performance test on Premier P550 (-XX:+AlwaysPreTouch -Xms8g -Xmx8g): > > SPECjbb2005: > > 1. Without Patch > 1.1 -XX:+UseParallelGC -XX:-UseCompactObjectHeaders: 32666 > 1.2 -XX:+UseParallelGC -XX:+UseCompactObjectHeaders: 27610 > 1.3 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: 30911 > 1.4 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: 26008 > > 2.
With Patch > 2.1 -XX:+UseParallelGC -XX:-UseCompactObjectHeaders: 32820 > 2.2 -XX:+UseParallelGC -XX:+UseCompactObjectHeaders: 34179 > 2.3 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: 30620 > 2.4 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: 31936 > > > SPECjbb2015: > > 1. Without Patch > 1.1 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: max-jOPS = 1444, critical-jOPS = 431 > 1.2 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: max-jOPS = 1092, critical-jOPS = 335 > > 2. With Patch > 2.1 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: max-jOPS = 1452, critical-jOPS = 419 > 2.2 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: max-jOPS = 1438, critical-jOPS = 477 Fei Yang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: - Review comment - Merge branch 'master' into JDK-8347489 - Merge branch 'master' into JDK-8347489 - Review comment - Review comment - Merge branch 'master' into JDK-8347489 - Merge branch 'master' into JDK-8347489 - Comment - Fix assertions - Add assertions - ... and 2 more: https://git.openjdk.org/jdk/compare/7ea176d7...db326650 ------------- Changes: https://git.openjdk.org/jdk/pull/23053/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23053&range=07 Stats: 132 lines in 3 files changed: 106 ins; 2 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/23053.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23053/head:pull/23053 PR: https://git.openjdk.org/jdk/pull/23053 From fyang at openjdk.org Tue Feb 4 07:06:23 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 4 Feb 2025 07:06:23 GMT Subject: RFR: 8347489: RISC-V: Misaligned memory access with COH [v7] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 12:29:47 GMT, Hamlin Li wrote: >> Fei Yang has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 10 commits: >> >> - Merge branch 'master' into JDK-8347489 >> - Review comment >> - Review comment >> - Merge branch 'master' into JDK-8347489 >> - Merge branch 'master' into JDK-8347489 >> - Comment >> - Fix assertions >> - Add assertions >> - Comment >> - 8347489: RISC-V: Misaligned memory access with COH > > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1463: > >> 1461: { >> 1462: if (str1_isL == str2_isL) { // LL or UU >> 1463: #ifdef ASSERT > > Can we add the some comment about 8-bytes alignment below at line 1520? > I know it's redundant, but as the code is getting more complicated, it might be good to add it to improve the readability. Sure. Done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23053#discussion_r1940606717 From epeter at openjdk.org Tue Feb 4 07:23:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 07:23:11 GMT Subject: RFR: 8348572: C2 compilation asserts due to unexpected irreducible loop [v2] In-Reply-To: References: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> <-c7xXeuSN-6QD-k6MA1-7Cv17ztENnt7Q0U6PprRrf0=.afd866ca-f6d8-4f09-8c66-ada7bd38c67f@github.com> Message-ID: <8B6TY4A1E4255-ggl7DI31oerizDCuI6Jlf6vo98oUA=.afc1fc06-ea89-461a-a992-bf22afa2ea6f@github.com> On Mon, 3 Feb 2025 18:06:55 GMT, Vladimir Kozlov wrote: >>> The issue is we should avoid creating new irreducible loops. >> >> Absolutely. This is not a fix at all. But since we can catch that the state is wrong here, we should bail out instead of continuing on in production. That is at least a little improvement. The bug-fix would come in a second step, and may be much more complicated as it would have to reconsider what to do about `split_if`. Do we postpone until after loop-opts for example? Before considering such an involved fix, I would rather want to do this "defensive" patch first. Does that make sense? 
This also allows us to integrate [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570), which is currently blocked by the failing assert (which is now disabled, but bailout instead, can be enabled with the flag). >> >>> What if l is already marked as irreducible l->_irreducible = 1 by following code? I don't see check above for such case. >> >> If `l->_irreducible = 1`, then the corresponding node should have been marked as `MaybeIrreducibleEntry`, and so should also `m` be marked with `MaybeIrreducibleEntry`. If that is the case, we are fine, these `Region` were created at parsing and we currently consider that fine. >> >> But if one of the `Region` of the now irreducible loop was not marked accordingly already during parsing, then a new irreducible loop appeared during compilation - and that's not good. >> >>> And again come here but secondary_entry can be irreducible (it is Region for example or already marked) and we skip bailout. Why it is okay? >> >> If we marked a `Region` with `MaybeIrreducibleEntry`, then we treat it differently in some optimizations. For example, when the region loses a control input, we have to check if the loop is now dead, with a global connectivity search. For `LoopNode` losing the control input means we already know that the loop is dead, since that was the only loop entry. But for irreducible loops, losing one entry means we do not know if there is still a secondary entry or not. For context see [JDK-8280126](https://bugs.openjdk.org/browse/JDK-8280126). >> >> Does that answer your question? If not, we may have to talk about it offline ;) >> >> PS: >> The whole irreducible loop handling is still broken actually, see [JDK-8308675](https://bugs.openjdk.org/browse/JDK-8308675). We plan to eventually also disallow creation of irreducible loops at parsing. But that's a big project, and we were hoping for a student project. But we still also should not introduce new irreducible loop... 
> > First, I am fine with this "band-aid" change. I understand that it simple replaces assert with bailout which is fine. > But I am trying to understand what it does. > > There are few states when we come to this part of code: > - `l` is or not marked as irreducible > - `m` is or not marked with MaybeIrreducibleEntry (is it set only for not Loop?) > - `m` is or not Loop > > So we have 8 combinations. I would like to hear reasons in which cases we should bailout and in which not. @vnkozlov Does that help you, or do you have more questions? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23363#discussion_r1940624374 From chagedorn at openjdk.org Tue Feb 4 07:36:17 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 4 Feb 2025 07:36:17 GMT Subject: Integrated: 8346774: Use Predicate classes instead of Node classes In-Reply-To: References: Message-ID: On Wed, 22 Jan 2025 13:10:37 GMT, Christian Hagedorn wrote: > This small cleanup PR replaces a lot of usages of `Node` pointers, to pass around either the head (i.e. `IfNode`) or the tail (i.e. a success projection) of predicates, with actual `Predicate` classes. This simplifies the usages, readability and the logical flow, and enables more simplifications in the future, especially once we replace Template Assertion Predicates with a dedicated node. > > I've also included some minor refactorings like adding `const` or fixing typos. > > There are no semantic changes involved. The return value optimization should take care to avoid a lot of copies when returning new objects from methods. > > Thanks, > Christian This pull request has now been integrated. 
Changeset: c545a3e0 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/c545a3e028ad0760ed2f996e8bb7c56d28e4570a Stats: 124 lines in 2 files changed: 29 ins; 4 del; 91 mod 8346774: Use Predicate classes instead of Node classes Reviewed-by: epeter, kvn ------------- PR: https://git.openjdk.org/jdk/pull/23234 From thartmann at openjdk.org Tue Feb 4 07:58:15 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 4 Feb 2025 07:58:15 GMT Subject: RFR: 8333697: C2: Hit MemLimit in PhaseCFG::global_code_motion [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 14:43:30 GMT, Roland Westrelin wrote: >> I investigated the failure from the `Test.java` that's attached to the >> bug. The failure with this test is only reproducible up to 8334060 >> (Implementation of Late Barrier Expansion for G1) so experiments I >> describe here are from the source code for the commit right before it. >> >> Peak malloc memory usage reported by NMT is: 1.3GB >> >> `PhaseCFG::global_code_motion()`, when `OptoRegScheduling` is true, >> creates a `PhaseIFG` that's, when initialized, allocates `_adjs`: a >> `maxlrg` array of `IndexSet`s that can contain up to `maxlrg`. >> >> `maxlrg` in this case is 122839. An `IndexSet` is an array of pointers >> to a 256 bit bitset: one `IndexSet` array needs: >> >> >> 122839 / 256 * 8 = 3832 >> >> >> and there are of 122839: >> >> >> 3832 * 122839 = ~470 MB >> >> >> It turns out the `PhaseIFG` object when used from >> `PhaseCFG::global_code_motion()` doesn't even use the `_adjs` >> array. 
So a patch like: >> >> >> diff --git a/src/hotspot/share/opto/chaitin.hpp b/src/hotspot/share/opto/chaitin.hpp >> index cf02deb6019..4e5333bf181 100644 >> --- a/src/hotspot/share/opto/chaitin.hpp >> +++ b/src/hotspot/share/opto/chaitin.hpp >> @@ -258,7 +258,7 @@ class PhaseIFG : public Phase { >> VectorSet *_yanked; >> >> PhaseIFG( Arena *arena ); >> - void init( uint maxlrg ); >> + void init( uint maxlrg, bool no_adjs = false ); >> >> // Add edge between a and b. Returns true if actually added. >> int add_edge( uint a, uint b ); >> diff --git a/src/hotspot/share/opto/gcm.cpp b/src/hotspot/share/opto/gcm.cpp >> index ebdefe597ff..fefd75a88c5 100644 >> --- a/src/hotspot/share/opto/gcm.cpp >> +++ b/src/hotspot/share/opto/gcm.cpp >> @@ -1704,7 +1704,9 @@ void PhaseCFG::global_code_motion() { >> rm_live.reset_to_mark(); // Reclaim working storage >> IndexSet::reset_memory(C, &live_arena); >> uint node_size = regalloc._lrg_map.max_lrg_id(); >> - ifg.init(node_size); // Empty IFG >> + ifg.init(node_size, true); // Empty IFG >> regalloc.set_ifg(ifg); >> regalloc.set_live(live); >> regalloc.gather_lrg_masks(false); // Collect LRG masks >> diff --git a/src/hotspot/share/opto/ifg.cpp b/src/hotspot/share/opto/ifg.cpp >> index d12698121b9..e42121c2254 100644 >> --- a/src/hotspot/share/opto/ifg.cpp >> +++ b/src/hotspot/share/opto/ifg.cpp >> @@ -42,18 +42,24 @@ >> PhaseIFG::PhaseIFG( Arena *arena ) : Phase(Interference_Graph), _arena(arena) { >> } >> >> -void PhaseIFG::init( uint maxlrg ) { >> +void PhaseIFG::init( uint maxlrg, bool no_adjs ) { >> ... > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - review > - Merge branch 'master' into JDK-8333697 > - fix Thanks. I submitted testing and will report back once it passed. 
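The memory estimate quoted in the message above (one `IndexSet` of 3832 bytes per live range, roughly 470 MB in total for `maxlrg` = 122839) can be reproduced with a few lines of arithmetic. The sketch below mirrors the truncating division used in the email; the function names are hypothetical, and the real allocator may round differently, so treat the result as the same rough estimate the author gave rather than an exact allocation size.

```cpp
#include <cstdint>

// Hypothetical helpers that restate the email's arithmetic:
// one 8-byte pointer per 256-bit block, max_lrg bits per set.
constexpr std::uint64_t index_set_bytes(std::uint64_t max_lrg) {
  return max_lrg / 256 * 8;  // truncating division, as in the email
}

// The _adjs array holds one such set for each of the max_lrg live ranges.
constexpr std::uint64_t adjs_bytes(std::uint64_t max_lrg) {
  return index_set_bytes(max_lrg) * max_lrg;
}
```

With `max_lrg = 122839` this gives 3832 bytes per set and 470,719,048 bytes overall, which matches the "~470 MB" figure when MB is read as 10^6 bytes.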
------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23075#pullrequestreview-2592008463 From epeter at openjdk.org Tue Feb 4 08:52:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 08:52:15 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: <_5bwBRKG8Zu7iywOJZ6WgUb6N4so1sAO6Ua8S0zQU94=.3200ef74-4e50-424b-a3da-637be63e3f0c@github.com> On Mon, 3 Feb 2025 18:11:11 GMT, Jatin Bhateja wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java >> >> Co-authored-by: Emanuel Peter > > Hi @PaulSandoz , @eme64 , All outstanding comments have been addressed, please let me know if there are other comments. @jatin-bhateja Testing is all green :green_circle: Doing a last pass over the code. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2633248273 From epeter at openjdk.org Tue Feb 4 09:03:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 09:03:17 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: On Thu, 30 Jan 2025 11:03:43 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3.
Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java > > Co-authored-by: Emanuel Peter src/hotspot/share/opto/convertnode.hpp line 222: > 220: class ReinterpretS2HFNode : public Node { > 221: public: > 222: ReinterpretS2HFNode(Node* in1) : Node(0, in1) {} Suggestion: ReinterpretS2HFNode(Node* in1) : Node(nullptr, in1) {} Oh, just caught this. I think you should not use `0` here any more, check all other uses. 
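Item 8 of the PR summary quoted above (`HF2S + S2HF = HF`) describes folding a reinterpretation that is immediately undone by its inverse. A minimal sketch of that idea on a toy IR follows. `ToyNode`, `Op` and `fold_reinterpret` are hypothetical names introduced only for illustration; they are not the C2 `Node` classes, and the actual idealization in the patch may be structured differently.

```cpp
// Toy IR, assumed for illustration only (not HotSpot's Node hierarchy).
enum class Op { Value, ReinterpretS2HF, ReinterpretHF2S };

struct ToyNode {
  Op op;
  const ToyNode* in;  // single data input, nullptr for a leaf value
};

// If `n` is a reinterpretation applied directly to its inverse
// (HF2S(S2HF(x)) or S2HF(HF2S(x))), the pair moves the same 16 bits back
// and forth, so the original input x can be used instead. Otherwise `n`
// is returned unchanged.
const ToyNode* fold_reinterpret(const ToyNode* n) {
  if (n->in == nullptr) {
    return n;
  }
  const bool inverse_pair =
      (n->op == Op::ReinterpretHF2S && n->in->op == Op::ReinterpretS2HF) ||
      (n->op == Op::ReinterpretS2HF && n->in->op == Op::ReinterpretHF2S);
  return inverse_pair ? n->in->in : n;
}
```

The fold is safe in this sketch because both operations only relabel the register class of an unchanged bit pattern, which is how item 7 of the summary describes the injected reinterpretation nodes.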
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940762320 From mli at openjdk.org Tue Feb 4 09:12:11 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 4 Feb 2025 09:12:11 GMT Subject: RFR: 8347489: RISC-V: Misaligned memory access with COH [v8] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 07:06:22 GMT, Fei Yang wrote: >> Hi, please consider this change. >> >> We have different base_offset for T_BYTE/T_CHAR (4-byte instead of 8-byte aligned) with COH. This causes misaligned memory accesses for several intrinsics like String.Compare or String.Equals. The reason is that we assume 8-byte alignment and process one 8-byte word starting at the first array element for each iteration in the main loop. As a result, we have performance regressions on platforms with slow misaligned memory accesses like Unmatched and Premier P550 SBCs. >> >> PS: Same issue is there even without COH. base_offset for T_BYTE/T_CHAR is 20 (thus 4-byte aligned) when `UseCompressedClassPointers` is disabled in this case. >> >> Correctness test on linux-riscv64: >> - [x] tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseCompactObjectHeaders") (release) >> - [x] tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:-UseCompactObjectHeaders") (release) >> - [x] hotspot:tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseCompactObjectHeaders") (fastdebug) >> - [x] hotspot:tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:-UseCompactObjectHeaders") (fastdebug) >> >> Performance test on Premier P550 (-XX:+AlwaysPreTouch -Xms8g -Xmx8g): >> >> SPECjbb2005: >> >> 1. Without Patch >> 1.1 -XX:+UseParallelGC -XX:-UseCompactObjectHeaders: 32666 >> 1.2 -XX:+UseParallelGC -XX:+UseCompactObjectHeaders: 27610 >> 1.3 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: 30911 >> 1.4 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: 26008 >> >> 2.
With Patch >> 2.1 -XX:+UseParallelGC -XX:-UseCompactObjectHeaders: 32820 >> 2.2 -XX:+UseParallelGC -XX:+UseCompactObjectHeaders: 34179 >> 2.3 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: 30620 >> 2.4 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: 31936 >> >> >> SPECjbb2015: >> >> 1. Without Patch >> 1.1 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: max-jOPS = 1444, critical-jOPS = 431 >> 1.2 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: max-jOPS = 1092, critical-jOPS = 335 >> >> 2. With Patch >> 2.1 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: max-jOPS = 1452, critical-jOPS = 419 >> 2.2 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: max-jOPS = 1438, critical-jOPS = 477 > > Fei Yang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: > > - Review comment > - Merge branch 'master' into JDK-8347489 > - Merge branch 'master' into JDK-8347489 > - Review comment > - Review comment > - Merge branch 'master' into JDK-8347489 > - Merge branch 'master' into JDK-8347489 > - Comment > - Fix assertions > - Add assertions > - ... and 2 more: https://git.openjdk.org/jdk/compare/7ea176d7...db326650 Thank you! ------------- Marked as reviewed by mli (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23053#pullrequestreview-2592178000 From epeter at openjdk.org Tue Feb 4 09:16:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 09:16:17 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: On Thu, 30 Jan 2025 11:03:43 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. 
Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java > > Co-authored-by: Emanuel Peter Ooops, I found a few more details. But the C++ VM changes look really good now. The Java changes I leave to @PaulSandoz src/hotspot/share/opto/convertnode.cpp line 971: > 969: return true; > 970: default: > 971: return false; Does this cover all cases? What about `FmaHF`? 
src/hotspot/share/opto/convertnode.hpp line 234: > 232: class ReinterpretHF2SNode : public Node { > 233: public: > 234: ReinterpretHF2SNode(Node* in1) : Node(0, in1) {} Suggestion: ReinterpretHF2SNode(Node* in1) : Node(nullptr, in1) {} src/hotspot/share/opto/divnode.cpp line 866: > 864: // Dividing by self is 1. > 865: // IF the divisor is 1, we are an identity on the dividend. > 866: Node* DivHFNode::Identity(PhaseGVN* phase) { Remove line with `isA_Copy`. src/hotspot/share/opto/type.cpp line 1106: > 1104: if (_base == FloatBot || _base == FloatTop) return FLOAT; > 1105: if (_base == HalfFloatTop || _base == HalfFloatBot) return Type::BOTTOM; > 1106: if (_base == DoubleTop || _base == DoubleBot) return Type::BOTTOM; If you already fixing the style, you should use curly braces as I said above ;) src/hotspot/share/opto/type.cpp line 1472: > 1470: //------------------------------meet------------------------------------------- > 1471: // Compute the MEET of two types. It returns a new Type object. > 1472: const Type* TypeH::xmeet(const Type* t) const { Suggestion: //------------------------------xmeet------------------------------------------- // Compute the MEET of two types. It returns a new Type object. 
const Type* TypeH::xmeet(const Type* t) const { ------------- PR Review: https://git.openjdk.org/jdk/pull/22754#pullrequestreview-2592155651 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940766035 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940763403 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940766624 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940771256 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940771662 From aph at openjdk.org Tue Feb 4 09:23:12 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 4 Feb 2025 09:23:12 GMT Subject: RFR: 8349106: Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64 [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 23:56:18 GMT, Jamil Nimeh wrote: >> This enhancement makes a change to the ChaCha20 block function intrinsic on aarch64, moving away from the block parallel implementation and to the quarter-round parallel implementation that was done on x86_64. Assembly language profiling yielded an 11% improvement in throughput. When put together as an intrinsic and hooked into the JCE ChaCha20 cipher, the gains are more modest, somewhere in the 2-4% range depending on job size, but still an improvement. > > Jamil Nimeh has updated the pull request incrementally with one additional commit since the last revision: > > Add explanatory comment and reference for quarter round intrinsic Marked as reviewed by aph (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23397#pullrequestreview-2592206831 From chagedorn at openjdk.org Tue Feb 4 09:49:03 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 4 Feb 2025 09:49:03 GMT Subject: RFR: 8346777: Remove unneeded ReplaceInitAndStrideStrategy and add missing const declarations Message-ID: This patch's main goal is to remove the unneeded `ReplaceInitAndStrideStrategy`. 
We can use the existing `ReplaceInitAndCloneStrideStrategy` instead. The reason behind is that when splitting a loop as part of a loop optimization, we are always keeping the stride the same (i.e. can use `ReplaceInitAndCloneStrideStrategy`) except for when unrolling a loop. In that case, we keep the init value and update the stride instead. This is done with `UpdateStrideForAssertionPredicates`. `ReplaceInitAndCloneStrideStrategy` was used as an intermediate step while applying more refactorings. It's now time to abandon it. I also cover other mostly minor things with this change: - Adding missing `const` declarations. - Renaming `ctrl` -> `control` - Swapping order of parameters Thanks, Christian ------------- Commit messages: - 8346777: Remove unneeded ReplaceInitAndStrideStrategy and add missing const declarations Changes: https://git.openjdk.org/jdk/pull/23434/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23434&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8346777 Stats: 106 lines in 2 files changed: 1 ins; 37 del; 68 mod Patch: https://git.openjdk.org/jdk/pull/23434.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23434/head:pull/23434 PR: https://git.openjdk.org/jdk/pull/23434 From jbhateja at openjdk.org Tue Feb 4 10:05:09 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 4 Feb 2025 10:05:09 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: > Hi All, > > This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) > > Following is the summary of changes included with this patch:- > > 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. > 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. > 3. 
Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. > - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. > 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. > 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. > 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. > 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF > 9. X86 backend implementation for all supported intrinsics. > 10. Functional and Performance validation tests. > > Kindly review the patch and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Fixing typos ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22754/files - new: https://git.openjdk.org/jdk/pull/22754/files/8207c9ff..82a42213 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22754&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22754&range=15-16 Stats: 13 lines in 3 files changed: 0 ins; 0 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/22754.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22754/head:pull/22754 PR: https://git.openjdk.org/jdk/pull/22754 From jbhateja at openjdk.org Tue Feb 4 10:05:11 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 4 Feb 2025 10:05:11 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 18:11:11 GMT, Jatin Bhateja wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java >> >> Co-authored-by: Emanuel Peter > > Hi @PaulSandoz , @eme64 , All outstanding comments have been addressed, please let me know if there are other comments. > @jatin-bhateja Testing is all green :green_circle: Doing a last pass over the code.
Thanks @eme64, looking forward to your approval :-) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2633414710 From jbhateja at openjdk.org Tue Feb 4 10:05:11 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 4 Feb 2025 10:05:11 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 09:03:09 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java >> >> Co-authored-by: Emanuel Peter > > src/hotspot/share/opto/convertnode.cpp line 971: > >> 969: return true; >> 970: default: >> 971: return false; > > Does this cover all cases? What about `FmaHF`? FmaHF is a ternary operation and is intrinsified. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940855109 From roland at openjdk.org Tue Feb 4 10:11:36 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 4 Feb 2025 10:11:36 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v9] In-Reply-To: References: Message-ID: > To optimize a long counted loop and long range checks in a long or int > counted loop, the loop is turned into a loop nest. When the loop has > few iterations, the overhead of having an outer loop whose backedge is > never taken, has a measurable cost. Furthermore, creating the loop > nest usually causes one iteration of the loop to be peeled so > predicates can be set up. If the loop is short running, then it's an > extra iteration that's run with range checks (compared to an int > counted loop with int range checks). 
> > This change doesn't create a loop nest when: > > 1- it can be determined statically at loop nest creation time that the > loop runs for a short enough number of iterations > > 2- profiling reports that the loop runs for no more than ShortLoopIter > iterations (1000 by default). > > For 2-, a guard is added which is implemented as yet another predicate. > > While this change is in principle simple, I ran into a few > implementation issues: > > - while c2 has a way to compute the number of iterations of an int > counted loop, it doesn't have that for long counted loop. The > existing logic for int counted loops promotes values to long to > avoid overflows. I reworked it so it now works for both long and int > counted loops. > > - I added a new deoptimization reason (Reason_short_running_loop) for > the new predicate. Given the number of iterations is narrowed down > by the predicate, the limit of the loop after transformation is a > cast node that's control dependent on the short running loop > predicate. Because once the counted loop is transformed, it is > likely that range check predicates will be inserted and they will > depend on the limit, the short running loop predicate has to be the > one that's further away from the loop entry. Now it is also possible > that the limit before transformation depends on a predicate > (TestShortRunningLongCountedLoopPredicatesClone is an example), so we > can have: new predicates inserted after the transformation that > depend on the casted limit that itself depends on old predicates > added before the transformation. To solve this circular dependency, > parse and assert predicates are cloned between the old predicates > and the loop head. The cloned short running loop parse predicate is > the one that's used to insert the short running loop predicate.
> > - In the case of a long counted loop, the loop is transformed into a > regular loop with a new limit and transformed range checks that's > later turned into an int counted loop. The int ... Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 32 commits: - TestMemorySegment test fix - test wip - Merge branch 'master' into JDK-8342692 - refactor - Merge branch 'master' into JDK-8342692 - Merge branch 'master' into JDK-8342692 - Merge branch 'master' into JDK-8342692 - Merge branch 'master' into JDK-8342692 - review - reviews - ... and 22 more: https://git.openjdk.org/jdk/compare/3f1d9b57...7dd6fde9 ------------- Changes: https://git.openjdk.org/jdk/pull/21630/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21630&range=08 Stats: 1316 lines in 25 files changed: 1254 ins; 16 del; 46 mod Patch: https://git.openjdk.org/jdk/pull/21630.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21630/head:pull/21630 PR: https://git.openjdk.org/jdk/pull/21630 From roland at openjdk.org Tue Feb 4 10:14:18 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 4 Feb 2025 10:14:18 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v8] In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 09:14:55 GMT, Tobias Hartmann wrote: > #21926 is in now. Should I submit testing? Yes, please.
------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-2633443925 From roland at openjdk.org Tue Feb 4 10:14:17 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 4 Feb 2025 10:14:17 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v9] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 10:11:36 GMT, Roland Westrelin wrote: >> To optimize a long counted loop and long range checks in a long or int >> counted loop, the loop is turned into a loop nest. When the loop has >> few iterations, the overhead of having an outer loop whose backedge is >> never taken, has a measurable cost. Furthermore, creating the loop >> nest usually causes one iteration of the loop to be peeled so >> predicates can be set up. If the loop is short running, then it's an >> extra iteration that's run with range checks (compared to an int >> counted loop with int range checks). >> >> This change doesn't create a loop nest when: >> >> 1- it can be determined statically at loop nest creation time that the >> loop runs for a short enough number of iterations >> >> 2- profiling reports that the loop runs for no more than ShortLoopIter >> iterations (1000 by default). >> >> For 2-, a guard is added which is implemented as yet another predicate. >> >> While this change is in principle simple, I ran into a few >> implementation issues: >> >> - while c2 has a way to compute the number of iterations of an int >> counted loop, it doesn't have that for long counted loop. The >> existing logic for int counted loops promotes values to long to >> avoid overflows. I reworked it so it now works for both long and int >> counted loops. >> >> - I added a new deoptimization reason (Reason_short_running_loop) for >> the new predicate. 
Given the number of iterations is narrowed down >> by the predicate, the limit of the loop after transformation is a >> cast node that's control dependent on the short running loop >> predicate. Because once the counted loop is transformed, it is >> likely that range check predicates will be inserted and they will >> depend on the limit, the short running loop predicate has to be the >> one that's further away from the loop entry. Now it is also possible >> that the limit before transformation depends on a predicate >> (TestShortRunningLongCountedLoopPredicatesClone is an example), so we >> can have: new predicates inserted after the transformation that >> depend on the casted limit that itself depends on old predicates >> added before the transformation. To solve this circular dependency, >> parse and assert predicates are cloned between the old predicates >> and the loop head. The cloned short running loop parse predicate is >> the one that's used to insert the short running loop predicate. >> >> - In the case of a long counted loop, the loop is transformed into a >> regular loop with a ... > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 32 commits: > > - TestMemorySegment test fix > - test wip > - Merge branch 'master' into JDK-8342692 > - refactor > - Merge branch 'master' into JDK-8342692 > - Merge branch 'master' into JDK-8342692 > - Merge branch 'master' into JDK-8342692 > - Merge branch 'master' into JDK-8342692 > - review > - reviews > - ... and 22 more: https://git.openjdk.org/jdk/compare/3f1d9b57...7dd6fde9 I tweaked `compiler/loopopts/superword/TestMemorySegment.java`: a couple more tests pass now if `ShortRunningLongLoop` is true.
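For readers unfamiliar with the loop shape discussed in 8342692, the following is a minimal sketch of a long counted loop with a long range check that only runs for a few iterations. The class and method names are hypothetical and are not taken from the PR or its tests; whether C2 actually skips the loop nest for this code depends on profiling and on the flags described above, so treat it as an illustration of the source pattern only.

```java
import java.util.Objects;

public class ShortLongLoopSketch {
    // Long counted loop: the induction variable and the limit are both long.
    // Objects.checkIndex(long, long) acts as a long range check in the body.
    static long sum(int[] data, long start, long stop) {
        long total = 0;
        for (long i = start; i < stop; i++) {
            int index = (int) Objects.checkIndex(i, (long) data.length);
            total += data[index];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = new int[100];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        long result = 0;
        // Repeat the call so the JIT has a chance to compile sum(); each call
        // runs only 100 iterations, well under the 1000 default quoted above.
        for (int rep = 0; rep < 20_000; rep++) {
            result = sum(data, 0, data.length);
        }
        System.out.println(result); // prints 4950
    }
}
```

The trip count here is far below the `ShortLoopIter` default mentioned in the description, which is the situation the new predicate is meant to guard.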
------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-2633443524 From chagedorn at openjdk.org Tue Feb 4 10:26:18 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 4 Feb 2025 10:26:18 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v12] In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 14:02:51 GMT, Daniel Lundén wrote: >> When searching for load anti dependences in GCM, it is not always sufficient to just search starting at the direct initial memory input to the load. Specifically, there are cases when we must also search for anti dependences starting at relevant Phi memory nodes in between the load's early block and the initial memory input's block. Here, "in between" refers to blocks in the dominator tree in between the early and initial memory blocks. >> >> #### Example 1 >> >> Consider the ideal graph below. The initial memory for 183 loadI is 107 Phi and there is an important anti dependency for node 64 membar_release. To discover this anti dependency, we must rather search from 119 Phi which contains overlapping memory slices with 107 Phi. Looking at the ideal graph block view, we see that both 107 Phi and 119 Phi are in the initial memory block (B7) and thus dominate the early block (B20). If we only search from 107 Phi, we fail to add the anti dependency to 64 membar_release and do not force the load to schedule before 64 membar_release as we should. In the block view, we see that the load is actually scheduled in B24 _after_ a number of anti-dependent stores, the first of which is in block B20 (corresponding to the anti dependency on 64 membar_release). The result is the failure we see in this issue (we load the wrong value).
>> >> ![failure-graph-1](https://github.com/user-attachments/assets/e5458646-7a5c-40e1-b1d8-e3f101e29b73) >> ![failure-blocks-1](https://github.com/user-attachments/assets/a0b1f724-0809-4b2f-9feb-93e9c59a5d6a) >> >> #### Example 2 >> >> There are also situations when we need to start searching from Phis that are strictly in between the initial memory block and early block. Consider the ideal graph below. The initial memory for 100 loadI is 18 MachProj, but we also need to search from 76 Phi to find that we must raise the LCA to the last block on the path between 76 Phi and 75 Phi: B9 (= the load's early block). If we do not search from 76 Phi, the load is again likely scheduled too late (in B11 in the example) after anti-dependent stores (the first of which corresponds to 58 membar_release in B10). Note that the block B6 for 76 Phi is strictly dominated by the initial memory block B2 and also strictly dominates the early block B9. >> >> ![failure-graph-2](https://github.com/user-attachments/assets/ede0c299-6251-4ff8-8b84-af40a1ee9e8c) >> ![failure-blocks-2](https://github.com/user-attachments/assets/e5a87e43-b6fe-4fa3-8961-54752f63633e) >> >> ### Cha... > > Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 13 additional commits since the last revision: > > - Merge remote-tracking branch 'upstream/master' into insert-anti-dependences-8333393 > - Reorganize after comments from review > - Rewording of semantics > - Clarifications after comments from Roberto > - Update src/hotspot/share/opto/gcm.cpp > > Co-authored-by: Roberto Castañeda Lozano > - Minor comment updates > - Add more documentation of the change (with examples) in comments > - Add example in comment > - Fix comma splice in comment > - Update after comments > - ...
and 3 more: https://git.openjdk.org/jdk/compare/390c70b4...e5f928cc Nice summary in the code comments! I agree with the proposed point fix solution and with revisiting this later. test/hotspot/jtreg/compiler/codegen/TestGCMLoadPlacement.java line 2: > 1: /* > 2: * Copyright (c) 2024, Oracle and/or its affiliates. All rights reserved. Since you opened the PR in the old year, you can probably just append 2025 here as well instead of replacing it. Suggestion: * Copyright (c) 2024, 2025, Oracle and/or its affiliates. All rights reserved. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22852#pullrequestreview-2592367987 PR Review Comment: https://git.openjdk.org/jdk/pull/22852#discussion_r1940893380 From thartmann at openjdk.org Tue Feb 4 12:04:15 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 4 Feb 2025 12:04:15 GMT Subject: RFR: 8333697: C2: Hit MemLimit in PhaseCFG::global_code_motion [v2] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 14:43:30 GMT, Roland Westrelin wrote: >> I investigated the failure from the `Test.java` that's attached to the >> bug. The failure with this test is only reproducible up to 8334060 >> (Implementation of Late Barrier Expansion for G1) so experiments I >> describe here are from the source code for the commit right before it. >> >> Peak malloc memory usage reported by NMT is: 1.3GB >> >> `PhaseCFG::global_code_motion()`, when `OptoRegScheduling` is true, >> creates a `PhaseIFG` that, when initialized, allocates `_adjs`: a >> `maxlrg` array of `IndexSet`s that can contain up to `maxlrg` elements. >> >> `maxlrg` in this case is 122839. An `IndexSet` is an array of pointers >> to a 256 bit bitset: one `IndexSet` array needs: >> >> >> 122839 / 256 * 8 = 3832 >> >> >> and there are 122839 of them: >> >> >> 3832 * 122839 = ~470 MB >> >> >> It turns out the `PhaseIFG` object when used from >> `PhaseCFG::global_code_motion()` doesn't even use the `_adjs` >> array.
So a patch like: >> >> >> diff --git a/src/hotspot/share/opto/chaitin.hpp b/src/hotspot/share/opto/chaitin.hpp >> index cf02deb6019..4e5333bf181 100644 >> --- a/src/hotspot/share/opto/chaitin.hpp >> +++ b/src/hotspot/share/opto/chaitin.hpp >> @@ -258,7 +258,7 @@ class PhaseIFG : public Phase { >> VectorSet *_yanked; >> >> PhaseIFG( Arena *arena ); >> - void init( uint maxlrg ); >> + void init( uint maxlrg, bool no_adjs = false ); >> >> // Add edge between a and b. Returns true if actually added. >> int add_edge( uint a, uint b ); >> diff --git a/src/hotspot/share/opto/gcm.cpp b/src/hotspot/share/opto/gcm.cpp >> index ebdefe597ff..fefd75a88c5 100644 >> --- a/src/hotspot/share/opto/gcm.cpp >> +++ b/src/hotspot/share/opto/gcm.cpp >> @@ -1704,7 +1704,9 @@ void PhaseCFG::global_code_motion() { >> rm_live.reset_to_mark(); // Reclaim working storage >> IndexSet::reset_memory(C, &live_arena); >> uint node_size = regalloc._lrg_map.max_lrg_id(); >> - ifg.init(node_size); // Empty IFG >> + ifg.init(node_size, true); // Empty IFG >> regalloc.set_ifg(ifg); >> regalloc.set_live(live); >> regalloc.gather_lrg_masks(false); // Collect LRG masks >> diff --git a/src/hotspot/share/opto/ifg.cpp b/src/hotspot/share/opto/ifg.cpp >> index d12698121b9..e42121c2254 100644 >> --- a/src/hotspot/share/opto/ifg.cpp >> +++ b/src/hotspot/share/opto/ifg.cpp >> @@ -42,18 +42,24 @@ >> PhaseIFG::PhaseIFG( Arena *arena ) : Phase(Interference_Graph), _arena(arena) { >> } >> >> -void PhaseIFG::init( uint maxlrg ) { >> +void PhaseIFG::init( uint maxlrg, bool no_adjs ) { >> ... > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - review > - Merge branch 'master' into JDK-8333697 > - fix All clean. 
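The memory estimate quoted in the 8333697 description can be reproduced with a few lines of arithmetic. The sketch below is only a restatement of the two formulas from the message; the class and method names are hypothetical, and it assumes 8-byte pointers and integer division (rounding down), which is what makes the first formula come out to 3832. It counts only the per-`IndexSet` pointer arrays, as the message does.

```java
public class IfgMemoryEstimate {
    // Bytes needed by one IndexSet pointer array: one 8-byte pointer per
    // 256-bit bitset block, using integer division as in the quoted formula.
    static long bytesPerIndexSet(long maxlrg) {
        return maxlrg / 256 * 8;
    }

    // Total bytes for the _adjs array: one IndexSet per live range.
    static long totalBytes(long maxlrg) {
        return bytesPerIndexSet(maxlrg) * maxlrg;
    }

    public static void main(String[] args) {
        long maxlrg = 122_839; // value reported in the message
        System.out.println(bytesPerIndexSet(maxlrg)); // prints 3832
        System.out.println(totalBytes(maxlrg));       // prints 470719048
    }
}
```

The total of 470719048 bytes matches the "~470 MB" figure, and because the cost grows quadratically in `maxlrg`, skipping the allocation when the array is unused is worthwhile for very large methods.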
------------- PR Comment: https://git.openjdk.org/jdk/pull/23075#issuecomment-2633699115 From dlunden at openjdk.org Tue Feb 4 12:29:34 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 4 Feb 2025 12:29:34 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v13] In-Reply-To: References: Message-ID: > When searching for load anti dependences in GCM, it is not always sufficient to just search starting at the direct initial memory input to the load. Specifically, there are cases when we must also search for anti dependences starting at relevant Phi memory nodes in between the load's early block and the initial memory input's block. Here, "in between" refers to blocks in the dominator tree in between the early and initial memory blocks. > > #### Example 1 > > Consider the ideal graph below. The initial memory for 183 loadI is 107 Phi and there is an important anti dependency for node 64 membar_release. To discover this anti dependency, we must rather search from 119 Phi which contains overlapping memory slices with 107 Phi. Looking at the ideal graph block view, we see that both 107 Phi and 119 Phi are in the initial memory block (B7) and thus dominate the early block (B20). If we only search from 107 Phi, we fail to add the anti dependency to 64 membar_release and do not force the load to schedule before 64 membar_release as we should. In the block view, we see that the load is actually scheduled in B24 _after_ a number of anti-dependent stores, the first of which is in block B20 (corresponding to the anti dependency on 64 membar_release). The result is the failure we see in this issue (we load the wrong value).
> > ![failure-graph-1](https://github.com/user-attachments/assets/e5458646-7a5c-40e1-b1d8-e3f101e29b73) > ![failure-blocks-1](https://github.com/user-attachments/assets/a0b1f724-0809-4b2f-9feb-93e9c59a5d6a) > > #### Example 2 > > There are also situations when we need to start searching from Phis that are strictly in between the initial memory block and early block. Consider the ideal graph below. The initial memory for 100 loadI is 18 MachProj, but we also need to search from 76 Phi to find that we must raise the LCA to the last block on the path between 76 Phi and 75 Phi: B9 (= the load's early block). If we do not search from 76 Phi, the load is again likely scheduled too late (in B11 in the example) after anti-dependent stores (the first of which corresponds to 58 membar_release in B10). Note that the block B6 for 76 Phi is strictly dominated by the initial memory block B2 and also strictly dominates the early block B9. > > ![failure-graph-2](https://github.com/user-attachments/assets/ede0c299-6251-4ff8-8b84-af40a1ee9e8c) > ![failure-blocks-2](https://github.com/user-attachments/assets/e5a87e43-b6fe-4fa3-8961-54752f63633e) > > ### Changeset > > - Update `PhaseCFG::insert... 
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/codegen/TestGCMLoadPlacement.java Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22852/files - new: https://git.openjdk.org/jdk/pull/22852/files/e5f928cc..d70337c0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22852&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22852&range=11-12 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/22852.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22852/head:pull/22852 PR: https://git.openjdk.org/jdk/pull/22852 From rcastanedalo at openjdk.org Tue Feb 4 12:35:19 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 4 Feb 2025 12:35:19 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v13] In-Reply-To: References: Message-ID: <029IvJxITvXVeQuuu0Ub0HiX5_WWfKIqNxexY_m0UP0=.eae9a766-9aa1-4159-921d-c20a3e1c6618@github.com> On Tue, 4 Feb 2025 12:29:34 GMT, Daniel Lundén wrote: >> When searching for load anti dependences in GCM, it is not always sufficient to just search starting at the direct initial memory input to the load. Specifically, there are cases when we must also search for anti dependences starting at relevant Phi memory nodes in between the load's early block and the initial memory input's block. Here, "in between" refers to blocks in the dominator tree in between the early and initial memory blocks. >> >> #### Example 1 >> >> Consider the ideal graph below. The initial memory for 183 loadI is 107 Phi and there is an important anti dependency for node 64 membar_release. To discover this anti dependency, we must rather search from 119 Phi which contains overlapping memory slices with 107 Phi.
Looking at the ideal graph block view, we see that both 107 Phi and 119 Phi are in the initial memory block (B7) and thus dominate the early block (B20). If we only search from 107 Phi, we fail to add the anti dependency to 64 membar_release and do not force the load to schedule before 64 membar_release as we should. In the block view, we see that the load is actually scheduled in B24 _after_ a number of anti-dependent stores, the first of which is in block B20 (corresponding to the anti dependency on 64 membar_release). The result is the failure we see in this issue (we load the wrong value). >> >> ![failure-graph-1](https://github.com/user-attachments/assets/e5458646-7a5c-40e1-b1d8-e3f101e29b73) >> ![failure-blocks-1](https://github.com/user-attachments/assets/a0b1f724-0809-4b2f-9feb-93e9c59a5d6a) >> >> #### Example 2 >> >> There are also situations when we need to start searching from Phis that are strictly in between the initial memory block and early block. Consider the ideal graph below. The initial memory for 100 loadI is 18 MachProj, but we also need to search from 76 Phi to find that we must raise the LCA to the last block on the path between 76 Phi and 75 Phi: B9 (= the load's early block). If we do not search from 76 Phi, the load is again likely scheduled too late (in B11 in the example) after anti-dependent stores (the first of which corresponds to 58 membar_release in B10). Note that the block B6 for 76 Phi is strictly dominated by the initial memory block B2 and also strictly dominates the early block B9. >> >> ![failure-graph-2](https://github.com/user-attachments/assets/ede0c299-6251-4ff8-8b84-af40a1ee9e8c) >> ![failure-blocks-2](https://github.com/user-attachments/assets/e5a87e43-b6fe-4fa3-8961-54752f63633e) >> >> ### Cha... 
> > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/compiler/codegen/TestGCMLoadPlacement.java > > Co-authored-by: Christian Hagedorn Marked as reviewed by rcastanedalo (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/22852#pullrequestreview-2592692642 From roland at openjdk.org Tue Feb 4 13:06:24 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 4 Feb 2025 13:06:24 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does Message-ID: This change refactors `RShiftI`/`RShiftL` `Ideal`, `Identity` and `Value` because the `int` and `long` versions are very similar, so that there's no logic duplication. In the process, support for some extra transformations is added to `RShiftL`. I also added some new test cases. ------------- Commit messages: - fix & test Changes: https://git.openjdk.org/jdk/pull/23438/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349361 Stats: 414 lines in 8 files changed: 255 ins; 127 del; 32 mod Patch: https://git.openjdk.org/jdk/pull/23438.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23438/head:pull/23438 PR: https://git.openjdk.org/jdk/pull/23438 From adinn at openjdk.org Tue Feb 4 13:48:41 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 4 Feb 2025 13:48:41 GMT Subject: RFR: 8349102: Test compiler/arguments/TestCodeEntryAlignment.java failed: assert(allocates2(pc)) failed: not in CodeBuffer memory Message-ID: assert(allocates2(pc)) failed: not in CodeBuffer memory The StubGenerator compiler blob runs out of space when TestCodeEntryAlignment is run on macos/x86_64 on an avx2-only CPU. This only happens in the worst case with command line options `-XX:CodeCacheSegmentSize=1024 -XX:CodeEntryAlignment=1024`.
On linux/x86_64 the test succeeds in that worst case when run on an avx512-enabled CPU but with only 980 bytes of headroom. This patch increments the buffer size on x86_64 to ensure both the avx2 and avx3 cases have enough headroom. n.b. the increment has deliberately been made x86_64-specific rather than macos-specific, even though this problem manifests when testing MacOS and does not manifest when testing Linux. The disparity in generated stubs size actually relates to the capabilities of the CPU and is independent of OS. ------------- Commit messages: - 8349102: Test compiler/arguments/TestCodeEntryAlignment.java failed: assert(allocates2(pc)) failed: not in CodeBuffer memory Changes: https://git.openjdk.org/jdk/pull/23439/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23439&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349102 Stats: 3 lines in 2 files changed: 0 ins; 2 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23439.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23439/head:pull/23439 PR: https://git.openjdk.org/jdk/pull/23439 From adinn at openjdk.org Tue Feb 4 13:51:13 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 4 Feb 2025 13:51:13 GMT Subject: RFR: 8349102: Test compiler/arguments/TestCodeEntryAlignment.java failed: assert(allocates2(pc)) failed: not in CodeBuffer memory In-Reply-To: References: Message-ID: <8L4H2FH6qe-bAqs6_n3AajEgekpJeul_gvOoZ7BUfgk=.395f7c3f-3d5a-4b5d-bea9-10ae6f692f1a@github.com> On Tue, 4 Feb 2025 13:44:10 GMT, Andrew Dinn wrote: > assert(allocates2(pc)) failed: not in CodeBuffer memory > > The StubGenerator compiler blob runs out of space when TestCodeEntryAlignment is run on macos/x86_64 on an avx2-only CPU. This only happens in the worst case with command line options `-XX:CodeCacheSegmentSize=1024 -XX:CodeEntryAlignment=1024`. > > On linux/x86_64 the test succeeds in that worst case when run on an avx512-enabled CPU but with only 980 bytes of headroom.
> > This patch increments the buffer size on x86_64 to ensure both the avx2 and avx3 cases have enough headroom. > > n.b. the increment has deliberately been made x86_64-specific rather than macos-specific, even though this problem manifests when testing MacOS and does not manifest when testing Linux. The disparity in generated stubs size actually relates to the capabilities of the CPU and is independent of OS. @TobiHartmann I estimated the required increment at 4000 bytes. Let's see if it is enough. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23439#issuecomment-2634023369 From fyang at openjdk.org Tue Feb 4 14:06:17 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 4 Feb 2025 14:06:17 GMT Subject: RFR: 8347489: RISC-V: Misaligned memory access with COH [v8] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 07:06:22 GMT, Fei Yang wrote: >> Hi, please consider this change. >> >> We have different base_offset for T_BYTE/T_CHAR (4-byte instead of 8-byte aligned) with COH. This causes misaligned memory accesses for several intrinsics like String.Compare or String.Equals. The reason is that we assume 8-byte alignment and process one 8-byte word starting at the first array element for each iteration in the main loop. As a result, we have performance regressions on platforms with slow misaligned memory accesses like Unmatched and Premier P550 SBCs. >> >> PS: Same issue is there even without COH. base_offset for T_BYTE/T_CHAR is 20 (thus 4-byte aligned) when `UseCompressedClassPointers` is disabled in this case.
>> >> Correctness test on linux-riscv64: >> - [x] tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseCompactObjectHeaders") (release) >> - [x] tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:-UseCompactObjectHeaders") (release) >> - [x] hotspot:tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseCompactObjectHeaders") (fastdebug) >> - [x] hotspot:tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:-UseCompactObjectHeaders") (fastdebug) >> >> Performance test on Premier P550 (-XX:+AlwaysPreTouch -Xms8g -Xmx8g): >> >> SPECjbb2005: >> >> 1. Without Patch >> 1.1 -XX:+UseParallelGC -XX:-UseCompactObjectHeaders: 32666 >> 1.2 -XX:+UseParallelGC -XX:+UseCompactObjectHeaders: 27610 >> 1.3 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: 30911 >> 1.4 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: 26008 >> >> 2. With Patch >> 2.1 -XX:+UseParallelGC -XX:-UseCompactObjectHeaders: 32820 >> 2.2 -XX:+UseParallelGC -XX:+UseCompactObjectHeaders: 34179 >> 2.3 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: 30620 >> 2.4 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: 31936 >> >> >> SPECjbb2015: >> >> 1. Without Patch >> 1.1 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: max-jOPS = 1444, critical-jOPS = 431 >> 1.2 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: max-jOPS = 1092, critical-jOPS = 335 >> >> 2. With Patch >> 2.1 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: max-jOPS = 1452, critical-jOPS = 419 >> 2.2 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: max-jOPS = 1438, critical-jOPS = 477 > > Fei Yang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: > > - Review comment > - Merge branch 'master' into JDK-8347489 > - Merge branch 'master' into JDK-8347489 > - Review comment > - Review comment > - Merge branch 'master' into JDK-8347489 > - Merge branch 'master' into JDK-8347489 > - Comment > - Fix assertions > - Add assertions > - ... 
and 2 more: https://git.openjdk.org/jdk/compare/7ea176d7...db326650 Thanks all for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23053#issuecomment-2634059393 From fyang at openjdk.org Tue Feb 4 14:06:18 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 4 Feb 2025 14:06:18 GMT Subject: Integrated: 8347489: RISC-V: Misaligned memory access with COH In-Reply-To: References: Message-ID: <2iCyXLBz372aOTIe6S2O4-rh8LELWyUfFzPrUpj0oeY=.c840179f-cdc1-48e2-84fb-ac4bd45dadb9@github.com> On Sun, 12 Jan 2025 03:45:45 GMT, Fei Yang wrote: > Hi, please consider this change. > > We have different base_offset for T_BYTE/T_CHAR (4-byte instead of 8-byte aligned) with COH. This causes misaligned memory accesses for several intrinsics like String.Compare or String.Equals. The reason is that we assume 8-byte alignment and process one 8-byte word starting at the first array element for each iteration in the main loop. As a result, we have performance regressions on platforms with slow misaligned memory accesses like Unmatched and Premier P550 SBCs. > > PS: Same issue is there even without COH. base_offset for T_BYTE/T_CHAR is 20 (thus 4-byte aligned) when `UseCompressedClassPointers` is disabled in this case. > > Correctness test on linux-riscv64: > - [x] tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseCompactObjectHeaders") (release) > - [x] tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:-UseCompactObjectHeaders") (release) > - [x] hotspot:tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseCompactObjectHeaders") (fastdebug) > - [x] hotspot:tier1 (TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:-UseCompactObjectHeaders") (fastdebug) > > Performance test on Premier P550 (-XX:+AlwaysPreTouch -Xms8g -Xmx8g): > > SPECjbb2005: > > 1.
Without Patch > 1.1 -XX:+UseParallelGC -XX:-UseCompactObjectHeaders: 32666 > 1.2 -XX:+UseParallelGC -XX:+UseCompactObjectHeaders: 27610 > 1.3 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: 30911 > 1.4 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: 26008 > > 2. With Patch > 2.1 -XX:+UseParallelGC -XX:-UseCompactObjectHeaders: 32820 > 2.2 -XX:+UseParallelGC -XX:+UseCompactObjectHeaders: 34179 > 2.3 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: 30620 > 2.4 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: 31936 > > > SPECjbb2015: > > 1. Without Patch > 1.1 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: max-jOPS = 1444, critical-jOPS = 431 > 1.2 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: max-jOPS = 1092, critical-jOPS = 335 > > 2. With Patch > 2.1 -XX:+UseG1GC -XX:-UseCompactObjectHeaders: max-jOPS = 1452, critical-jOPS = 419 > 2.2 -XX:+UseG1GC -XX:+UseCompactObjectHeaders: max-jOPS = 1438, critical-jOPS = 477 This pull request has now been integrated. Changeset: e91a6ec4 Author: Fei Yang URL: https://git.openjdk.org/jdk/commit/e91a6ec49c80ea53bb6f1eb43c924f188803de7e Stats: 132 lines in 3 files changed: 106 ins; 2 del; 24 mod 8347489: RISC-V: Misaligned memory access with COH Reviewed-by: mli, vkempik ------------- PR: https://git.openjdk.org/jdk/pull/23053 From chagedorn at openjdk.org Tue Feb 4 14:09:15 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 4 Feb 2025 14:09:15 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v13] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 12:29:34 GMT, Daniel Lundén wrote: >> When searching for load anti dependences in GCM, it is not always sufficient to just search starting at the direct initial memory input to the load. Specifically, there are cases when we must also search for anti dependences starting at relevant Phi memory nodes in between the load's early block and the initial memory input's block.
Here, "in between" refers to blocks in the dominator tree in between the early and initial memory blocks. >> >> #### Example 1 >> >> Consider the ideal graph below. The initial memory for 183 loadI is 107 Phi and there is an important anti dependency for node 64 membar_release. To discover this anti dependency, we must rather search from 119 Phi which contains overlapping memory slices with 107 Phi. Looking at the ideal graph block view, we see that both 107 Phi and 119 Phi are in the initial memory block (B7) and thus dominate the early block (B20). If we only search from 107 Phi, we fail to add the anti dependency to 64 membar_release and do not force the load to schedule before 64 membar_release as we should. In the block view, we see that the load is actually scheduled in B24 _after_ a number of anti-dependent stores, the first of which is in block B20 (corresponding to the anti dependency on 64 membar_release). The result is the failure we see in this issue (we load the wrong value). >> >> ![failure-graph-1](https://github.com/user-attachments/assets/e5458646-7a5c-40e1-b1d8-e3f101e29b73) >> ![failure-blocks-1](https://github.com/user-attachments/assets/a0b1f724-0809-4b2f-9feb-93e9c59a5d6a) >> >> #### Example 2 >> >> There are also situations when we need to start searching from Phis that are strictly in between the initial memory block and early block. Consider the ideal graph below. The initial memory for 100 loadI is 18 MachProj, but we also need to search from 76 Phi to find that we must raise the LCA to the last block on the path between 76 Phi and 75 Phi: B9 (= the load's early block). If we do not search from 76 Phi, the load is again likely scheduled too late (in B11 in the example) after anti-dependent stores (the first of which corresponds to 58 membar_release in B10). Note that the block B6 for 76 Phi is strictly dominated by the initial memory block B2 and also strictly dominates the early block B9. 
>> >> ![failure-graph-2](https://github.com/user-attachments/assets/ede0c299-6251-4ff8-8b84-af40a1ee9e8c) >> ![failure-blocks-2](https://github.com/user-attachments/assets/e5a87e43-b6fe-4fa3-8961-54752f63633e) >> >> ### Cha... > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/compiler/codegen/TestGCMLoadPlacement.java > > Co-authored-by: Christian Hagedorn Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/22852#pullrequestreview-2592972757 From bkilambi at openjdk.org Tue Feb 4 14:45:09 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 4 Feb 2025 14:45:09 GMT Subject: RFR: 8348659: AArch64: IR rule failure with compiler/loopopts/superword/TestSplitPacks.java In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 06:48:42 GMT, Emanuel Peter wrote: >> Hi @eme64 , can you please review this patch as well? Thanks :) > > @Bhavana-Kilambi The patch looks good to me. I've launched some testing, just in case. Please ping me in 24h for an update ;) Hi @eme64 , any update on the testing, please? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23385#issuecomment-2634188034 From rriggs at openjdk.org Tue Feb 4 15:06:18 2025 From: rriggs at openjdk.org (Roger Riggs) Date: Tue, 4 Feb 2025 15:06:18 GMT Subject: [jdk24] RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 18:41:44 GMT, Jaikiran Pai wrote: > Can I please get a review of this backport of https://github.com/openjdk/jdk/pull/23420 into jdk24? > > This proposes to bring in those same backouts into `jdk24` to prevent the issue noted in that PR description. jdk24 is in rampdown and this backport will require an approval. 
An approval request has been raised in https://bugs.openjdk.org/browse/JDK-8349183?focusedId=14746841&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14746841. > > This backport into jdk24 wasn't clean due to a trivial merge conflict in the `StringLatin1.java` file. That merge conflict was manually resolved (just like it was done against mainline). The git commands used to create this backport against jdk24 branch are: > > > > git cherry-pick --no-commit 618c5eb27b4c719afd577b690e6bcb21a45fcb0d > > git commit -m 'Backport 618c5eb27b4c719afd577b690e6bcb21a45fcb0d' > > > tier1, tier2 and tier3 testing is currently in progress with this change. Looks good, thanks. A second reviewer might be useful. ------------- Marked as reviewed by rriggs (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23425#pullrequestreview-2593158723 From duke at openjdk.org Tue Feb 4 15:59:09 2025 From: duke at openjdk.org (Abdelhak Zaaim) Date: Tue, 4 Feb 2025 15:59:09 GMT Subject: RFR: 8349102: Test compiler/arguments/TestCodeEntryAlignment.java failed: assert(allocates2(pc)) failed: not in CodeBuffer memory In-Reply-To: References: Message-ID: <9CeJz-_tg_wKxzjynNaO7g4cNo5P-yC_o8BSn4LXMvY=.75110d5b-63b4-485d-8113-4d468c3a3b67@github.com> On Tue, 4 Feb 2025 13:44:10 GMT, Andrew Dinn wrote: > …assert(allocates2(pc)) failed: not in CodeBuffer memory > > The StubGenerator compiler blob runs out of space when TestCodeEntryAlignment is run on macos/x86_64 on an avx2-only CPU. This only happens in the worst case with command line options `-XX:CodeCacheSegmentSize=1024 -XX:CodeEntryAlignment=1024`. > > On linux/x86_64 the test succeeds in that worst case when run on an avx512-enabled CPU but with only 980 bytes of headroom. > > This patch increments the buffer size on x86_64 to ensure both the avx2 and avx3 cases have enough headroom. > > n.b. 
the increment has deliberately been made x86_64-specific rather than macos-specific, even though this problem manifests when testing MacOS and does not manifest when testing Linux. The disparity in generated stubs size actually relates to the capabilities of the CPU and is independent of OS. Marked as reviewed by abdelhak-zaaim at github.com (no known OpenJDK username). ------------- PR Review: https://git.openjdk.org/jdk/pull/23439#pullrequestreview-2593317481 From jnimeh at openjdk.org Tue Feb 4 16:31:17 2025 From: jnimeh at openjdk.org (Jamil Nimeh) Date: Tue, 4 Feb 2025 16:31:17 GMT Subject: Integrated: 8349106: Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64 In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 16:48:09 GMT, Jamil Nimeh wrote: > This enhancement makes a change to the ChaCha20 block function intrinsic on aarch64, moving away from the block parallel implementation and to the quarter-round parallel implementation that was done on x86_64. Assembly language profiling yielded an 11% improvement in throughput. When put together as an intrinsic and hooked into the JCE ChaCha20 cipher, the gains are more modest, somewhere in the 2-4% range depending on job size, but still an improvement. This pull request has now been integrated. Changeset: ee4caa41 Author: Jamil Nimeh URL: https://git.openjdk.org/jdk/commit/ee4caa4180e76911ee75148583c2923f847f8605 Stats: 166 lines in 1 file changed: 71 ins; 1 del; 94 mod 8349106: Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64 Reviewed-by: aph ------------- PR: https://git.openjdk.org/jdk/pull/23397 From never at openjdk.org Tue Feb 4 16:36:26 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 4 Feb 2025 16:36:26 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash Message-ID: This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. 
------------- Commit messages: - 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash Changes: https://git.openjdk.org/jdk/pull/23444/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23444&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349374 Stats: 8 lines in 1 file changed: 5 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23444.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23444/head:pull/23444 PR: https://git.openjdk.org/jdk/pull/23444 From qamai at openjdk.org Tue Feb 4 16:49:10 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 4 Feb 2025 16:49:10 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets [v2] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 06:14:11 GMT, Jatin Bhateja wrote: >> Could you instead do this by trying to transform `AndI(MoveF2I(x), MoveF2I(y))` into `AndF(x, y)` instead? > >> Could you instead do this by trying to transform `AndI(MoveF2I(x), MoveF2I(y))` into `AndF(x, y)` instead? > > @merykitty , this patch does not break existing IR invariants as multiple targets already emit efficient instruction sequences for it, we have just improved upon the x86-backed implementation. > ![image](https://github.com/user-attachments/assets/61845793-ca3a-4ad2-8ee8-210f8a1bc60d) > > > Introducing another new IR "AndF" will again need changes in auto-vectorizer. @jatin-bhateja Doing the transformation to `AndF` would be a more general solution and thus better. > Introducing another new IR "AndF" will again need changes in auto-vectorizer. But currently, `CopySign` and `MoveF2I` are not vectorized anyway so we can do the vectorization of `AndF` in a separate patch without much hassle. `AndF` is vectorized into existing `AndV` nicely so it is not a too complicated work. 
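To make the suggested rewrite concrete, here is a minimal Java sketch of the bit-level identity behind it. It is an illustration only, not code from the PR: `AndF` is a proposed C2 IR node rather than a Java API, so the sketch models the same masking with `Float.floatToRawIntBits` and `Float.intBitsToFloat`, and the class and method names (`CopySignBits`, `copySignViaBits`) are hypothetical. It shows that `copySign` reduces to bitwise ANDs on the raw bits of its two operands, which is the shape `AndI(MoveF2I(x), ...)` takes in the ideal graph.

```java
public class CopySignBits {
    // IEEE 754 single precision: bit 31 is the sign, bits 0..30 are exponent and mantissa.
    private static final int SIGN_MASK = 0x8000_0000;
    private static final int MAGNITUDE_MASK = 0x7FFF_FFFF;

    // Hypothetical helper: copySign expressed purely as integer bit operations.
    public static float copySignViaBits(float magnitude, float sign) {
        int magnitudeBits = Float.floatToRawIntBits(magnitude) & MAGNITUDE_MASK;
        int signBit = Float.floatToRawIntBits(sign) & SIGN_MASK;
        return Float.intBitsToFloat(magnitudeBits | signBit);
    }

    public static void main(String[] args) {
        // NaN sign arguments are left out on purpose: Math.copySign does not
        // promise a particular treatment of them.
        float[] samples = {0.0f, -0.0f, 1.5f, -2.25f, Float.POSITIVE_INFINITY, Float.MIN_VALUE};
        for (float magnitude : samples) {
            for (float sign : samples) {
                int expected = Float.floatToRawIntBits(Math.copySign(magnitude, sign));
                int actual = Float.floatToRawIntBits(copySignViaBits(magnitude, sign));
                if (expected != actual) {
                    throw new AssertionError("mismatch for " + magnitude + ", " + sign);
                }
            }
        }
        System.out.println("copySignViaBits agrees with Math.copySign on all sample pairs");
    }
}
```

Whether C2 should represent the masking as `AndI` over `MoveF2I` or as a dedicated `AndF` is exactly the open question in this thread; the sketch takes no position on that.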
------------- PR Comment: https://git.openjdk.org/jdk/pull/23386#issuecomment-2634524019 From qamai at openjdk.org Tue Feb 4 16:54:13 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 4 Feb 2025 16:54:13 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets [v2] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 06:14:11 GMT, Jatin Bhateja wrote: > this patch does not break existing IR invariants Also, what invariant can be broken by transforming `AndI(MoveF2I(x), MoveF2I(y)` into `MoveF2I(AndF(x, y))`? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23386#issuecomment-2634535019 From qamai at openjdk.org Tue Feb 4 16:56:17 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 4 Feb 2025 16:56:17 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v17] In-Reply-To: References: Message-ID: On Sun, 2 Feb 2025 21:36:03 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. >> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: >> >> >> ((base + index << 8) + 256) & 255 >> => MulNode (this PR) >> (base + index << 8) & 255 >> => MulNode (PR #6697) >> base & 255 (loop invariant) >> >> >> Implementation notes: >> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. 
>> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ >> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > jlong, not long LGTM ------------- Marked as reviewed by qamai (Committer). PR Review: https://git.openjdk.org/jdk/pull/22856#pullrequestreview-2593484743 From dnsimon at openjdk.org Tue Feb 4 16:58:09 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 4 Feb 2025 16:58:09 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 16:31:50 GMT, Tom Rodriguez wrote: > This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java line 179: > 177: } > 178: > 179: if (UnsafeAccess.UNSAFE.getLong(getFailedSpeculationsAddress()) != 0) { It's still possible for `getFailedSpeculationsAddress()` to return 0 (i.e. when `managesFailedSpeculations` is `false`). 
So I think this should be: diff --git a/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java b/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java index fd46e281c3b..a861c00d77d 100644 --- a/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java +++ b/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java @@ -171,8 +171,9 @@ public String toString() { @Override public void collectFailedSpeculations() { - if (failedSpeculationsAddress != 0 && UnsafeAccess.UNSAFE.getLong(failedSpeculationsAddress) != 0) { - failedSpeculations = compilerToVM().getFailedSpeculations(failedSpeculationsAddress, failedSpeculations); + long address = getFailedSpeculationsAddress(); + if (address != 0 && UnsafeAccess.UNSAFE.getLong(address) != 0) { + failedSpeculations = compilerToVM().getFailedSpeculations(address, failedSpeculations); assert failedSpeculations.getClass() == byte[][].class; } } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23444#discussion_r1941551882 From chagedorn at openjdk.org Tue Feb 4 17:15:18 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 4 Feb 2025 17:15:18 GMT Subject: RFR: 8346777: Remove unneeded ReplaceInitAndStrideStrategy and add missing const declarations In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 09:43:12 GMT, Christian Hagedorn wrote: > This patch's main goal is to remove the unneeded `ReplaceInitAndStrideStrategy`. We can use the existing `ReplaceInitAndCloneStrideStrategy` instead. The reason behind is that when splitting a loop as part of a loop optimization, we are always keeping the stride the same (i.e. can use `ReplaceInitAndCloneStrideStrategy`) except for when unrolling a loop. In that case, we keep the init value and update the stride instead. This is done with `UpdateStrideForAssertionPredicates`. 
`ReplaceInitAndCloneStrideStrategy` was used as an intermediate step while applying more refactorings. It's now time to abandon it. > > I also cover other mostly minor things with this change: > - Adding missing `const` declarations. > - Renaming `ctrl` -> `control` > - Swapping order of parameters > > Thanks, > Christian Some last minute updates caused some failures. Moving to draft to look into it again tomorrow. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23434#issuecomment-2634582083 From kvn at openjdk.org Tue Feb 4 17:30:19 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 4 Feb 2025 17:30:19 GMT Subject: RFR: 8348572: C2 compilation asserts due to unexpected irreducible loop [v2] In-Reply-To: <_3Sc6qth_bahQb2eOsLLf0mb9ATrPHGwM6GedAOAUyU=.bc9fc68f-d573-48af-a4ab-f9260422c460@github.com> References: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> <_3Sc6qth_bahQb2eOsLLf0mb9ATrPHGwM6GedAOAUyU=.bc9fc68f-d573-48af-a4ab-f9260422c460@github.com> Message-ID: On Fri, 31 Jan 2025 06:11:41 GMT, Emanuel Peter wrote: >> A quick summary: >> - In [JDK-8280126](https://bugs.openjdk.org/browse/JDK-8280126), we decided that we are only going to allow irreducible loops that were detected at parsing, and we can thus restrict optimizations to reducible loops which would be difficult to do correct with irreducible loops. That's why we added that assert that checks that no new irreducible loop shows up during compilation. >> - Problem: we use `split_if` for `IfNode::Ideal_common` to split through a Region that is loop-head, and the splitting of the Region introduces a second loop entry -> irreducible loop. 
>> >> Before `split_if`: >> ![image](https://github.com/user-attachments/assets/01bc78fa-7fed-4a8f-b6f4-078dac9b5dc4) >> >> After `split_if`: >> ![image](https://github.com/user-attachments/assets/1e3bd08e-b76d-4e7f-813e-27a5a22cb2bd) >> >> >> - We have the `split_if` for `IfNode::Ideal_common` to do split-if on straight-line code. But we currently execute this before loop-opts, and so we don't know if the region we split through is actually a loop head. We guard against LoopNode, but a Region only becomes a LoopNode in loop-opts. >> - We also have split-if in loop-opts, which is more careful about splitting through loop-heads. >> - Just removing the straight-line split-if probably leads to a regression, as the loop-opts version only executes if there are loops for example. >> - We could consider delaying the straight-line split-if until after loop-opts. But I don't know if that could lead to regressions in any way. >> >> I discussed this temporary solution with @TobiHartmann : >> - We would like [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) to be unblocked for @shipilev . >> - Convert the assert into a bailout-check, so we are sure we behave correctly in product. Compiling with irreducible loops behaves correctly in almost all cases, but there could be exceptions. >> - For now, have the assert behind a Verify flag, so that [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) is unblocked. Later, we can remove the Verify flag and alway enable the assert again. >> - This fix also looks easier to backport. >> >> ----------------------- >> >> The attached regression test now does **NOT** fail by default, but rather silently bails out of compilation. >> >> With the new debug flag `-XX:+VerifyNoNewIrreducibleLoops`, we still hit the assert, as expected: >> >> # Internal Error (/oracle-work/jdk-fork0/open/src/hotspot/share/opto/loopnode.cpp:5636), pid=3698055, tid=3698072 >> # asser... 
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > update for Vladimir Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23363#pullrequestreview-2593565697 From kvn at openjdk.org Tue Feb 4 17:30:19 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 4 Feb 2025 17:30:19 GMT Subject: RFR: 8348572: C2 compilation asserts due to unexpected irreducible loop [v2] In-Reply-To: <8B6TY4A1E4255-ggl7DI31oerizDCuI6Jlf6vo98oUA=.afc1fc06-ea89-461a-a992-bf22afa2ea6f@github.com> References: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> <-c7xXeuSN-6QD-k6MA1-7Cv17ztENnt7Q0U6PprRrf0=.afd866ca-f6d8-4f09-8c66-ada7bd38c67f@github.com> <8B6TY4A1E4255-ggl7DI31oerizDCuI6Jlf6vo98oUA=.afc1fc06-ea89-461a-a992-bf22afa2ea6f@github.com> Message-ID: <_xp8xDtwfGz81LSKbTY08BciQnxuN7x6jiz68eSTCRo=.11c92a2e-ffef-478c-a96b-4c40248352eb@github.com> On Tue, 4 Feb 2025 07:20:22 GMT, Emanuel Peter wrote: >> First, I am fine with this "band-aid" change. I understand that it simple replaces assert with bailout which is fine. >> But I am trying to understand what it does. >> >> There are few states when we come to this part of code: >> - `l` is or not marked as irreducible >> - `m` is or not marked with MaybeIrreducibleEntry (is it set only for not Loop?) >> - `m` is or not Loop >> >> So we have 8 combinations. I would like to hear reasons in which cases we should bailout and in which not. > > @vnkozlov Does that help you, or do you have more questions? Yes, thank you for explaining this to me. I think I got it finally. So default `Region` state is `NeverIrreducibleEntry` and it set to `MaybeIrreducibleEntry` when we find it inside irreducible loop during parsing. And that is the only valid state (MaybeIrreducibleEntry) we allow in this part of code. Yes, `Loop` node can't be irreducible and we should not allow it here too. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23363#discussion_r1941597842 From liach at openjdk.org Tue Feb 4 17:35:18 2025 From: liach at openjdk.org (Chen Liang) Date: Tue, 4 Feb 2025 17:35:18 GMT Subject: [jdk24] RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null In-Reply-To: References: Message-ID: <4rawtnCH36CRxVore9tgTbpYYRiEcLo6R49z1bBpJGQ=.bbe418af-c9f0-4287-aa98-794a54572410@github.com> On Mon, 3 Feb 2025 18:41:44 GMT, Jaikiran Pai wrote: > Can I please get a review of this backport of https://github.com/openjdk/jdk/pull/23420 into jdk24? > > This proposes to bring in those same backouts into `jdk24` to prevent the issue noted in that PR description. jdk24 is in rampdown and this backport will require an approval. An approval request has been raised in https://bugs.openjdk.org/browse/JDK-8349183?focusedId=14746841&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14746841. > > This backport into jdk24 wasn't clean due to a trivial merge conflict in the `StringLatin1.java` file. That merge conflict was manually resolved (just like it was done against mainline). The git commands used to create this backport against jdk24 branch are: > > > > git cherry-pick --no-commit 618c5eb27b4c719afd577b690e6bcb21a45fcb0d > > git commit -m 'Backport 618c5eb27b4c719afd577b690e6bcb21a45fcb0d' > > > tier1, tier2 and tier3 testing is currently in progress with this change. Marked as reviewed by liach (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/23425#pullrequestreview-2593574823 From never at openjdk.org Tue Feb 4 17:41:14 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 4 Feb 2025 17:41:14 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 16:54:58 GMT, Doug Simon wrote: >> This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. > > src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java line 179: > >> 177: } >> 178: >> 179: if (UnsafeAccess.UNSAFE.getLong(getFailedSpeculationsAddress()) != 0) { > > It's still possible for `getFailedSpeculationsAddress()` to return 0 (i.e. when `managesFailedSpeculations` is `false`). So I think this should be: > > diff --git a/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java b/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java > index fd46e281c3b..a861c00d77d 100644 > --- a/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java > +++ b/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java > @@ -171,8 +171,9 @@ public String toString() { > > @Override > public void collectFailedSpeculations() { > - if (failedSpeculationsAddress != 0 && UnsafeAccess.UNSAFE.getLong(failedSpeculationsAddress) != 0) { > - failedSpeculations = compilerToVM().getFailedSpeculations(failedSpeculationsAddress, failedSpeculations); > + long address = getFailedSpeculationsAddress(); > + if (address != 0 && UnsafeAccess.UNSAFE.getLong(address) != 0) { > + failedSpeculations = compilerToVM().getFailedSpeculations(address, failedSpeculations); > assert failedSpeculations.getClass() == byte[][].class; > } > } I'm filtering out 0 above this line and `getFailedSpeculationsAddress()` can't return 0 if `failedSpeculationsAddress` is 
already non-zero. `failedSpeculationsAddress` also can't be 0 if `managesFailedSpeculations` is false since we throw `IllegalArgumentException` in that case. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23444#discussion_r1941612206 From jbhateja at openjdk.org Tue Feb 4 17:44:09 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 4 Feb 2025 17:44:09 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets [v2] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 06:14:11 GMT, Jatin Bhateja wrote: >> Could you instead do this by trying to transform `AndI(MoveF2I(x), MoveF2I(y))` into `AndF(x, y)` instead? > >> Could you instead do this by trying to transform `AndI(MoveF2I(x), MoveF2I(y))` into `AndF(x, y)` instead? > > @merykitty , this patch does not break existing IR invariants as multiple targets already emit efficient instruction sequences for it, we have just improved upon the x86-backed implementation. > ![image](https://github.com/user-attachments/assets/61845793-ca3a-4ad2-8ee8-210f8a1bc60d) > > > Introducing another new IR "AndF" will again need changes in auto-vectorizer. > @jatin-bhateja Doing the transformation to `AndF` would be a more general solution and thus better. > > > Introducing another new IR "AndF" will again need changes in auto-vectorizer. > > But currently, `CopySign` and `MoveF2I` are not vectorized anyway so we can do the vectorization of `AndF` in a separate patch without much hassle. `AndF` is vectorized into existing `AndV` nicely so it is not a too complicated work. Yes, I have a follow-up patch to auto-vectorized CopySign. > > this patch does not break existing IR invariants > > Also, what invariant can be broken by transforming `AndI(MoveF2I(x), MoveF2I(y)` into `MoveF2I(AndF(x, y))`? 
Hi @merykitty , I meant that in the context of CopySign, targets emit efficient instruction sequences for existing IR (CopySignF/D), this patch simply tuned x86 backend implementation to improve performance. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23386#issuecomment-2634649969 From jbhateja at openjdk.org Tue Feb 4 17:50:21 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 4 Feb 2025 17:50:21 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v18] In-Reply-To: References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> Message-ID: On Fri, 31 Jan 2025 07:13:14 GMT, Emanuel Peter wrote: >> Hi @eme64 , I have lowered the feature check to IR annotation for now. > > @jatin-bhateja Launched testing for Commit 17 / v22. Hi @eme64 , Kindly share the results of your test runs. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22863#issuecomment-2634662539 From kvn at openjdk.org Tue Feb 4 18:02:15 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 4 Feb 2025 18:02:15 GMT Subject: RFR: 8348570: CTW: Expose the code hidden by uncommon traps [v2] In-Reply-To: References: Message-ID: On Fri, 24 Jan 2025 17:13:32 GMT, Aleksey Shipilev wrote: >> We have been looking at some related compiler behaviors, and realized that in the absence of profiling data, C2 routinely uncommon-traps a lot of code that is presumed to be never executed. This apparently is a norm in CTW tests: CTW runners never execute code, and so only the most basic java.base classes are having any profile. This seems to limit the scope of CTW testing. >> >> I think we need to run CTW in the mode that exposes more code to the compiler optimizations. >> >> Case in point: [JDK-8348572](https://bugs.openjdk.org/browse/JDK-8348572), which reliably fails with more aggressive compilation mode. **We cannot integrate this PR until that bug is fixed**. 
But we can discuss if this makes sense, and/or we want some other options included to expand CTW testing. > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Also do markMethodProfiled for extra scope Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23296#pullrequestreview-2593649274 From epeter at openjdk.org Tue Feb 4 18:16:20 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 18:16:20 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 13:02:47 GMT, Roland Westrelin wrote: > This change refactors `RShiftI`/`RshiftL` `Ideal`, `Identity` and > `Value` because the `int` and `long` versions are very similar and so > there's no logic duplication. In the process, support for some extra > transformations is added to `RShiftL`. I also added some new test > cases. Drive-by code style comment ;) src/hotspot/share/opto/mulnode.cpp line 1311: > 1309: } > 1310: > 1311: Node *RShiftNode::IdealIL(PhaseGVN *phase, bool can_reshape, BasicType bt) { Drive-by: fix position of `*` src/hotspot/share/opto/mulnode.cpp line 1314: > 1312: // Inputs may be TOP if they are dead. > 1313: const TypeInteger* t1 = phase->type(in(1))->isa_integer(bt); > 1314: if (!t1) return NodeSentinel; // Left input is an integer Drive-by: don't use implicit null-check, make comparison with `nullptr` explicit. And add curly braces. ------------- Changes requested by epeter (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23438#pullrequestreview-2593673668 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1941676625 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1941677350 From epeter at openjdk.org Tue Feb 4 18:33:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 18:33:12 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 04:40:23 GMT, Jasmine Karthikeyan wrote: > Hi all, > This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine: > > > Baseline Patch > Benchmark (SIZE) Mode Cnt Score Error Units Score Error Units Improvement > VectorSubword.intToByte 1024 avgt 12 200.049 ± 19.787 ns/op 56.228 ± 3.535 ns/op (3.56x) > VectorSubword.intToShort 1024 avgt 12 179.826 ± 1.539 ns/op 43.332 ± 1.166 ns/op (4.15x) > VectorSubword.shortToByte 1024 avgt 12 245.580 ± 6.150 ns/op 29.757 ± 1.055 ns/op (8.25x) > > > I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated! Great work, looks generally amazing. I left a few comments below. 
src/hotspot/share/opto/vtransform.hpp line 525: > 523: }; > 524: > 525: class VTransformCastNode : public VTransformNode { I think it would be good to make it a `VTransformCastVectorNode`. test/hotspot/jtreg/compiler/loopopts/superword/TestSubwordVectorization.java line 53: > 51: for (int i = 0; i < SIZE; i++) { > 52: res[i] = RANDOM.nextInt(); > 53: } Can you please use `Generators.java`? It would be great if we can use that in the future, to create more "interesting" input data ;) test/hotspot/jtreg/compiler/loopopts/superword/TestSubwordVectorization.java line 101: > 99: @IR(applyIfCPUFeature = { "avx", "true" }, > 100: applyIfOr = {"AlignVector", "false", "UseCompactObjectHeaders", "false"}, > 101: counts = { IRNode.VECTOR_CAST_I2B, IRNode.VECTOR_SIZE_ANY, ">0" }) Hmm. It would be great if we could assert the length of the vector. Otherwise we may get less than we could. Something like this may work: `IRNode.VECTOR_SIZE + "min(max_int, max_byte)"` Have you tried that yet? test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java line 154: > 152: @IR(applyIfCPUFeature = { "avx", "true" }, > 153: applyIfOr = {"AlignVector", "false", "UseCompactObjectHeaders", "false"}, > 154: counts = { IRNode.VECTOR_CAST_I2S, IRNode.VECTOR_SIZE + "min(max_int, max_short)", ">0" }) Ah see, here we have the vector size asserted, good! test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java line 165: > 163: @Test > 164: @IR(failOn = {IRNode.STORE_VECTOR}) > 165: // Subword vector casts do not work currently, see JDK-8342095. There seem to be other cases in this file that mention `JDK-8342095`. We should probably file a new RFE for those, right? test/micro/org/openjdk/bench/vm/compiler/VectorSubword.java line 73: > 71: } > 72: } > 73: } Ah, these are all casting to smaller types. What about casting to larger types? 
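For readers following along, a minimal sketch of the two loop shapes being contrasted here: a narrowing cast of the kind the patch covers, and a widening cast of the kind the question above asks about. This is not the benchmark or test source from the PR; the class and method names (`SubwordCastLoops`, `intToByte`, `byteToInt`) are hypothetical, and whether C2 vectorizes either loop depends on the platform and the JDK build.

```java
import java.util.Arrays;

public class SubwordCastLoops {
    // Narrowing cast: each 32-bit int is truncated to its low 8 bits.
    public static byte[] intToByte(int[] src) {
        byte[] dst = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
        return dst;
    }

    // Widening cast: each byte is sign-extended to a 32-bit int.
    public static int[] byteToInt(byte[] src) {
        int[] dst = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
        return dst;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(intToByte(new int[] {0x1FF, -1, 128})));
        System.out.println(Arrays.toString(byteToInt(new byte[] {-1, 127})));
    }
}
```

The narrowing loop stores fewer bytes than it loads and the widening loop stores more, so the two directions put different demands on the vector lengths of the load and store packs.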
------------- PR Review: https://git.openjdk.org/jdk/pull/23413#pullrequestreview-2593704858 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1941692475 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1941693919 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1941700026 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1941700871 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1941703352 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1941701910 From epeter at openjdk.org Tue Feb 4 18:38:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 18:38:11 GMT Subject: RFR: 8348570: CTW: Expose the code hidden by uncommon traps [v2] In-Reply-To: References: Message-ID: On Fri, 24 Jan 2025 17:13:32 GMT, Aleksey Shipilev wrote: >> We have been looking at some related compiler behaviors, and realized that in the absence of profiling data, C2 routinely uncommon-traps a lot of code that is presumed to be never executed. This apparently is a norm in CTW tests: CTW runners never execute code, and so only the most basic java.base classes are having any profile. This seems to limit the scope of CTW testing. >> >> I think we need to run CTW in the mode that exposes more code to the compiler optimizations. >> >> Case in point: [JDK-8348572](https://bugs.openjdk.org/browse/JDK-8348572), which reliably fails with more aggressive compilation mode. **We cannot integrate this PR until that bug is fixed**. But we can discuss if this makes sense, and/or we want some other options included to expand CTW testing. > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Also do markMethodProfiled for extra scope This is the fix (band-aid) to unlock us here. 
https://github.com/openjdk/jdk/pull/23363 ------------- PR Comment: https://git.openjdk.org/jdk/pull/23296#issuecomment-2634765746 From epeter at openjdk.org Tue Feb 4 18:47:25 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 18:47:25 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: On Thu, 30 Jan 2025 17:11:08 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: >> >> // We are allowed to use the constant type only if cast succeeded >> >> But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. >> >> Please take a look and leave your reviews, thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > format Nice cleanup. Though it looks like you are doing more than remove the ctrl input. I don't know the code very well, so I have some questions ;) src/hotspot/share/opto/parseHelper.cpp line 170: > 168: !too_many_traps(Deoptimization::Reason_array_check) && > 169: !tak->klass_is_exact() && > 170: tak->isa_aryklassptr()) { Looks like an implicit `nullptr` check. Not allowed by code style ;) src/hotspot/share/opto/parseHelper.cpp line 193: > 191: // See issue JDK-8057622 for details. > 192: > 193: always_see_exact_class = true; Why is it ok to remove this? If this branch is not taken, it used to be `false`, and would lead to something different below... 
------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23274#pullrequestreview-2593742600 PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1941714615 PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1941719070 From epeter at openjdk.org Tue Feb 4 18:47:25 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 18:47:25 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: On Tue, 4 Feb 2025 18:39:32 GMT, Emanuel Peter wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> format > > src/hotspot/share/opto/parseHelper.cpp line 170: > >> 168: !too_many_traps(Deoptimization::Reason_array_check) && >> 169: !tak->klass_is_exact() && >> 170: tak->isa_aryklassptr()) { > > Looks like an implicit `nullptr` check. Not allowed by code style ;) Can you quickly explain this change from `tak != TypeInstKlassPtr::OBJECT` so I don't need to investigate myself, please? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1941715309 From epeter at openjdk.org Tue Feb 4 18:51:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 18:51:15 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation In-Reply-To: References: Message-ID: On Fri, 17 Jan 2025 19:35:44 GMT, Mikhail Ablakatov wrote: > Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. 
> > The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks. > > Benchmarks results for an AArch64 CPU with support for SVE with 256-bit vector length: > > Benchmark (size) Mode Old New Units > Byte256Vector.MULLanes 1024 thrpt 502.498 10222.717 ops/ms > Double256Vector.MULLanes 1024 thrpt 172.116 3130.997 ops/ms > Float256Vector.MULLanes 1024 thrpt 291.612 4164.138 ops/ms > Int256Vector.MULLanes 1024 thrpt 362.276 3717.213 ops/ms > Long256Vector.MULLanes 1024 thrpt 184.826 2054.345 ops/ms > Short256Vector.MULLanes 1024 thrpt 379.231 5716.223 ops/ms > > > Benchmarks results for an AArch64 CPU with support for SVE with 512-bit vector length: > > Benchmark (size) Mode Old New Units > Byte512Vector.MULLanes 1024 thrpt 160.129 2630.600 ops/ms > Double512Vector.MULLanes 1024 thrpt 51.229 1033.284 ops/ms > Float512Vector.MULLanes 1024 thrpt 84.617 1658.400 ops/ms > Int512Vector.MULLanes 1024 thrpt 109.419 1180.310 ops/ms > Long512Vector.MULLanes 1024 thrpt 69.036 704.144 ops/ms > Short512Vector.MULLanes 1024 thrpt 131.029 1629.632 ops/ms This could also be a relevant Benchmark: `./test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java` ------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-2634792965 From epeter at openjdk.org Tue Feb 4 18:55:13 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 18:55:13 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation In-Reply-To: References: Message-ID: <2jvFY4hq9FPdk9e4Zg6LRPdRVhDTGgxofL-we8c-mns=.4e6ce509-67a4-4e46-a661-2b0951f88731@github.com> On Fri, 17 Jan 2025 19:35:44 GMT, Mikhail Ablakatov wrote: > Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors.
It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. > > The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks. > > Benchmarks results for an AArch64 CPU with support for SVE with 256-bit vector length: > > Benchmark (size) Mode Old New Units > Byte256Vector.MULLanes 1024 thrpt 502.498 10222.717 ops/ms > Double256Vector.MULLanes 1024 thrpt 172.116 3130.997 ops/ms > Float256Vector.MULLanes 1024 thrpt 291.612 4164.138 ops/ms > Int256Vector.MULLanes 1024 thrpt 362.276 3717.213 ops/ms > Long256Vector.MULLanes 1024 thrpt 184.826 2054.345 ops/ms > Short256Vector.MULLanes 1024 thrpt 379.231 5716.223 ops/ms > > > Benchmarks results for an AArch64 CPU with support for SVE with 512-bit vector length: > > Benchmark (size) Mode Old New Units > Byte512Vector.MULLanes 1024 thrpt 160.129 2630.600 ops/ms > Double512Vector.MULLanes 1024 thrpt 51.229 1033.284 ops/ms > Float512Vector.MULLanes 1024 thrpt 84.617 1658.400 ops/ms > Int512Vector.MULLanes 1024 thrpt 109.419 1180.310 ops/ms > Long512Vector.MULLanes 1024 thrpt 69.036 704.144 ops/ms > Short512Vector.MULLanes 1024 thrpt 131.029 1629.632 ops/ms src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2139: > 2137: // source vector to get to a 128b vector that fits into a SIMD&FP register. After that point ASIMD > 2138: // instructions are used. > 2139: void C2_MacroAssembler::reduce_mul_fp_gt128b(FloatRegister dst, BasicType bt, FloatRegister fsrc, Drive-by question: This is recursive folding: split the vector in half and combine the two halves, repeatedly. What about the linear reduction, is that also implemented somewhere?
We need that for vector reduction when we come from SuperWord, and have strict order requirement, to avoid rounding divergences. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1941733604 From qamai at openjdk.org Tue Feb 4 18:57:10 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 4 Feb 2025 18:57:10 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: <067rRrzD6d7ZDU-HYPHQ-qVhPygP_3WqrrgZvikgjIc=.98110421-5c91-492a-8f35-a9544cde6189@github.com> On Tue, 4 Feb 2025 18:40:05 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/parseHelper.cpp line 170: >> >>> 168: !too_many_traps(Deoptimization::Reason_array_check) && >>> 169: !tak->klass_is_exact() && >>> 170: tak->isa_aryklassptr()) { >> >> Looks like an implicit `nullptr` check. Not allowed by code style ;) > > Can you quickly explain this change from `tak != TypeInstKlassPtr::OBJECT` so I don't need to investigate myself, please? > Looks like an implicit nullptr check. Not allowed by code style ;) But the verb here is `isa` and we use these as a `bool` a lot, though :/ > Can you quickly explain this change from tak != TypeInstKlassPtr::OBJECT so I don't need to investigate myself, please? The bottom type of an array can be either `Object` or an array of some kind, so `tak != TypeInstKlassPtr::OBJECT` is the same as `tak->isa_aryklassptr()`. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1941732694 From qamai at openjdk.org Tue Feb 4 18:57:11 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 4 Feb 2025 18:57:11 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: On Tue, 4 Feb 2025 18:43:04 GMT, Emanuel Peter wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> format > > src/hotspot/share/opto/parseHelper.cpp line 193: > >> 191: // See issue JDK-8057622 for details. >> 192: >> 193: always_see_exact_class = true; > > Why is it ok to remove this? > If this branch is not taken, it used to be `false`, and would lead to something different below... The only use of this is to decide if we need to attach a control input to the `LoadKlass`. As the control input is not needed, this can be removed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1941735400 From epeter at openjdk.org Tue Feb 4 19:01:27 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 19:01:27 GMT Subject: RFR: 8346836: C2: Introduce a way to verify the correctness of ConstraintCastNodes at runtime [v4] In-Reply-To: References: Message-ID: On Wed, 22 Jan 2025 15:22:43 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch adds a develop flag `VerifyConstraintCasts`, which will verify the correctness of `CastIINode`s and `CastLLNode`s at runtime and crash the VM if the dynamic value lies outside the type value range. >> >> Please take a look, thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > better comments Sorry for the delay. Looks great, especially with the better comments! ? 
------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22880#pullrequestreview-2593784793 From epeter at openjdk.org Tue Feb 4 19:06:16 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 19:06:16 GMT Subject: RFR: 8341293: Split field loads through Nested Phis [v7] In-Reply-To: <18TQt6vxN9KxSVwyeQtAWde-ezaVuUEioAl_5_3sAeE=.e5e76fb6-04a7-4f6f-9377-f1e64837ada6@github.com> References: <18TQt6vxN9KxSVwyeQtAWde-ezaVuUEioAl_5_3sAeE=.e5e76fb6-04a7-4f6f-9377-f1e64837ada6@github.com> Message-ID: On Fri, 24 Jan 2025 19:13:13 GMT, Dhamoder Nalla wrote: >> As an extension of the work done as part of https://github.com/openjdk/jdk/pull/12897, split the field loads (AddP -> Load*) with nested phi parent nodes to enable more scalar replacements, thereby reducing memory allocation. >> >> >> Here is the sequence of Ideal graph transformations for Nested phi: >> >> >> ![image](https://github.com/user-attachments/assets/c18e5ca0-c554-475c-814a-7cb288d96569) >> >> ![image](https://github.com/user-attachments/assets/b279b5f2-9ec6-4d9b-a627-506451f1cf81) >> >> ![image](https://github.com/user-attachments/assets/f506b918-2dd0-4dbe-a440-ff253afa3961) >> >> JMH results: >> with disabled RAM >> >> Benchmark Mode Cnt Score Error Units >> NestedPhiAndRematerialize.NopRAM.testBailOut_runner avgt 15 13.969 ± 0.248 ms/op >> NestedPhiAndRematerialize.NopRAM.testFieldEscapeWithMerge_runner avgt 15 80.300 ± 4.306 ms/op >> NestedPhiAndRematerialize.NopRAM.testMerge_TryCatchFinally_runner avgt 15 72.182 ± 1.781 ms/op >> NestedPhiAndRematerialize.NopRAM.testMultiParentPhi_runner avgt 15 2.983 ± 0.001 ms/op >> NestedPhiAndRematerialize.NopRAM.testNestedPhiPolymorphic_runner avgt 15 18.342 ± 0.731 ms/op >> NestedPhiAndRematerialize.NopRAM.testNestedPhiProcessOrder_runner avgt 15 14.315 ± 0.443 ms/op >> NestedPhiAndRematerialize.NopRAM.testNestedPhiWithLambda_runner avgt 15 18.511 ±
1.212 ms/op >> NestedPhiAndRematerialize.NopRAM.testNestedPhiWithTrap_runner avgt 15 66.277 ± 1.478 ms/op >> NestedPhiAndRematerialize.NopRAM.testNestedPhi_FieldLoad_runner avgt 15 17.968 ± 0.306 ms/op >> NestedPhiAndRematerialize.NopRAM.testNestedPhi_TryCatch_runner avgt 15 14.186 ± 0.247 ms/op >> NestedPhiAndRematerialize.NopRAM.testRematerialize_MultiObj_runner avgt 15 88.435 ± 4.869 ms/op >> NestedPhiAndRematerialize.NopRAM.testRematerialize_SingleObj_runner avgt 15 29560.130 ± 48.797 ms/op >> NestedPhiAndRematerialize.NopRAM.testRematerialize_TryCatch_runner avgt 15 49.150 ± 2.307 ms/op >> NestedPhiAndRematerialize.NopRAM.testThreeLevelNestedPhi_runner avgt 15 18.236 ± 0.308 ms/op >> >> with enabled RAM >> Benchmark Mode Cnt Score Error Units >> NestedPhiAndRematerialize.YesRAM.testBailOut_runner avgt 15 3.257 ± 0.423 ms/op >> NestedPhiAndRematerialize.YesRAM.testFieldEscapeWithMerge_runner avgt 15 79.916 ± 3.477 ms/op >> NestedPhiAndRematerialize.YesRAM.testMerge_TryCatchFinally_runner avgt 15 72.053 ± 1.916 ms/op >> NestedPhiAndRematerialize.YesRAM.testMultiParentPhi_runner avgt 15 2.984 ± 0.001 ms/op >> NestedPhiAndRematerialize.YesRAM.testNestedPhiPolymorphic_runner avgt ... > > Dhamoder Nalla has updated the pull request incrementally with one additional commit since the last revision: > > Modify IR rules @dhanalla Would you like this to be reviewed? We generally don't re-review until we get pinged again. The idea is that you are maybe still working on it, and so there is no point in reviewing half-processed code.
So once you are happy, you can let us know ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/21270#issuecomment-2634821753 From epeter at openjdk.org Tue Feb 4 19:09:35 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 19:09:35 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 10:05:09 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. 
>> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Fixing typos Thanks @jatin-bhateja for all your patience, this really took a while ? It looks good to me - again I'm only reviewing the C++ VM changes, so someone else has to review the Java changes. ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22754#pullrequestreview-2593800414 From epeter at openjdk.org Tue Feb 4 19:09:36 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 19:09:36 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: <7WobCDj_e4Sw1CEYr3EVfgHTxJoxBfiFR63WwrzDDzs=.27e926d0-23e6-4231-a677-fdfd683083be@github.com> On Tue, 4 Feb 2025 09:56:15 GMT, Jatin Bhateja wrote: >> src/hotspot/share/opto/convertnode.cpp line 971: >> >>> 969: return true; >>> 970: default: >>> 971: return false; >> >> Does this cover all cases? What about `FmaHF`? > > FmaHF is a ternary operation and is intrinsified. Ah, right. My bad ? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1941748224 From epeter at openjdk.org Tue Feb 4 19:15:14 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 19:15:14 GMT Subject: RFR: 8348659: AArch64: IR rule failure with compiler/loopopts/superword/TestSplitPacks.java In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 11:19:27 GMT, Bhavana Kilambi wrote: > "test5a" in this file fails on Graviton3 (32B, SVE) as the compiler fails to match IR rules for vector size 2. This is because the minimum vector size for aarch64 machines is 8B and it does not support generation of vectors of 2 short values. > > Modified the IR rules to have two separate rules - one for sse4.1 and another for sve. > > The test now passes on Graviton3. 
Testing all :green_circle: . Thanks for the fix! Marked as reviewed by epeter (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23385#pullrequestreview-2593810574 PR Review: https://git.openjdk.org/jdk/pull/23385#pullrequestreview-2593811001 From epeter at openjdk.org Tue Feb 4 19:19:30 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 19:19:30 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v17] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 07:30:07 GMT, Matthias Ernst wrote: >> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: >> >> jlong, not long > > Yes, win was failing due to a mixup between long (32bit) and jlong. The last commit fixed the win presubmit for me. @mernst-github Good, testing looks clean on my side too! I'll have a closer look at the C++ VM changes again tomorrow. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2634855996 From liach at openjdk.org Tue Feb 4 19:21:44 2025 From: liach at openjdk.org (Chen Liang) Date: Tue, 4 Feb 2025 19:21:44 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: <7oq7j2pYG9ToDNcGyVWrphH_wFyvPRX2kl3qxgQYBss=.449139d7-e3a8-4587-b5ce-a5f7f9f5b613@github.com> On Tue, 4 Feb 2025 10:05:09 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3.
Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Fixing typos src/java.base/share/classes/jdk/internal/vm/vector/Float16Math.java line 42: > 40: } > 41: > 42: public interface Float16TernaryMathOp { Is there a reason we don't write the default impl explicitly in this class, but ask for a lambda for an implementation? Each intrinsified method only has one default impl, so I think we can just inline that into the method body here. 
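A rough sketch of the two shapes being compared, in case it helps the discussion. Everything here is hypothetical: the class `Float16MathSketch`, the `TernaryOp` interface, and the integer stand-in arithmetic are placeholders, not the actual `Float16Math` code or signatures from the PR. It only contrasts passing the fallback as a lambda with writing the single fallback directly in the method body.

```java
// Hypothetical sketch; names and arithmetic are stand-ins, not the real Float16Math API.
final class Float16MathSketch {
    // Shape 1: the caller supplies the fallback implementation as a lambda.
    interface TernaryOp {
        short apply(short a, short b, short c);
    }

    static short fmaWithLambda(short a, short b, short c, TernaryOp defaultImpl) {
        // In an intrinsified method the JIT would replace this call;
        // the lambda only runs when the intrinsic is not used.
        return defaultImpl.apply(a, b, c);
    }

    // Shape 2: the one and only fallback is written in the method body.
    static short fmaInlined(short a, short b, short c) {
        // Plain integer math as a placeholder for the real binary16 computation.
        return (short) (a * b + c);
    }

    public static void main(String[] args) {
        short viaLambda = fmaWithLambda((short) 3, (short) 4, (short) 5,
                (x, y, z) -> (short) (x * y + z));
        short inlined = fmaInlined((short) 3, (short) 4, (short) 5);
        System.out.println(viaLambda + " " + inlined);
    }
}
```

With a single fallback per method, the second shape avoids the extra interface and the lambda argument at every call site; the first shape only pays off if callers need to supply different fallbacks.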
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1941764924 From epeter at openjdk.org Tue Feb 4 19:22:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 19:22:15 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v18] In-Reply-To: References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> Message-ID: <90MwDac7Q83dK8KDagHOst15xV-quGZKVE8n2tP9dsk=.351ed042-9a69-4186-b134-8c3cb6fef6cd@github.com> On Tue, 4 Feb 2025 17:47:29 GMT, Jatin Bhateja wrote: >> @jatin-bhateja Launched testing for Commit 17 / v22. > > Hi @eme64 , Kindly share the results of your test runs. @jatin-bhateja Tests all look good on my side. I'll make another pass in the next few days, and hopefully approve. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22863#issuecomment-2634863422 From kvn at openjdk.org Tue Feb 4 19:28:17 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 4 Feb 2025 19:28:17 GMT Subject: RFR: 8346836: C2: Introduce a way to verify the correctness of ConstraintCastNodes at runtime [v4] In-Reply-To: References: Message-ID: On Wed, 22 Jan 2025 15:22:43 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch adds a develop flag `VerifyConstraintCasts`, which will verify the correctness of `CastIINode`s and `CastLLNode`s at runtime and crash the VM if the dynamic value lies outside the type value range. >> >> Please take a look, thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > better comments Can we add an AArch64 implementation too, to cover our platforms?
------------- PR Review: https://git.openjdk.org/jdk/pull/22880#pullrequestreview-2593839063 From shade at openjdk.org Tue Feb 4 19:35:12 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 4 Feb 2025 19:35:12 GMT Subject: RFR: 8348570: CTW: Expose the code hidden by uncommon traps [v3] In-Reply-To: References: Message-ID: > We have been looking at some related compiler behaviors, and realized that in the absence of profiling data, C2 routinely uncommon-traps a lot of code that is presumed to be never executed. This apparently is a norm in CTW tests: CTW runners never execute code, and so only the most basic java.base classes are having any profile. This seems to limit the scope of CTW testing. > > I think we need to run CTW in the mode that exposes more code to the compiler optimizations. > > Case in point: [JDK-8348572](https://bugs.openjdk.org/browse/JDK-8348572), which reliably fails with more aggressive compilation mode. **We cannot integrate this PR until that bug is fixed**. But we can discuss if this makes sense, and/or we want some other options included to expand CTW testing. Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: - Merge branch 'master' into JDK-8348570-ctw-uncommon-traps - Also do markMethodProfiled for extra scope - Fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23296/files - new: https://git.openjdk.org/jdk/pull/23296/files/5247bec7..78fa50c4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23296&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23296&range=01-02 Stats: 40733 lines in 2955 files changed: 18556 ins; 12154 del; 10023 mod Patch: https://git.openjdk.org/jdk/pull/23296.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23296/head:pull/23296 PR: https://git.openjdk.org/jdk/pull/23296 From shade at openjdk.org Tue Feb 4 19:35:13 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 4 Feb 2025 19:35:13 GMT Subject: RFR: 8348570: CTW: Expose the code hidden by uncommon traps [v2] In-Reply-To: References: Message-ID: On Fri, 24 Jan 2025 17:13:32 GMT, Aleksey Shipilev wrote: >> We have been looking at some related compiler behaviors, and realized that in the absence of profiling data, C2 routinely uncommon-traps a lot of code that is presumed to be never executed. This apparently is a norm in CTW tests: CTW runners never execute code, and so only the most basic java.base classes are having any profile. This seems to limit the scope of CTW testing. >> >> I think we need to run CTW in the mode that exposes more code to the compiler optimizations. >> >> Case in point: [JDK-8348572](https://bugs.openjdk.org/browse/JDK-8348572), which reliably fails with more aggressive compilation mode. **We cannot integrate this PR until that bug is fixed**. But we can discuss if this makes sense, and/or we want some other options included to expand CTW testing. 
> > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Also do markMethodProfiled for extra scope > This is the fix (band-aid) to unlock us here. #23363 Ack. Retesting. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23296#issuecomment-2634882080 From shade at openjdk.org Tue Feb 4 19:37:26 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 4 Feb 2025 19:37:26 GMT Subject: RFR: 8348570: CTW: Expose the code hidden by uncommon traps [v2] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 19:30:01 GMT, Aleksey Shipilev wrote: > > This is the fix (band-aid) to unlock us here. #23363 > > Ack. Retesting. Nevermind. I thought it was integrated. I'll wait some more :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23296#issuecomment-2634891840 From dnsimon at openjdk.org Tue Feb 4 19:39:15 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 4 Feb 2025 19:39:15 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 16:31:50 GMT, Tom Rodriguez wrote: > This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. Marked as reviewed by dnsimon (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23444#pullrequestreview-2593861407 From dnsimon at openjdk.org Tue Feb 4 19:39:16 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 4 Feb 2025 19:39:16 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 17:38:40 GMT, Tom Rodriguez wrote: > `failedSpeculationsAddress` also can't be 0 if `managesFailedSpeculations` is false since we throw `IllegalArgumentException` in that case. Ok, I'd forgotten about that invariant. Might be worth reminding the reader of it with a comment in collectFailedSpeculations. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23444#discussion_r1941785813 From jkarthikeyan at openjdk.org Tue Feb 4 19:43:32 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 4 Feb 2025 19:43:32 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 18:27:03 GMT, Emanuel Peter wrote: >> Hi all, >> This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine: >> >> >> Baseline Patch >> Benchmark (SIZE) Mode Cnt Score Error Units Score Error Units Improvement >> VectorSubword.intToByte 1024 avgt 12 200.049 ± 19.787 ns/op 56.228 ± 3.535 ns/op (3.56x) >> VectorSubword.intToShort 1024 avgt 12 179.826 ± 1.539 ns/op 43.332 ± 1.166 ns/op (4.15x) >> VectorSubword.shortToByte 1024 avgt 12 245.580 ± 6.150 ns/op 29.757 ± 1.055 ns/op (8.25x) >> >> >> I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated! > > test/hotspot/jtreg/compiler/loopopts/superword/TestSubwordVectorization.java line 101: > >> 99: @IR(applyIfCPUFeature = { "avx", "true" }, >> 100: applyIfOr = {"AlignVector", "false", "UseCompactObjectHeaders", "false"}, >> 101: counts = { IRNode.VECTOR_CAST_I2B, IRNode.VECTOR_SIZE_ANY, ">0" }) > > Hmm.
It would be great if we could assert the length of the vector. Otherwise we may get less than we could. > > Something like this may work: > `IRNode.VECTOR_SIZE + "min(max_int, max_byte)"` > > Have you tried that yet? I think I wrote this test before realizing you can do size checks like in `ArrayTypeConvertTest.java`, I'll make sure to change it to the same system there. > test/micro/org/openjdk/bench/vm/compiler/VectorSubword.java line 73: > >> 71: } >> 72: } >> 73: } > > Ah, these are all casting to smaller types. What about casting to larger types? Originally I thought that larger type conversions were only available on AVX-512 based on the ad-file, but reading it more carefully now I see that they are indeed supported for AVX! I'll go ahead and generalize the changes further to make casting to larger types supported as well. I think this leaves `char` as the only type not supported, which I can look at in a follow-up RFE. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1941790606 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1941790789 From never at openjdk.org Tue Feb 4 20:52:37 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 4 Feb 2025 20:52:37 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash [v2] In-Reply-To: References: Message-ID: > This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. 
Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision: improve javadoc ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23444/files - new: https://git.openjdk.org/jdk/pull/23444/files/aefc1dfd..459f5c36 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23444&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23444&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23444.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23444/head:pull/23444 PR: https://git.openjdk.org/jdk/pull/23444 From never at openjdk.org Tue Feb 4 20:52:37 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 4 Feb 2025 20:52:37 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash [v2] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 19:36:36 GMT, Doug Simon wrote: >> I'm filtering out 0 above this line and `getFailedSpeculationsAddress()` can't return 0 if `failedSpeculationsAddress` is already non-zero. >> >> `failedSpeculationsAddress` also can't be 0 if `managesFailedSpeculations` is false since we throw `IllegalArgumentException` in that case. > >> `failedSpeculationsAddress` also can't be 0 if `managesFailedSpeculations` is false since we throw `IllegalArgumentException` in that case. > > Ok, I'd forgotten about that invariant. Might be worth reminding the reader of it with a comment in collectFailedSpeculations. It's already the style in other places like the call to addFailedSpeculation so I'm not sure it's worth calling out here. I've updated the javadoc for getFailedSpeculationsAddress to specify that it always returns non-zero. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23444#discussion_r1941877712 From never at openjdk.org Tue Feb 4 20:56:53 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 4 Feb 2025 20:56:53 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash [v3] In-Reply-To: References: Message-ID: > This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision: improve comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23444/files - new: https://git.openjdk.org/jdk/pull/23444/files/459f5c36..5a5fd6fc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23444&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23444&range=01-02 Stats: 6 lines in 1 file changed: 4 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23444.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23444/head:pull/23444 PR: https://git.openjdk.org/jdk/pull/23444 From jpai at openjdk.org Wed Feb 5 04:56:14 2025 From: jpai at openjdk.org (Jaikiran Pai) Date: Wed, 5 Feb 2025 04:56:14 GMT Subject: [jdk24] RFR: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 18:41:44 GMT, Jaikiran Pai wrote: > Can I please get a review of this backport of https://github.com/openjdk/jdk/pull/23420 into jdk24? > > This proposes to bring in those same backouts into `jdk24` to prevent the issue noted in that PR description. jdk24 is in rampdown and this backport will require an approval. A approval request has been raised in https://bugs.openjdk.org/browse/JDK-8349183?focusedId=14746841&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14746841. > > This backport into jdk24 wasn't clean due to a trivial merge conflict in `StringLatin1.java` file. 
That merge conflict was manually resolved (just like it was done against mainline). The git commands used to create this backport against jdk24 branch are: > > > > git cherry-pick --no-commit 618c5eb27b4c719afd577b690e6bcb21a45fcb0d > > git commit -m 'Backport 618c5eb27b4c719afd577b690e6bcb21a45fcb0d' > > > tier1, tier2 and tier3 testing is currently in progress with this change. Thank you Roger and Chen for the reviews. Approval for backporting the 2 issues, linked to this PR, into jdk24 has now been granted. I've run tier1 through tier6 with these changes in our CI and that has completed without related issues. tier7 and tier8 are progressing fine too. I'll go ahead and integrate this shortly. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23425#issuecomment-2635699537 From thartmann at openjdk.org Wed Feb 5 06:08:20 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 5 Feb 2025 06:08:20 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v9] In-Reply-To: References: Message-ID: <4NyjmA6xaOMLUYbMqPXgkxZnhtBVj3feMH2Z4wjum5k=.9f70f7ec-a782-43c5-9350-c9e10bd1d3ea@github.com> On Tue, 4 Feb 2025 10:11:36 GMT, Roland Westrelin wrote: >> To optimize a long counted loop and long range checks in a long or int >> counted loop, the loop is turned into a loop nest. When the loop has >> few iterations, the overhead of having an outer loop whose backedge is >> never taken, has a measurable cost. Furthermore, creating the loop >> nest usually causes one iteration of the loop to be peeled so >> predicates can be set up. If the loop is short running, then it's an >> extra iteration that's run with range checks (compared to an int >> counted loop with int range checks). 
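[Editorial illustration] The loop-nest shape described in the quoted text can be pictured at the Java source level. The sketch below is only an analogy, not what C2 actually generates: the class name, the method names and the `CHUNK` constant are hypothetical, and the code assumes `stop - start` does not overflow a `long`.

```java
final class LoopNestSketch {
    // Hypothetical inner-loop bound: small enough that the inner loop
    // can use an int induction variable.
    static final int CHUNK = 1 << 20;

    // Original shape: a single loop with a long induction variable.
    static long sumFlat(long start, long stop) {
        long sum = 0;
        for (long i = start; i < stop; i++) {
            sum += i;
        }
        return sum;
    }

    // Loop-nest shape: an outer long loop that advances in chunks and an
    // inner int counted loop that does the actual work.
    static long sumNested(long start, long stop) {
        long sum = 0;
        long outer = start;
        while (outer < stop) {
            // Assumes stop - outer does not overflow (see lead-in).
            int innerLimit = (int) Math.min(stop - outer, (long) CHUNK);
            for (int j = 0; j < innerLimit; j++) {
                sum += outer + j;
            }
            outer += innerLimit;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Both shapes compute the same result.
        System.out.println(sumFlat(-5, 3_000_000L) == sumNested(-5, 3_000_000L));
    }
}
```

When the trip count is below `CHUNK`, the outer loop body runs exactly once and its backedge is never taken, which is the overhead the quoted description wants to avoid for short-running loops.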
>> >> This change doesn't create a loop nest when: >> >> 1- it can be determined statically at loop nest creation time that the >> loop runs for a short enough number of iterations >> >> 2- profiling reports that the loop runs for no more than ShortLoopIter >> iterations (1000 by default). >> >> For 2-, a guard is added which is implemented as yet another predicate. >> >> While this change is in principle simple, I ran into a few >> implementation issues: >> >> - while C2 has a way to compute the number of iterations of an int >> counted loop, it doesn't have that for long counted loops. The >> existing logic for int counted loops promotes values to long to >> avoid overflows. I reworked it so it now works for both long and int >> counted loops. >> >> - I added a new deoptimization reason (Reason_short_running_loop) for >> the new predicate. Given the number of iterations is narrowed down >> by the predicate, the limit of the loop after transformation is a >> cast node that's control dependent on the short running loop >> predicate. Because once the counted loop is transformed, it is >> likely that range check predicates will be inserted and they will >> depend on the limit, the short running loop predicate has to be the >> one that's further away from the loop entry. Now it is also possible >> that the limit before transformation depends on a predicate >> (TestShortRunningLongCountedLoopPredicatesClone is an example), we >> can have: new predicates inserted after the transformation that >> depend on the casted limit that itself depends on old predicates >> added before the transformation. To solve this circular dependency, >> parse and assert predicates are cloned between the old predicates >> and the loop head. The cloned short running loop parse predicate is >> the one that's used to insert the short running loop predicate. >> >> - In the case of a long counted loop, the loop is transformed into a >> regular loop with a ...
> > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 32 commits: > > - TestMemorySegment test fix > - test wip > - Merge branch 'master' into JDK-8342692 > - refactor > - Merge branch 'master' into JDK-8342692 > - Merge branch 'master' into JDK-8342692 > - Merge branch 'master' into JDK-8342692 > - Merge branch 'master' into JDK-8342692 > - review > - reviews > - ... and 22 more: https://git.openjdk.org/jdk/compare/3f1d9b57...7dd6fde9 `compiler/escapeAnalysis/TestMissingAntiDependency.java` fails on Windows x64 and Linux AArch64 with `-XX:StressLongCountedLoop=200000000`: # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (workspace\open\src\hotspot\share\opto\gcm.cpp:916), pid=35968, tid=34752 # assert(use_mem_state != load->find_exact_control(load->in(0))) failed: dependence cycle found # Current CompileTask: C2:710 98 b 4 TestMissingAntiDependency::test (89 bytes) Stack: [0x0000007bdcb00000,0x0000007bdcc00000], sp=0x0000007bdcbfbba0, free space=1006k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [jvm.dll+0x7d2910] PhaseCFG::insert_anti_dependences+0xe30 (gcm.cpp:916) V [jvm.dll+0x7d591f] PhaseCFG::schedule_late+0x47f (gcm.cpp:1536) V [jvm.dll+0x7d083e] PhaseCFG::global_code_motion+0x31e (gcm.cpp:1650) V [jvm.dll+0x7cf2ad] PhaseCFG::do_global_code_motion+0x6d (gcm.cpp:1780) V [jvm.dll+0x55746d] Compile::Code_Gen+0x19d (compile.cpp:2953) V [jvm.dll+0x555ca0] Compile::Compile+0x11d0 (compile.cpp:882) V [jvm.dll+0x45cfd9] C2Compiler::compile_method+0x179 (c2compiler.cpp:144) V [jvm.dll+0x573a5a] CompileBroker::invoke_compiler_on_method+0x7aa (compileBroker.cpp:2317) V [jvm.dll+0x570fab] CompileBroker::compiler_thread_loop+0x33b (compileBroker.cpp:1976) V [jvm.dll+0x8ba602] JavaThread::thread_main_inner+0x282 (javaThread.cpp:777) V [jvm.dll+0xfa95f4] Thread::call_run+0x1b4 (thread.cpp:236) V [jvm.dll+0xd6ae91] 
thread_native_entry+0xe1 (os_windows.cpp:566) C [ucrtbase.dll+0x2268a] (no source info available) C [KERNEL32.DLL+0x17ac4] (no source info available) C [ntdll.dll+0x5a8c1] (no source info available) Maybe it's (related to) [JDK-8341976](https://bugs.openjdk.org/browse/JDK-8341976)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-2635779606 From thartmann at openjdk.org Wed Feb 5 06:15:10 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 5 Feb 2025 06:15:10 GMT Subject: RFR: 8349102: Test compiler/arguments/TestCodeEntryAlignment.java failed: assert(allocates2(pc)) failed: not in CodeBuffer memory In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 13:44:10 GMT, Andrew Dinn wrote: > assert(allocates2(pc)) failed: not in CodeBuffer memory > > The StubGenerator compiler blob runs out of space when TestCodeEntryAlignment is run on macos/x86_64 on an avx2-only CPU. This only happens in the worst case with command line options `-XX:CodeCacheSegmentSize=1024 -XX:CodeEntryAlignment=1024`. > > On linux/x86_64 the test succeeds in that worst case when run on an avx512-enabled CPU but with only 980 bytes of headroom. > > This patch increments the buffer size on x86_64 to ensure both the avx2 and avx3 cases have enough headroom. > > n.b. the increment has deliberately been made x86_64-specific rather than macos-specific, even though this problem manifests when testing MacOS and does not manifest when testing Linux. The disparity in generated stubs size actually relates to the capabilities of the CPU and is independent of OS. Looks good, I submitted testing and will report back once it has passed.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23439#issuecomment-2635787896 From thartmann at openjdk.org Wed Feb 5 06:26:16 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 5 Feb 2025 06:26:16 GMT Subject: RFR: 8348572: C2 compilation asserts due to unexpected irreducible loop [v2] In-Reply-To: <_3Sc6qth_bahQb2eOsLLf0mb9ATrPHGwM6GedAOAUyU=.bc9fc68f-d573-48af-a4ab-f9260422c460@github.com> References: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> <_3Sc6qth_bahQb2eOsLLf0mb9ATrPHGwM6GedAOAUyU=.bc9fc68f-d573-48af-a4ab-f9260422c460@github.com> Message-ID: On Fri, 31 Jan 2025 06:11:41 GMT, Emanuel Peter wrote: >> A quick summary: >> - In [JDK-8280126](https://bugs.openjdk.org/browse/JDK-8280126), we decided that we are only going to allow irreducible loops that were detected at parsing, and we can thus restrict optimizations to reducible loops which would be difficult to do correct with irreducible loops. That's why we added that assert that checks that no new irreducible loop shows up during compilation. >> - Problem: we use `split_if` for `IfNode::Ideal_common` to split through a Region that is loop-head, and the splitting of the Region introduces a second loop entry -> irreducible loop. >> >> Before `split_if`: >> ![image](https://github.com/user-attachments/assets/01bc78fa-7fed-4a8f-b6f4-078dac9b5dc4) >> >> After `split_if`: >> ![image](https://github.com/user-attachments/assets/1e3bd08e-b76d-4e7f-813e-27a5a22cb2bd) >> >> >> - We have the `split_if` for `IfNode::Ideal_common` to do split-if on straight-line code. But we currently execute this before loop-opts, and so we don't know if the region we split through is actually a loop head. We guard against LoopNode, but a Region only becomes a LoopNode in loop-opts. >> - We also have split-if in loop-opts, which is more careful about splitting through loop-heads. 
>> - Just removing the straight-line split-if probably leads to a regression, as the loop-opts version only executes if there are loops for example. >> - We could consider delaying the straight-line split-if until after loop-opts. But I don't know if that could lead to regressions in any way. >> >> I discussed this temporary solution with @TobiHartmann : >> - We would like [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) to be unblocked for @shipilev . >> - Convert the assert into a bailout-check, so we are sure we behave correctly in product. Compiling with irreducible loops behaves correctly in almost all cases, but there could be exceptions. >> - For now, have the assert behind a Verify flag, so that [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) is unblocked. Later, we can remove the Verify flag and alway enable the assert again. >> - This fix also looks easier to backport. >> >> ----------------------- >> >> The attached regression test now does **NOT** fail by default, but rather silently bails out of compilation. >> >> With the new debug flag `-XX:+VerifyNoNewIrreducibleLoops`, we still hit the assert, as expected: >> >> # Internal Error (/oracle-work/jdk-fork0/open/src/hotspot/share/opto/loopnode.cpp:5636), pid=3698055, tid=3698072 >> # asser... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > update for Vladimir Nice analysis! Thanks for quickly jumping on this, Emanuel! Looks good to me. src/hotspot/share/opto/loopnode.cpp line 5648: > 5646: if (!head->can_be_irreducible_entry()) { > 5647: assert(!VerifyNoNewIrreducibleLoops, "A new irreducible loop was created after parsing."); > 5648: C->record_method_not_compilable("A new irreducible loop was created after parsing."); If you haven't done that yet, I would suggest to hardcode these bailouts to "always bail" out and run testing to check if the bailout always works. 
You'll of course get all kinds of test failures but the VM should not crash/assert (you can filter for these in the test results and ignore anything else). test/hotspot/jtreg/compiler/loopopts/TestSplitIfNewIrreducibleLoop.java line 47: > 45: > 46: public static void main(String[] args) { > 47: // Instanciate one each: classes are loaded. Suggestion: // Instantiate one each: classes are loaded. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23363#pullrequestreview-2594724717 PR Review Comment: https://git.openjdk.org/jdk/pull/23363#discussion_r1942305325 PR Review Comment: https://git.openjdk.org/jdk/pull/23363#discussion_r1942302878 From thartmann at openjdk.org Wed Feb 5 06:47:14 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 5 Feb 2025 06:47:14 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 13:02:47 GMT, Roland Westrelin wrote: > This change refactors `RShiftI`/`RshiftL` `Ideal`, `Identity` and > `Value` because the `int` and `long` versions are very similar and so > there's no logic duplication. In the process, support for some extra > transformations is added to `RShiftL`. I also added some new test > cases. 
Fails to build on Mac AArch64: [2025-02-05T06:43:04,925Z] * For target hotspot_variant-server_libjvm_objs_mulnode.o: [2025-02-05T06:43:04,925Z] [...]workspace/open/src/hotspot/share/opto/mulnode.cpp:1400:13: error: use of bitwise '&' with boolean operands [-Werror,-Wbitwise-instead-of-logical] [2025-02-05T06:43:04,925Z] assert((checked_cast(lo) == lo_verify) & (checked_cast(hi) == hi_verify), "inconsistent"); ------------- PR Comment: https://git.openjdk.org/jdk/pull/23438#issuecomment-2635828033 From jpai at openjdk.org Wed Feb 5 06:54:21 2025 From: jpai at openjdk.org (Jaikiran Pai) Date: Wed, 5 Feb 2025 06:54:21 GMT Subject: [jdk24] Integrated: 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 18:41:44 GMT, Jaikiran Pai wrote: > Can I please get a review of this backport of https://github.com/openjdk/jdk/pull/23420 into jdk24? > > This proposes to bring in those same backouts into `jdk24` to prevent the issue noted in that PR description. jdk24 is in rampdown and this backport will require an approval. A approval request has been raised in https://bugs.openjdk.org/browse/JDK-8349183?focusedId=14746841&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14746841. > > This backport into jdk24 wasn't clean due to a trivial merge conflict in `StringLatin1.java` file. That merge conflict was manually resolved (just like it was done against mainline). The git commands used to create this backport against jdk24 branch are: > > > > git cherry-pick --no-commit 618c5eb27b4c719afd577b690e6bcb21a45fcb0d > > git commit -m 'Backport 618c5eb27b4c719afd577b690e6bcb21a45fcb0d' > > > tier1, tier2 and tier3 testing is currently in progress with this change. This pull request has now been integrated. 
Changeset: f0837b21 Author: Jaikiran Pai URL: https://git.openjdk.org/jdk/commit/f0837b218317c7ac6e031a93381da3caa93946aa Stats: 169 lines in 6 files changed: 46 ins; 79 del; 44 mod 8349183: [BACKOUT] Optimization for StringBuilder append boolean & null 8349239: [BACKOUT] Reuse StringLatin1::putCharsAt and StringUTF16::putCharsAt Reviewed-by: rriggs, liach Backport-of: 618c5eb27b4c719afd577b690e6bcb21a45fcb0d ------------- PR: https://git.openjdk.org/jdk/pull/23425 From jbhateja at openjdk.org Wed Feb 5 07:09:15 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 5 Feb 2025 07:09:15 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: <7oq7j2pYG9ToDNcGyVWrphH_wFyvPRX2kl3qxgQYBss=.449139d7-e3a8-4587-b5ce-a5f7f9f5b613@github.com> References: <7oq7j2pYG9ToDNcGyVWrphH_wFyvPRX2kl3qxgQYBss=.449139d7-e3a8-4587-b5ce-a5f7f9f5b613@github.com> Message-ID: On Tue, 4 Feb 2025 19:18:39 GMT, Chen Liang wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Fixing typos > > src/java.base/share/classes/jdk/internal/vm/vector/Float16Math.java line 42: > >> 40: } >> 41: >> 42: public interface Float16TernaryMathOp { > > Is there a reason we don't write the default impl explicitly in this class, but ask for a lambda for an implementation? Each intrinsified method only has one default impl, so I think we can just inline that into the method body here. This wrapper class is part of java.base module and only contains intrinsic entry points for APIs defined in Float16 class which is part of an incubation module. Thus, exposing intrinsic fallback code through lambda keeps the interface clean while actual API logic and comments around it remains intact in Float16 class. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1942344948 From thartmann at openjdk.org Wed Feb 5 07:40:13 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 5 Feb 2025 07:40:13 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets [v2] In-Reply-To: References: Message-ID: <0IHpI6LCLioH2g3C7YyJ4aPhI4wVJikZCTHA0EjcQcQ=.ff3631b4-b991-4de3-a2d5-bbf8bf34581d@github.com> On Mon, 3 Feb 2025 14:21:31 GMT, Jatin Bhateja wrote: >> Math.copySign is only intrinsified on x86 targets supporting the AVX512 feature. >> Intel E-core Xeons support only the AVX2 feature set and still compile Java implementation which is composed of logical operations. >> >> Since there is a 3-cycle penalty for copying incoming float/double values to GPRs before being operated upon by logical operation there is an opportunity to optimize this using an efficient instruction sequence. >> >> Patch uses ANDPS and ANDPD logical instruction to generate efficient instruction sequences to absorb domain copy over penalty. Also, performs minor tuning for existing AVX512 instruction sequence based on VPTERNLOG instruction. 
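[Editorial illustration] The "logical operations" mentioned in the quoted text amount to sign-bit masking on the raw IEEE 754 bit patterns. The sketch below shows that generic idiom in plain Java. It is an assumption-labeled illustration: the class and constant names are hypothetical, it is not the JDK source, and it says nothing about the instruction sequence the patch emits.

```java
final class CopySignSketch {
    // Hypothetical names for the IEEE 754 sign-bit masks.
    static final int FLOAT_SIGN_MASK = 0x8000_0000;
    static final long DOUBLE_SIGN_MASK = 0x8000_0000_0000_0000L;

    // Keep exponent and significand of magnitude, take the sign bit of sign.
    static float copySign(float magnitude, float sign) {
        int magnitudeBits = Float.floatToRawIntBits(magnitude) & ~FLOAT_SIGN_MASK;
        int signBit = Float.floatToRawIntBits(sign) & FLOAT_SIGN_MASK;
        return Float.intBitsToFloat(magnitudeBits | signBit);
    }

    // Same idea for double, using 64-bit masks.
    static double copySign(double magnitude, double sign) {
        long magnitudeBits = Double.doubleToRawLongBits(magnitude) & ~DOUBLE_SIGN_MASK;
        long signBit = Double.doubleToRawLongBits(sign) & DOUBLE_SIGN_MASK;
        return Double.longBitsToDouble(magnitudeBits | signBit);
    }

    public static void main(String[] args) {
        System.out.println(copySign(1.5f, -0.0f));
        System.out.println(copySign(-2.0, 3.0));
    }
}
```

Written this way, each call moves the floating-point inputs into integer values, applies the masks, and moves the result back; that round trip between domains is the kind of cost the quoted description attributes to the non-intrinsified path.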
>> >> Following are the performance numbers of the following existing microbenchmark >> https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/vm/compiler/Signum.java >> >> Patch passes following validation test >> [test/jdk/java/lang/Math/IeeeRecommendedTests.java >> ](https://github.com/openjdk/jdk/blob/master/test/jdk/java/lang/Math/IeeeRecommendedTests.java) >> >> >> Granite Rapids-AP (P-core Xeon) >> Baseline AVX512: >> Benchmark Mode Cnt Score Error Units >> Signum._5_copySignFloatTest thrpt 2 1296.141 ops/ns >> Signum._7_copySignDoubleTest thrpt 2 838.954 ops/ns >> >> Withopt : >> Benchmark Mode Cnt Score Error Units >> Signum._5_copySignFloatTest thrpt 2 940.240 ops/ns >> Signum._7_copySignDoubleTest thrpt 2 967.370 ops/ns >> >> Baseline AVX2: >> Benchmark Mode Cnt Score Error Units >> Signum._5_copySignFloatTest thrpt 2 63.673 ops/ns >> Signum._7_copySignDoubleTest thrpt 2 26.898 ops/ns >> >> Withopt : >> Benchmark Mode Cnt Score Error Units >> Signum._5_copySignFloatTest thrpt 2 785.801 ops/ns >> Signum._7_copySignDoubleTest thrpt 2 558.710 ops/ns >> >> Sierra Forest (E-core Xeon) >> Baseline: >> Benchmark (seed) Mode Cnt Score Error Units >> o.o.b.vm.compiler.Signum._5_copySignFloatTest N/A thrpt 2 40.528 ops/ns >> o.o.b.vm.compiler.Signum._7_copySignDoubleTest N/A thrpt 2 25.101 ops/ns >> >> Withopt: >> Benchmark (seed) Mode Cnt Score Error Units >> o.o.b.vm.compiler.Signum._5_copySignFloatTest N/A thrpt 2 676.... 
> > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Adding IR framework verification test `compiler/intrinsics/math/TestCopySignIntrinsic.java` fails with `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation` on Mac x64: 1) Method "public void compiler.intrinsics.math.TestCopySignIntrinsic.testCopySignD()" - [Failed IR rules: 1]: * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#COPYSIGN_D#_", " >0 "}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={"avx", "true"}, applyIfAnd={}, applyIfNot={})" > Phase "PrintIdeal": - counts: Graph contains wrong number of nodes: * Constraint 1: "(\\d+(\\s){2}(CopySignD.*)+(\\s){2}===.*)" - Failed comparison: [found] 0 > 0 [given] - No nodes matched! 2) Method "public void compiler.intrinsics.math.TestCopySignIntrinsic.testCopySignF()" - [Failed IR rules: 1]: * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#COPYSIGN_F#_", " >0 "}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={"avx", "true"}, applyIfAnd={}, applyIfNot={})" > Phase "PrintIdeal": - counts: Graph contains wrong number of nodes: * Constraint 1: "(\\d+(\\s){2}(CopySignF.*)+(\\s){2}===.*)" - Failed comparison: [found] 0 > 0 [given] - No nodes matched! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23386#issuecomment-2635945209 From duke at openjdk.org Wed Feb 5 07:56:18 2025 From: duke at openjdk.org (kuaiwei) Date: Wed, 5 Feb 2025 07:56:18 GMT Subject: RFR: 8347405: MergeStores with reverse bytes order value [v4] In-Reply-To: <-hI5Z8mnAi3ajR0NYQwSuM8iqyTs3utRNMz6szyycTY=.c52ae714-b859-4c0c-8dd7-1d560eec353c@github.com> References: <_t_76WPegk92hZEoZzPQlHbH0ZTIwwNH4z7dSxnU4Bo=.7ef2513e-7546-47ae-828b-b5279af74cb7@github.com> <1oq_oOYXwwRDql746qcBSGF6CTz7zgZc3pHFjaYgnQo=.159fc324-1761-4fa8-8b66-417f3ed6465c@github.com> <2HpaEafIZF40rcuOXNEI0Xqi9SfIbbKyQgas1PmNG4k=.b8437913-34bc-4a0a-b80b-3eeea93d23cb@github.com> <-hI5Z8mnAi3ajR0NYQwSuM8iqyTs3utRNMz6szyycTY=.c52ae714-b859-4c0c-8dd7-1d560eec353c@github.com> Message-ID: On Fri, 31 Jan 2025 06:42:56 GMT, Emanuel Peter wrote: >>> @kuaiwei Thanks for agreeing to separate out the additional improvement! >>> >>> Is this ready for a next round of reviews? >> >> @eme64 , yes, I think the patch is ready for review. Could you take time to check it? Thanks. > > @kuaiwei Sounds good. Thanks for all the work you put in. These things tend to come out more complicated than one first thinks. > > I won't have time today, but I ran testing for commit 16 / v10. Feel free to ping me after the weekend, then I can have a look at the tests, and either tell you about the failures or review the code ;) @eme64 , Last week I was on Chinese New Year vacation. Now I'm back. How did your testing go? Were there any problems? And could you review it? Thanks.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23030#issuecomment-2635992746 From epeter at openjdk.org Wed Feb 5 08:19:19 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 08:19:19 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v17] In-Reply-To: References: Message-ID: On Sun, 2 Feb 2025 21:36:03 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. >> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: >> >> >> ((base + index << 8) + 256) & 255 >> => MulNode (this PR) >> (base + index << 8) & 255 >> => MulNode (PR #6697) >> base & 255 (loop invariant) >> >> >> Implementation notes: >> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. >> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ >> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). 
Let me know if would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > jlong, not long I have a few more coming, but need to hop off the train real quick ;) src/hotspot/share/opto/mulnode.cpp line 2059: > 2057: > 2058: // Returns a lower bound on the number of trailing zeros in expr, or -1 if the number > 2059: // cannot be determined. Why not just return `0` if we cannot determine it? That would still be a correct lower bound, right? src/hotspot/share/opto/mulnode.cpp line 2081: > 2079: const TypeInt* rhs_t = phase->type(expr->in(2))->isa_int(); > 2080: if (rhs_t == nullptr || !rhs_t->is_con()) { > 2081: return -1; Suggestion: // Pattern: expr = (x << shift) if (expr->Opcode() == Op_LShift(bt)) { const TypeInt* shift_t = phase->type(expr->in(2))->isa_int(); if (shift_t == nullptr || !shift_t->is_con()) { return -1; src/hotspot/share/opto/mulnode.cpp line 2083: > 2081: return -1; > 2082: } > 2083: return rhs_t->get_con() & (type2aelembytes(bt) * BitsPerByte - 1); Suggestion: // We need to truncate the shift, as it may not have been canonicalized yet. // T_INT: 0..31 -> shift_mask = 4 * 8 - 1 = 31 // T_LONG: 0..63 -> shift_mask = 8 * 8 - 1 = 63 jint shift_mask = type2aelembytes(bt) * BitsPerByte - 1; return shift_t->get_con() & shift_mask; src/hotspot/share/opto/mulnode.cpp line 2102: > 2100: // (AndL (ConL (_ << #N)) #M) > 2101: // The M and N values must satisfy ((-1 << N) & M) == 0. > 2102: static bool AndIL_is_zero_element_under_mask(const PhaseGVN* phase, const Node* expr, const Node* mask, BasicType bt) { Suggestion: // Checks whether expr is neutral element (zero) under mask. We have: // (AndX expr mask) // The X in AndX must be I or L, depending on bt. 
// // We split the bits of expr into MSB and LSB, where LSB represents // all trailing zeros of expr: // MSB LSB // xxxxxx 0000000000 // // We check if the mask has no one bits in the corresponding higher // bits, i.e. if the number of trailing zeros is larger or equal to the // bit width of the expr, i.e. if the number of leading zeros for mask // is greater or equal to the number of bits in MSB: // 000000 00000yyyyy -> (AndX expr mask) = 0 -> return true // 0000yy yyyyyyyyyy -> (AndX expr mask) = 0000zz 0000000000 -> return false // static bool AndIL_is_zero_element_under_mask(const PhaseGVN* phase, const Node* expr, const Node* mask, BasicType bt) { src/hotspot/share/opto/mulnode.cpp line 2104: > 2102: static bool AndIL_is_zero_element_under_mask(const PhaseGVN* phase, const Node* expr, const Node* mask, BasicType bt) { > 2103: jint expr_trailing_zeros = AndIL_min_trailing_zeros(phase, expr, bt); > 2104: if (expr_trailing_zeros < 0) { It feels a little strange that the number of trailing zeros could be negative... That's why I would return 0 if we can prove nothing. It is still clear that we can do nothing here if it is zero, so we can just compare `<= 0`. Or what was the reason for returning `-1`? src/hotspot/share/opto/mulnode.cpp line 2111: > 2109: if (mask_t == nullptr || mask_t->lo_as_long() < 0) { > 2110: return false; > 2111: } Suggestion: // When the mask is negative, it has the most significant bit set. const TypeInteger* mask_t = phase->type(mask)->isa_integer(bt); if (mask_t == nullptr || mask_t->lo_as_long() < 0) { return false; } src/hotspot/share/opto/mulnode.cpp line 2113: > 2111: } > 2112: > 2113: jint mask_bit_width = mask_t->hi_as_long() == 0 ? 0 : (BitsPerLong - count_leading_zeros(mask_t->hi_as_long())); I would just split this into 2 separate things. Suggestion: // Is the mask always zero? 
if (mask_t->hi_as_long() == 0) { assert(mask_t->lo_as_long() == 0, "checked earlier"); return true; } jint mask_bit_width = BitsPerLong - count_leading_zeros(mask_t->hi_as_long()); ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22856#pullrequestreview-2594812563 PR Review Comment: https://git.openjdk.org/jdk/pull/22856#discussion_r1942353461 PR Review Comment: https://git.openjdk.org/jdk/pull/22856#discussion_r1942358554 PR Review Comment: https://git.openjdk.org/jdk/pull/22856#discussion_r1942370863 PR Review Comment: https://git.openjdk.org/jdk/pull/22856#discussion_r1942402331 PR Review Comment: https://git.openjdk.org/jdk/pull/22856#discussion_r1942405802 PR Review Comment: https://git.openjdk.org/jdk/pull/22856#discussion_r1942407646 PR Review Comment: https://git.openjdk.org/jdk/pull/22856#discussion_r1942411441 From epeter at openjdk.org Wed Feb 5 08:19:20 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 08:19:20 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v17] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 08:02:21 GMT, Emanuel Peter wrote: >> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: >> >> jlong, not long > > src/hotspot/share/opto/mulnode.cpp line 2102: > >> 2100: // (AndL (ConL (_ << #N)) #M) >> 2101: // The M and N values must satisfy ((-1 << N) & M) == 0. >> 2102: static bool AndIL_is_zero_element_under_mask(const PhaseGVN* phase, const Node* expr, const Node* mask, BasicType bt) { > > Suggestion: > > // Checks whether expr is neutral element (zero) under mask. We have: > // (AndX expr mask) > // The X in AndX must be I or L, depending on bt. > // > // We split the bits of expr into MSB and LSB, where LSB represents > // all trailing zeros of expr: > // MSB LSB > // xxxxxx 0000000000 > // > // We check if the mask has no one bits in the corresponding higher > // bits, i.e. 
if the number of trailing zeros is larger or equal to the > // bit width of the expr, i.e. if the number of leading zeros for mask > // is greater or equal to the number of bits in MSB: > // 000000 00000yyyyy -> (AndX expr mask) = 0 -> return true > // 0000yy yyyyyyyyyy -> (AndX expr mask) = 0000zz 0000000000 -> return false > // > static bool AndIL_is_zero_element_under_mask(const PhaseGVN* phase, const Node* expr, const Node* mask, BasicType bt) { I would leave any details about `addition` to the use of this function, and any discussion how we find the trailing zeros to `AndIL_min_trailing_zeros`. Otherwise it's a little confusing. It's nice to have examples, and give the reader an intuition of what you are doing in the logic below. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22856#discussion_r1942403973 From epeter at openjdk.org Wed Feb 5 08:19:20 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 08:19:20 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v17] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 08:03:43 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/mulnode.cpp line 2102: >> >>> 2100: // (AndL (ConL (_ << #N)) #M) >>> 2101: // The M and N values must satisfy ((-1 << N) & M) == 0. >>> 2102: static bool AndIL_is_zero_element_under_mask(const PhaseGVN* phase, const Node* expr, const Node* mask, BasicType bt) { >> >> Suggestion: >> >> // Checks whether expr is neutral element (zero) under mask. We have: >> // (AndX expr mask) >> // The X in AndX must be I or L, depending on bt. >> // >> // We split the bits of expr into MSB and LSB, where LSB represents >> // all trailing zeros of expr: >> // MSB LSB >> // xxxxxx 0000000000 >> // >> // We check if the mask has no one bits in the corresponding higher >> // bits, i.e. if the number of trailing zeros is larger or equal to the >> // bit width of the expr, i.e. 
if the number of leading zeros for mask >> // is greater or equal to the number of bits in MSB: >> // 000000 00000yyyyy -> (AndX expr mask) = 0 -> return true >> // 0000yy yyyyyyyyyy -> (AndX expr mask) = 0000zz 0000000000 -> return false >> // >> static bool AndIL_is_zero_element_under_mask(const PhaseGVN* phase, const Node* expr, const Node* mask, BasicType bt) { > > I would leave any details about `addition` to the use of this function, and any discussion how we find the trailing zeros to `AndIL_min_trailing_zeros`. Otherwise it's a little confusing. > > It's nice to have examples, and give the reader an intuition of what you are doing in the logic below. Feel free to tweak the description further ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22856#discussion_r1942418028 From chagedorn at openjdk.org Wed Feb 5 08:26:42 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 5 Feb 2025 08:26:42 GMT Subject: RFR: 8346777: Add missing const declarations and rename variables [v2] In-Reply-To: References: Message-ID: > This simple patch adds some missing `const` and applies variable renamings and parameter reorderings > > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: update ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23434/files - new: https://git.openjdk.org/jdk/pull/23434/files/502aa810..8ce9d89b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23434&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23434&range=00-01 Stats: 44 lines in 2 files changed: 37 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/23434.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23434/head:pull/23434 PR: https://git.openjdk.org/jdk/pull/23434 From chagedorn at openjdk.org Wed Feb 5 08:26:42 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 5 Feb 2025 08:26:42 GMT Subject: RFR: 8346777: Add 
missing const declarations and rename variables In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 09:43:12 GMT, Christian Hagedorn wrote: > This simple patch adds some missing `const` and applies variable renamings and parameter reorderings > > Thanks, > Christian I undid the removal of `ReplaceInitAndStrideStrategy` that I extracted too early from the full patch for Assertion Predicates. This patch now becomes a simple cleanup patch (see updated description above). ------------- PR Comment: https://git.openjdk.org/jdk/pull/23434#issuecomment-2636047008 From bkilambi at openjdk.org Wed Feb 5 08:28:16 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 5 Feb 2025 08:28:16 GMT Subject: RFR: 8348659: AArch64: IR rule failure with compiler/loopopts/superword/TestSplitPacks.java In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 11:19:27 GMT, Bhavana Kilambi wrote: > "test5a" in this file fails on Graviton3 (32B, SVE) as the compiler fails to match IR rules for vector size 2. This is because the minimum vector size for aarch64 machines is 8B and it does not support generation of vectors of 2 short values. > > Modified the IR rules to have two separate rules - one for sse4.1 and another for sve. > > The test now passes on Graviton3. Thank you ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23385#issuecomment-2636047920 From duke at openjdk.org Wed Feb 5 08:28:16 2025 From: duke at openjdk.org (duke) Date: Wed, 5 Feb 2025 08:28:16 GMT Subject: RFR: 8348659: AArch64: IR rule failure with compiler/loopopts/superword/TestSplitPacks.java In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 11:19:27 GMT, Bhavana Kilambi wrote: > "test5a" in this file fails on Graviton3 (32B, SVE) as the compiler fails to match IR rules for vector size 2. This is because the minimum vector size for aarch64 machines is 8B and it does not support generation of vectors of 2 short values. 
> > Modified the IR rules to have two separate rules - one for sse4.1 and another for sve. > > The test now passes on Graviton3. @Bhavana-Kilambi Your change (at version 486df1f6df17e1320a436bcf3cc8221b3a05c7d9) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23385#issuecomment-2636050130 From bkilambi at openjdk.org Wed Feb 5 08:40:16 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 5 Feb 2025 08:40:16 GMT Subject: Integrated: 8348659: AArch64: IR rule failure with compiler/loopopts/superword/TestSplitPacks.java In-Reply-To: References: Message-ID: On Fri, 31 Jan 2025 11:19:27 GMT, Bhavana Kilambi wrote: > "test5a" in this file fails on Graviton3 (32B, SVE) as the compiler fails to match IR rules for vector size 2. This is because the minimum vector size for aarch64 machines is 8B and it does not support generation of vectors of 2 short values. > > Modified the IR rules to have two separate rules - one for sse4.1 and another for sve. > > The test now passes on Graviton3. This pull request has now been integrated. Changeset: 66a38984 Author: Bhavana Kilambi Committer: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/66a3898448023f1f22da7d7cbcf4c79a0eb59963 Stats: 13 lines in 1 file changed: 10 ins; 0 del; 3 mod 8348659: AArch64: IR rule failure with compiler/loopopts/superword/TestSplitPacks.java Reviewed-by: shade, epeter ------------- PR: https://git.openjdk.org/jdk/pull/23385 From qxing at openjdk.org Wed Feb 5 09:00:38 2025 From: qxing at openjdk.org (Qizheng Xing) Date: Wed, 5 Feb 2025 09:00:38 GMT Subject: RFR: 8347499: C2: Make `PhaseIdealLoop` eliminate more redundant safepoints in loops [v2] In-Reply-To: References: Message-ID: > In `PhaseIdealLoop`, `IdealLoopTree::check_safepts` method checks if any call that is guaranteed to have a safepoint dominates the tail of the loop. In the previous implementation, `check_safepts` would stop if it found a local non-call safepoint. 
At this time, if there was a call before the safepoint in the dom-path, this safepoint would not be eliminated. > > loop-safepoint > > This patch changes the behavior of `check_safepts` to not stop when it finds a non-local safepoint. This makes simple loops with one method call ~3.8% faster (on aarch64). > > > Benchmark Mode Cnt Score Error Units > LoopSafepoint.loopVar avgt 15 208296.259 ± 1350.409 ns/op # baseline > LoopSafepoint.loopVar avgt 15 200692.874 ± 616.770 ns/op # this patch > > > Testing: tier1-2 on x86_64 and aarch64. Qizheng Xing has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Merge branch 'master' into enhance-loop-safepoint-elim - Add IR test and microbench. - Make `PhaseIdealLoop` eliminate more redundant safepoints in loops. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23057/files - new: https://git.openjdk.org/jdk/pull/23057/files/23ef6aab..56983ed5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23057&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23057&range=00-01 Stats: 84570 lines in 4179 files changed: 35485 ins; 30248 del; 18837 mod Patch: https://git.openjdk.org/jdk/pull/23057.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23057/head:pull/23057 PR: https://git.openjdk.org/jdk/pull/23057 From qxing at openjdk.org Wed Feb 5 09:00:38 2025 From: qxing at openjdk.org (Qizheng Xing) Date: Wed, 5 Feb 2025 09:00:38 GMT Subject: RFR: 8347499: C2: Make `PhaseIdealLoop` eliminate more redundant safepoints in loops In-Reply-To: <85FwNDVE3ZSF5HXyYfk54eM7xJVg5dw_5ztXqMG8els=.bc02130d-0988-482a-83fe-a4d76ec6730f@github.com> References: <85FwNDVE3ZSF5HXyYfk54eM7xJVg5dw_5ztXqMG8els=.bc02130d-0988-482a-83fe-a4d76ec6730f@github.com> Message-ID: On Fri, 31 Jan 2025 09:48:39 GMT, Tobias Hartmann wrote: >> In
`PhaseIdealLoop`, `IdealLoopTree::check_safepts` method checks if any call that is guaranteed to have a safepoint dominates the tail of the loop. In the previous implementation, `check_safepts` would stop if it found a local non-call safepoint. At this time, if there was a call before the safepoint in the dom-path, this safepoint would not be eliminated. >> >> loop-safepoint >> >> This patch changes the behavior of `check_safepts` to not stop when it finds a non-local safepoint. This makes simple loops with one method call ~3.8% faster (on aarch64). >> >> >> Benchmark Mode Cnt Score Error Units >> LoopSafepoint.loopVar avgt 15 208296.259 ± 1350.409 ns/op # baseline >> LoopSafepoint.loopVar avgt 15 200692.874 ± 616.770 ns/op # this patch >> >> >> Testing: tier1-2 on x86_64 and aarch64. > > @MaxXSoft Could you please merge with master so that we can run some testing? Thanks. @TobiHartmann OK, merged. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23057#issuecomment-2636113527 From gcao at openjdk.org Wed Feb 5 09:09:43 2025 From: gcao at openjdk.org (Gui Cao) Date: Wed, 5 Feb 2025 09:09:43 GMT Subject: RFR: 8349428: RISC-V: "bad alignment" with -XX:-AvoidUnalignedAccesses after JDK-8347489 Message-ID: Hi, please review this small change fixing an assertion error. As the alignment of the loading addresses is only ensured under -XX:-AvoidUnalignedAccesses, we should only enable the related assertions about the alignment under this option. ### Testing - [x] Sanity tested with -XX:-AvoidUnalignedAccesses using fastdebug build.
- [ ] Run tier1 tests on SOPHON SG2042 (fastdebug) - [ ] Run tier1-3 tests on Milk-V Megrez (release) ------------- Commit messages: - 8349428: RISC-V: "bad alignment" with -XX:-AvoidUnalignedAccesses after JDK-8347489 Changes: https://git.openjdk.org/jdk/pull/23459/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23459&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349428 Stats: 35 lines in 2 files changed: 14 ins; 0 del; 21 mod Patch: https://git.openjdk.org/jdk/pull/23459.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23459/head:pull/23459 PR: https://git.openjdk.org/jdk/pull/23459 From epeter at openjdk.org Wed Feb 5 09:21:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 09:21:17 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v17] In-Reply-To: References: Message-ID: On Sun, 2 Feb 2025 21:36:03 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. >> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: >> >> >> ((base + index << 8) + 256) & 255 >> => MulNode (this PR) >> (base + index << 8) & 255 >> => MulNode (PR #6697) >> base & 255 (loop invariant) >> >> >> Implementation notes: >> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. 
>> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ >> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > jlong, not long It would be nice if we had some kind of "proof" here. `is zero wrt addition under mask`: Maybe this is a well-known mathematical property I'm not super familiar with, and then it would be good to define it more properly here for everybody ;) I'm a little worried that it is not straightforward to prove without making some assumptions about `mask` and `add1` that we only check inside of `AndIL_is_zero_element_under_mask`. That may be ok for now, but what if we somehow can improve `AndIL_is_zero_element_under_mask` at some point to also handle this case: mask = 0000 1110 add1 = 1111 0001 We could probably strengthen `AndIL_is_zero_element_under_mask` to detect that `add1 & mask = 0`. It is the zero element with the mask, so that seems ok now. But then what if `add2 = 0000 0011`? `(add1 + add2) & mask = (1111 0100) & (0000 1110) = 0000 0100` I'm not saying it is not correct now, but out here we make assumptions about the implementation of `AndIL_is_zero_element_under_mask` that the next person might not be aware of. Maybe we can adjust the name of `AndIL_is_zero_element_under_mask` somehow, so that the assumption is made explicit? Hmm, ok now I see why you mentioned the stuff about `Checks whether expr is neutral wrt addition under mask` in the description of `AndIL_is_zero_element_under_mask`.... Well ok then maybe we need to revisit my suggestion there to drop that stuff. But it needs to be more clear why it is there, and what it guarantees.
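The counterexample above can be checked mechanically. What follows is a minimal sketch, assuming 8-bit values for readability; the helper names (`trailing_zeros`, `bit_width`, `and_is_zero`, `trailing_zeros_cover_mask`, `can_drop_add1`, `check_examples`) are hypothetical and are not HotSpot functions. It suggests that `(add1 & mask) == 0` alone does not allow dropping `add1`, while the trailing-zeros condition described in the review comments does: if `add1` is a multiple of `2^n` and `mask < 2^n`, adding `add1` cannot change the low `n` bits of the sum, and those are the only bits the mask keeps.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only; none of these exist in HotSpot.

// Number of trailing zero bits in an 8-bit value (8 for zero).
constexpr int trailing_zeros(uint8_t v) {
  if (v == 0) return 8;
  int n = 0;
  while ((v & 1) == 0) { v >>= 1; ++n; }
  return n;
}

// Number of bits needed to represent v (0 for zero).
constexpr int bit_width(uint8_t v) {
  int n = 0;
  while (v != 0) { v >>= 1; ++n; }
  return n;
}

// Weak condition: add1 has no set bit in common with mask.
constexpr bool and_is_zero(uint8_t add1, uint8_t mask) {
  return (add1 & mask) == 0;
}

// Strong condition: every set bit of add1 lies above every set bit of mask.
constexpr bool trailing_zeros_cover_mask(uint8_t add1, uint8_t mask) {
  return trailing_zeros(add1) >= bit_width(mask);
}

// Exhaustively checks (add1 + add2) & mask == add2 & mask for all 8-bit add2.
constexpr bool can_drop_add1(uint8_t add1, uint8_t mask) {
  for (int add2 = 0; add2 < 256; ++add2) {
    if ((((add1 + add2) & 0xFF) & mask) != (add2 & mask)) return false;
  }
  return true;
}

// Checks the counterexample from the review and one value that meets the strong condition.
void check_examples() {
  const uint8_t mask = 0x0E;  // 0000 1110
  const uint8_t bad  = 0xF1;  // 1111 0001
  const uint8_t good = 0x10;  // 0001 0000
  assert(and_is_zero(bad, mask));                 // the weak condition holds ...
  assert(!trailing_zeros_cover_mask(bad, mask));  // ... the strong one does not ...
  assert(!can_drop_add1(bad, mask));              // ... e.g. add2 = 0000 0011 gives 0000 0100, not 0000 0010
  assert(trailing_zeros_cover_mask(good, mask));
  assert(can_drop_add1(good, mask));
}
```

Calling `check_examples()` runs the assertions; the exhaustive loop is only practical because the sketch is restricted to 8 bits.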
Maybe we can rename the method to: `AndIL_is_expr_additive_neutral_element_under_mask` I'll leave this to you to think about. You may have a better idea. But what I would like to see: - Good definition(s) - Proof (if possible formal) src/hotspot/share/opto/mulnode.cpp line 2122: > 2120: // Because the AddX operands can come in either > 2121: // order, we check for both orders. > 2122: Node* MulNode::AndIL_sum_and_mask(PhaseGVN* phase, BasicType bt) { Suggestion: // Pattern: // (AndX (AddX add1 add2) mask) // // Assume: // (AndX add1 mask) == 0 // // ... prove why we know that we can return: // (AndX add2 mask) Node* MulNode::AndIL_sum_and_mask(PhaseGVN* phase, BasicType bt) { ------------- PR Review: https://git.openjdk.org/jdk/pull/22856#pullrequestreview-2595070278 PR Review Comment: https://git.openjdk.org/jdk/pull/22856#discussion_r1942504257 From epeter at openjdk.org Wed Feb 5 09:25:19 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 09:25:19 GMT Subject: RFR: 8347405: MergeStores with reverse bytes order value [v4] In-Reply-To: References: <_t_76WPegk92hZEoZzPQlHbH0ZTIwwNH4z7dSxnU4Bo=.7ef2513e-7546-47ae-828b-b5279af74cb7@github.com> <1oq_oOYXwwRDql746qcBSGF6CTz7zgZc3pHFjaYgnQo=.159fc324-1761-4fa8-8b66-417f3ed6465c@github.com> <2HpaEafIZF40rcuOXNEI0Xqi9SfIbbKyQgas1PmNG4k=.b8437913-34bc-4a0a-b80b-3eeea93d23cb@github.com> <-hI5Z8mnAi3ajR0NYQwSuM8iqyTs3utRNMz6szyycTY=.c52ae714-b859-4c0c-8dd7-1d560eec353c@github.com> Message-ID: On Wed, 5 Feb 2025 07:53:58 GMT, kuaiwei wrote: >> @kuaiwei Sounds good. Thanks for all the work you put in. These things tend to come out more complicated than one first thinks ? >> >> I won't have time today, but I ran testing for commit 16 / v10. Feel free to ping me after the weekend, then I can have a look at the tests, and either tell you about the failures or review the code ;) > > @eme64 , Last week I'm in Chinese new year vocation. Now I'm back. How about your testing? Is there any problem? 
And could you review it? Thanks. @kuaiwei The tests have all passed, great job! I hope to review the code soon, but I have a lot of reviews on my plate right now, so I ask you for patience. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23030#issuecomment-2636176460 From epeter at openjdk.org Wed Feb 5 09:28:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 09:28:12 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts In-Reply-To: References: Message-ID: <7YOG7L0dZP0iEJzDPMW62pv8QYaZ8dJbfn3FbK5T-Vw=.02a740fe-31a6-484f-ad82-412cdca80155@github.com> On Tue, 4 Feb 2025 19:40:59 GMT, Jasmine Karthikeyan wrote: >> test/micro/org/openjdk/bench/vm/compiler/VectorSubword.java line 73: >> >>> 71: } >>> 72: } >>> 73: } >> >> Ah, these are all casting to smaller types. What about casting to larger types? > > Originally I thought that larger type conversions were only available on AVX-512 based on the ad-file, but reading it more carefully now I see that they are indeed supported for AVX! I'll go ahead and generalize the changes further to make casting to larger types supported as well. I think this leaves `char` as the only type not supported, which I can look at in a follow-up RFE. Well it's ok to leave the patch as is, and just have a follow-up RFE that improves things from here. It could make things easier to review.
@jaskarth ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1942516088 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1942516705 From epeter at openjdk.org Wed Feb 5 09:30:21 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 09:30:21 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: <067rRrzD6d7ZDU-HYPHQ-qVhPygP_3WqrrgZvikgjIc=.98110421-5c91-492a-8f35-a9544cde6189@github.com> References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> <067rRrzD6d7ZDU-HYPHQ-qVhPygP_3WqrrgZvikgjIc=.98110421-5c91-492a-8f35-a9544cde6189@github.com> Message-ID: On Tue, 4 Feb 2025 18:52:13 GMT, Quan Anh Mai wrote: >> Can you quickly explain this change from `tak != TypeInstKlassPtr::OBJECT` so I don't need to investigate myself, please? > >> Looks like an implicit nullptr check. Not allowed by code style ;) > > But the verb here is `isa` and we use these as a `bool` a lot, though :/ > >> Can you quickly explain this change from tak != TypeInstKlassPtr::OBJECT so I don't need to investigate myself, please? > > The bottom type of an array can be either `Object` or an array of some kind, so `tak != TypeInstKlassPtr::OBJECT` is the same as `tak->isa_aryklassptr()`. Ah great, thanks for the explanation! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1942520434 From epeter at openjdk.org Wed Feb 5 09:35:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 09:35:17 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: On Thu, 30 Jan 2025 17:11:08 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: >> >> // We are allowed to use the constant type only if cast succeeded >> >> But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. >> >> Please take a look and leave your reviews, thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > format Looks good, thanks for the explanations! I see we did not yet run internal tests for the last commit, though it is only formatting, so most likely ok. But the state of the code is also 2 weeks old, so it would be good if you merged and launched testing again before integration, just in case ;) ------------- Marked as reviewed by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23274#pullrequestreview-2595110848 From epeter at openjdk.org Wed Feb 5 09:35:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 09:35:18 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: On Tue, 4 Feb 2025 18:54:21 GMT, Quan Anh Mai wrote: >> src/hotspot/share/opto/parseHelper.cpp line 193: >> >>> 191: // See issue JDK-8057622 for details. >>> 192: >>> 193: always_see_exact_class = true; >> >> Why is it ok to remove this? >> If this branch is not taken, it used to be `false`, and would lead to something different below... > > The only use of this is to decide if we need to attach a control input to the `LoadKlass`. As the control input is not needed, this can be removed. Got it, thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1942528816 From duke at openjdk.org Wed Feb 5 09:42:13 2025 From: duke at openjdk.org (kuaiwei) Date: Wed, 5 Feb 2025 09:42:13 GMT Subject: RFR: 8347405: MergeStores with reverse bytes order value [v4] In-Reply-To: References: <_t_76WPegk92hZEoZzPQlHbH0ZTIwwNH4z7dSxnU4Bo=.7ef2513e-7546-47ae-828b-b5279af74cb7@github.com> <1oq_oOYXwwRDql746qcBSGF6CTz7zgZc3pHFjaYgnQo=.159fc324-1761-4fa8-8b66-417f3ed6465c@github.com> <2HpaEafIZF40rcuOXNEI0Xqi9SfIbbKyQgas1PmNG4k=.b8437913-34bc-4a0a-b80b-3eeea93d23cb@github.com> <-hI5Z8mnAi3ajR0NYQwSuM8iqyTs3utRNMz6szyycTY=.c52ae714-b859-4c0c-8dd7-1d560eec353c@github.com> Message-ID: On Wed, 5 Feb 2025 09:22:41 GMT, Emanuel Peter wrote: >> @eme64 , Last week I'm in Chinese new year vocation. Now I'm back. How about your testing? Is there any problem? And could you review it? Thanks. > > @kuaiwei The tests have all passed, great job! 
> I hope to review the code soon, but I have a lot of reviews on my plate right now, so I ask you for patience ? ? @eme64 No problem. I'm not in a hurry. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23030#issuecomment-2636215134 From epeter at openjdk.org Wed Feb 5 09:51:28 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 09:51:28 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v23] In-Reply-To: References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> Message-ID: On Thu, 30 Jan 2025 14:49:35 GMT, Jatin Bhateja wrote: >> Patch promotes the sharing of commutative vector IR with the same inputs but different input ordering. >> Similar to scalar IR where we perform edge swapping by [sorting inputs](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L122) based on node indices during IR idealization. >> >> Following are the performance stats for JMH micro included with the patch. 
>> >> >> Granite Rapids (P-core Xeon Server) >> Baseline : >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 8982.549 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 6072.773 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2368.856 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 15215.087 ops/ms >> >> Withopt: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 11963.554 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 7036.088 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2906.731 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 17148.131 ops/ms >> >> Sierra Forest (E-core Xeon Server) >> Baseline: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 2444.359 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 1710.256 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 308.766 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 3902.179 ops/ms >> >> Withopt: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.com... > > Jatin Bhateja has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > Lowering feature check to IR annotation level Changes requested by epeter (Reviewer). 
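The input sorting mentioned in the quoted PR description can be pictured with a toy value-numbering table. This is a sketch under assumptions, not C2 code: `ToyGvn`, `make_commutative`, `check_sharing`, and the integer opcodes and node indices are all hypothetical. It only illustrates why putting the inputs of a commutative operation into a canonical order lets two nodes that differ only in input order hash to the same entry.

```cpp
#include <cassert>
#include <map>
#include <tuple>
#include <utility>

// Toy model of value numbering for two-input commutative operations.
// All names here are hypothetical and unrelated to the real C2 classes.
struct ToyGvn {
  // (opcode, smaller input index, larger input index) -> node index
  std::map<std::tuple<int, int, int>, int> table;
  int next_idx = 100;

  // Returns the index of an existing equivalent node, or creates a new one.
  int make_commutative(int opcode, int in1, int in2) {
    // Canonicalize: order the inputs by index so that op(a, b) and
    // op(b, a) produce the same lookup key.
    if (in1 > in2) std::swap(in1, in2);
    const auto key = std::make_tuple(opcode, in1, in2);
    const auto it = table.find(key);
    if (it != table.end()) return it->second;
    table.emplace(key, next_idx);
    return next_idx++;
  }
};

// Small self-check: swapped inputs share a node, different opcodes do not.
void check_sharing() {
  ToyGvn gvn;
  const int add_ab = gvn.make_commutative(/*opcode=*/1, 7, 3);
  const int add_ba = gvn.make_commutative(/*opcode=*/1, 3, 7);
  const int mul_ab = gvn.make_commutative(/*opcode=*/2, 7, 3);
  assert(add_ab == add_ba);
  assert(mul_ab != add_ab);
}
```

The sketch ignores predicated (masked) operations, which the review comments that follow discuss separately.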
src/hotspot/share/opto/vectornode.cpp line 1045: > 1043: } > 1044: > 1045: bool VectorNode::should_swap_inputs() { Suggestion: // If we have a AddVB(v1, v2) and AddVB(v2, v1), we want to swap the edges of one of them // so that they become identical, and can common in global value numbering. bool VectorNode::should_swap_inputs_to_help_global_value_numbering() { src/hotspot/share/opto/vectornode.cpp line 1058: > 1056: if (is_predicated_vector()) { > 1057: return false; > 1058: } Hmm. Can you give me a concrete example of a masked operation that would be filtered out? Can it for example be a `AddVI`? But that only has 2 inputs for `VEC1` and `VEC2`. Where would the mask be located - and why does that not get us to `req() > 3`? Ah, I see it can be added in `VectorNode::try_to_gen_masked_vector`, with `add_req`, but then we should have `req() > 3`. Ok, this looks a bit complicated, but it looks like we are doing this. // Generate a vector mask for vector operation whose vector length is lower than the // hardware supported max vector length. Ok, fine. It could be good to add a comment here though, explaining why the operation seemingly has 3 inputs, but we don't exit at `req() != 3` above. src/hotspot/share/opto/vectornode.cpp line 1105: > 1103: > 1104: // Sort inputs of commutative non-predicated vector operations to help value numbering. 
> 1105: if (should_swap_inputs()) { Suggestion: if (should_swap_inputs_to_help_global_value_numbering()) { src/hotspot/share/opto/vectornode.hpp line 91: > 89: static bool is_minmax_opcode(int opc); > 90: > 91: bool should_swap_inputs(); Suggestion: bool should_swap_inputs_to_help_global_value_numbering(); ------------- PR Review: https://git.openjdk.org/jdk/pull/22863#pullrequestreview-2595115845 PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1942531991 PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1942553076 PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1942531523 PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1942532338 From gcao at openjdk.org Wed Feb 5 10:36:52 2025 From: gcao at openjdk.org (Gui Cao) Date: Wed, 5 Feb 2025 10:36:52 GMT Subject: RFR: 8349428: RISC-V: "bad alignment" with -XX:-AvoidUnalignedAccesses after JDK-8347489 [v2] In-Reply-To: References: Message-ID: <3UFzISL6AR_wdZlxWIoBYohTg6Qa0Bgremw2cNHQ6Cg=.a2b7841c-76a0-4029-88e8-7e7095bae8d8@github.com> > Hi, please review this small change fixing an assertion error. > As the alignment of the loading addresses is only ensured under -XX:-AvoidUnalignedAccesses, we should only enable the related assertions about the alignment under this option. > > > ### Testing > - [x] Sanity tested with -XX:-AvoidUnalignedAccesses using fastdebug build.
> - [ ] Run tier1 tests on SOPHON SG2042 (fastdebug) > - [ ] Run tier1-3 tests on Milk-V Megrez (release) Gui Cao has updated the pull request incrementally with one additional commit since the last revision: Fix build ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23459/files - new: https://git.openjdk.org/jdk/pull/23459/files/4d925697..71b6ecc8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23459&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23459&range=00-01 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23459.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23459/head:pull/23459 PR: https://git.openjdk.org/jdk/pull/23459 From mablakatov at openjdk.org Wed Feb 5 11:20:59 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Wed, 5 Feb 2025 11:20:59 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v2] In-Reply-To: References: Message-ID: > Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, the existing ASIMD implementation is used. > > The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
> > Benchmarks results for an AArch64 CPU with support for SVE with 256-bit vector length: > > Benchmark (size) Mode Old New Units > Byte256Vector.MULLanes 1024 thrpt 502.498 10222.717 ops/ms > Double256Vector.MULLanes 1024 thrpt 172.116 3130.997 ops/ms > Float256Vector.MULLanes 1024 thrpt 291.612 4164.138 ops/ms > Int256Vector.MULLanes 1024 thrpt 362.276 3717.213 ops/ms > Long256Vector.MULLanes 1024 thrpt 184.826 2054.345 ops/ms > Short256Vector.MULLanes 1024 thrpt 379.231 5716.223 ops/ms > > > Benchmarks results for an AArch64 CPU with support for SVE with 512-bit vector length: > > Benchmark (size) Mode Old New Units > Byte512Vector.MULLanes 1024 thrpt 160.129 2630.600 ops/ms > Double512Vector.MULLanes 1024 thrpt 51.229 1033.284 ops/ms > Float512Vector.MULLanes 1024 thrpt 84.617 1658.400 ops/ms > Int512Vector.MULLanes 1024 thrpt 109.419 1180.310 ops/ms > Long512Vector.MULLanes 1024 thrpt 69.036 704.144 ops/ms > Short512Vector.MULLanes 1024 thrpt 131.029 1629.632 ops/ms Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: Use EXT instead of COMPACT to split a vector into two halves Benchmarks results: Neoverse-V1 (SVE 256-bit) Benchmark (size) Mode master PR Units ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms Fujitsu A64FX (SVE 512-bit) Benchmark (size) Mode master PR Units ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms DoubleMaxVector.MULLanes 1024 thrpt 
431.343 1646.792 ops/ms ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23181/files - new: https://git.openjdk.org/jdk/pull/23181/files/0a62dc33..c9dcc45f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=00-01 Stats: 140 lines in 7 files changed: 10 ins; 6 del; 124 mod Patch: https://git.openjdk.org/jdk/pull/23181.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181 PR: https://git.openjdk.org/jdk/pull/23181 From mablakatov at openjdk.org Wed Feb 5 11:30:18 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Wed, 5 Feb 2025 11:30:18 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation In-Reply-To: <1BK9bek0j5RjlBJsQUIyNwwQgcRoiIQiiob5SNuUQYw=.3ed57f8c-d589-4da0-ba98-5884450df886@github.com> References: <1BK9bek0j5RjlBJsQUIyNwwQgcRoiIQiiob5SNuUQYw=.3ed57f8c-d589-4da0-ba98-5884450df886@github.com> Message-ID: On Sat, 18 Jan 2025 09:03:19 GMT, Andrew Haley wrote: > Please provide info about which CPUs are benchmarked. How does this compare with Graviton 4? Hi @theRealAph , I've updated the description to reflect the former. As for the latter, nothing changes for Graviton 4, since for 128-bit vectors we keep using the existing Neon implementation. 
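As a rough scalar model of the strategy described in this PR (multiply the upper half of the vector into the lower half until a single short chunk remains), the halving step can be sketched in Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, `long` lanes are used so that wrap-around multiplication stays order-independent, the lane count is assumed to be a power of two, and the sketch keeps halving down to one lane rather than handing off to a separate 128-bit routine. It is not the HotSpot code.

```java
// Scalar model of a multiply-reduction done by repeated halving.
// Hypothetical names for illustration only; not the HotSpot implementation.
final class MulReduceSketch {

    // Multiplies all lanes together.
    // Assumption: lanes.length is a power of two and at least 1.
    static long mulReduce(long[] lanes) {
        long[] v = lanes.clone();      // leave the caller's array untouched
        int n = v.length;
        while (n > 1) {
            int half = n / 2;
            // Multiply the upper half into the lower half, lane by lane.
            for (int i = 0; i < half; i++) {
                v[i] *= v[i + half];
            }
            n = half;                  // only the lower half is still live
        }
        return v[0];
    }

    public static void main(String[] args) {
        System.out.println(mulReduce(new long[] {1, 2, 3, 4, 5, 6, 7, 8}));
    }
}
```

Each pass halves the number of live lanes, so an N-lane input needs log2(N) lane-wise multiplies instead of N - 1 sequential ones.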
------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-2636468714 From mablakatov at openjdk.org Wed Feb 5 11:40:09 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Wed, 5 Feb 2025 11:40:09 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v2] In-Reply-To: References: Message-ID: On Mon, 20 Jan 2025 03:35:44 GMT, Xiaohong Gong wrote: >> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: >> >> Use EXT instead of COMPACT to split a vector into two halves >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> Fujitsu A64FX (SVE 512-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2095: > >> 2093: // matter: a contiguous set of elements is moved and its size is a multiple of D RegVariant. 
>> 2094: sve_compact(vtmp1, D, vsrc, pgtmp1); >> 2095: sve_mul(vsrc, elemType_to_regVariant(bt), pgtmp2, vtmp1); > > Have you tried the SVE `EXT` instruction (https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/EXT--Extract-vector-from-pair-of-vectors-?lang=en), which I think could also help to shuffle the upper half elements to the lower half in a vector? If it works, I think these five instructions can be reduced to three, such as `ext, whilelo, mul`. Hi @XiaohongGong , thank you for the great suggestion! I've submitted https://github.com/openjdk/jdk/pull/23181/commits/c9dcc45f7f362f5af87f013715f0b55777472c78 to implement it. It gives up to ~30% performance improvement compared to the initially submitted implementation. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1942711747 From mdoerr at openjdk.org Wed Feb 5 11:57:09 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 5 Feb 2025 11:57:09 GMT Subject: RFR: 8348520: [s390x] Problemlist TestVectorReinterpret.java In-Reply-To: References: Message-ID: On Fri, 24 Jan 2025 03:58:12 GMT, Amit Kumar wrote: > Problem listing TestVectorReinterpret.java on s390x. Looks good and trivial. ------------- Marked as reviewed by mdoerr (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23288#pullrequestreview-2595461435 From thartmann at openjdk.org Wed Feb 5 12:18:08 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 5 Feb 2025 12:18:08 GMT Subject: RFR: 8348572: C2 compilation asserts due to unexpected irreducible loop [v3] In-Reply-To: References: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> Message-ID: On Wed, 5 Feb 2025 12:14:55 GMT, Emanuel Peter wrote: >> A quick summary: >> - In [JDK-8280126](https://bugs.openjdk.org/browse/JDK-8280126), we decided that we are only going to allow irreducible loops that were detected at parsing, and we can thus restrict optimizations to reducible loops which would be difficult to do correct with irreducible loops. That's why we added that assert that checks that no new irreducible loop shows up during compilation. >> - Problem: we use `split_if` for `IfNode::Ideal_common` to split through a Region that is loop-head, and the splitting of the Region introduces a second loop entry -> irreducible loop. >> >> Before `split_if`: >> ![image](https://github.com/user-attachments/assets/01bc78fa-7fed-4a8f-b6f4-078dac9b5dc4) >> >> After `split_if`: >> ![image](https://github.com/user-attachments/assets/1e3bd08e-b76d-4e7f-813e-27a5a22cb2bd) >> >> >> - We have the `split_if` for `IfNode::Ideal_common` to do split-if on straight-line code. But we currently execute this before loop-opts, and so we don't know if the region we split through is actually a loop head. We guard against LoopNode, but a Region only becomes a LoopNode in loop-opts. >> - We also have split-if in loop-opts, which is more careful about splitting through loop-heads. >> - Just removing the straight-line split-if probably leads to a regression, as the loop-opts version only executes if there are loops for example. >> - We could consider delaying the straight-line split-if until after loop-opts. 
But I don't know if that could lead to regressions in any way. >> >> I discussed this temporary solution with @TobiHartmann : >> - We would like [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) to be unblocked for @shipilev . >> - Convert the assert into a bailout-check, so we are sure we behave correctly in product. Compiling with irreducible loops behaves correctly in almost all cases, but there could be exceptions. >> - For now, have the assert behind a Verify flag, so that [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) is unblocked. Later, we can remove the Verify flag and alway enable the assert again. >> - This fix also looks easier to backport. >> >> ----------------------- >> >> The attached regression test now does **NOT** fail by default, but rather silently bails out of compilation. >> >> With the new debug flag `-XX:+VerifyNoNewIrreducibleLoops`, we still hit the assert, as expected: >> >> # Internal Error (/oracle-work/jdk-fork0/open/src/hotspot/share/opto/loopnode.cpp:5636), pid=3698055, tid=3698072 >> # asser... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/compiler/loopopts/TestSplitIfNewIrreducibleLoop.java > > Co-authored-by: Tobias Hartmann Marked as reviewed by thartmann (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/23363#pullrequestreview-2595518253 From epeter at openjdk.org Wed Feb 5 12:18:09 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 12:18:09 GMT Subject: RFR: 8348572: C2 compilation asserts due to unexpected irreducible loop [v2] In-Reply-To: References: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> <_3Sc6qth_bahQb2eOsLLf0mb9ATrPHGwM6GedAOAUyU=.bc9fc68f-d573-48af-a4ab-f9260422c460@github.com> Message-ID: On Wed, 5 Feb 2025 06:21:56 GMT, Tobias Hartmann wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> update for Vladimir > > src/hotspot/share/opto/loopnode.cpp line 5648: > >> 5646: if (!head->can_be_irreducible_entry()) { >> 5647: assert(!VerifyNoNewIrreducibleLoops, "A new irreducible loop was created after parsing."); >> 5648: C->record_method_not_compilable("A new irreducible loop was created after parsing."); > > If you haven't done that yet, I would suggest hardcoding these bailouts to "always bail out" and running testing to check if the bailout always works. You'll of course get all kinds of test failures but the VM should not crash/assert (you can filter for these in the test results and ignore anything else). As discussed offline: There is another bailout below, for the case of irreducible loops that are also infinite. And that one also triggers regularly, and as far as I remember even with the fuzzer. So I'd say the bailout path is sufficiently covered. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23363#discussion_r1942769472 From thartmann at openjdk.org Wed Feb 5 12:18:10 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 5 Feb 2025 12:18:10 GMT Subject: RFR: 8348572: C2 compilation asserts due to unexpected irreducible loop [v2] In-Reply-To: References: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> <_3Sc6qth_bahQb2eOsLLf0mb9ATrPHGwM6GedAOAUyU=.bc9fc68f-d573-48af-a4ab-f9260422c460@github.com> Message-ID: On Wed, 5 Feb 2025 12:13:18 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopnode.cpp line 5648: >> >>> 5646: if (!head->can_be_irreducible_entry()) { >>> 5647: assert(!VerifyNoNewIrreducibleLoops, "A new irreducible loop was created after parsing."); >>> 5648: C->record_method_not_compilable("A new irreducible loop was created after parsing."); >> >> If you haven't done that yet, I would suggest hardcoding these bailouts to "always bail out" and running testing to check if the bailout always works. You'll of course get all kinds of test failures but the VM should not crash/assert (you can filter for these in the test results and ignore anything else). > > As discussed offline: > There is another bailout below, for the case of irreducible loops that are also infinite. And that one also triggers regularly, and as far as I remember even with the fuzzer. So I'd say the bailout path is sufficiently covered. Yes, makes sense. Thanks for checking. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23363#discussion_r1942771243 From epeter at openjdk.org Wed Feb 5 12:18:07 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 12:18:07 GMT Subject: RFR: 8348572: C2 compilation asserts due to unexpected irreducible loop [v3] In-Reply-To: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> References: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> Message-ID: > A quick summary: > - In [JDK-8280126](https://bugs.openjdk.org/browse/JDK-8280126), we decided that we are only going to allow irreducible loops that were detected at parsing, and we can thus restrict optimizations to reducible loops which would be difficult to do correct with irreducible loops. That's why we added that assert that checks that no new irreducible loop shows up during compilation. > - Problem: we use `split_if` for `IfNode::Ideal_common` to split through a Region that is loop-head, and the splitting of the Region introduces a second loop entry -> irreducible loop. > > Before `split_if`: > ![image](https://github.com/user-attachments/assets/01bc78fa-7fed-4a8f-b6f4-078dac9b5dc4) > > After `split_if`: > ![image](https://github.com/user-attachments/assets/1e3bd08e-b76d-4e7f-813e-27a5a22cb2bd) > > > - We have the `split_if` for `IfNode::Ideal_common` to do split-if on straight-line code. But we currently execute this before loop-opts, and so we don't know if the region we split through is actually a loop head. We guard against LoopNode, but a Region only becomes a LoopNode in loop-opts. > - We also have split-if in loop-opts, which is more careful about splitting through loop-heads. > - Just removing the straight-line split-if probably leads to a regression, as the loop-opts version only executes if there are loops for example. > - We could consider delaying the straight-line split-if until after loop-opts. 
But I don't know if that could lead to regressions in any way. > > I discussed this temporary solution with @TobiHartmann : > - We would like [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) to be unblocked for @shipilev . > - Convert the assert into a bailout-check, so we are sure we behave correctly in product. Compiling with irreducible loops behaves correctly in almost all cases, but there could be exceptions. > - For now, have the assert behind a Verify flag, so that [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) is unblocked. Later, we can remove the Verify flag and alway enable the assert again. > - This fix also looks easier to backport. > > ----------------------- > > The attached regression test now does **NOT** fail by default, but rather silently bails out of compilation. > > With the new debug flag `-XX:+VerifyNoNewIrreducibleLoops`, we still hit the assert, as expected: > > # Internal Error (/oracle-work/jdk-fork0/open/src/hotspot/share/opto/loopnode.cpp:5636), pid=3698055, tid=3698072 > # assert(!VerifyNoNewIrreducibleLoops) failed: A new irreducible lo... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/loopopts/TestSplitIfNewIrreducibleLoop.java Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23363/files - new: https://git.openjdk.org/jdk/pull/23363/files/352ebb91..209360df Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23363&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23363&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23363.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23363/head:pull/23363 PR: https://git.openjdk.org/jdk/pull/23363 From rcastanedalo at openjdk.org Wed Feb 5 12:38:06 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 5 Feb 2025 12:38:06 GMT Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v4] In-Reply-To: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> Message-ID: <-aHCYC9iVc4eMZ3pMfiDpqaW-wGM_s3zRMiVBWoadCM=.910336cd-3be2-45b5-9874-63b71abf38f8@github.com> > G1 barriers can be safely elided from writes to newly allocated objects as long as no safepoint is taken between the allocation and the write. This changeset complements early G1 barrier elision (performed by the platform-independent phases of C2, and limited to writes immediately following allocations) with a more general elision pass done at a late stage. > > The late elision pass exploits that it runs at a stage where the relative order of memory accesses and safepoints cannot change anymore to elide barriers from initialization writes that do not immediately follow the corresponding allocation, e.g. in conditional initialization writes: > > > o = new MyObject(); > if (...) 
{ > o.myField = ...; // barrier elided only after this changeset > // (assuming no safepoint in the if condition) > } > > > or in initialization writes placed after exception-throwing checks: > > > o = new MyObject(); > if (...) { > throw new Exception(""); > } > o.myField = ...; // barrier elided only after this changeset > // (assuming no safepoint in the above if condition) > > > These patterns are commonly found in Java code, e.g. in the core libraries: > > - [conditional initialization](https://github.com/openjdk/jdk/blob/25fecaaf87400af535c242fe50296f1f89ceeb16/src/java.base/share/classes/java/lang/String.java#L4850), or > > - [initialization after exception-throwing checks (in the superclass constructor)](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/nio/X-Buffer.java.template#L324). > > The optimization also enhances barrier elision for array initialization writes, for example eliding barriers from small array initialization loops (for which safepoints are not inserted): > > > Object[] a = new Object[...]; > for (int i = 0; i < a.length; i++) { > a[i] = ...; // barrier elided only after this changeset > } > > > or eliding barriers from array initialization writes with unknown array index: > > > Object[] a = new Object[...]; > a[index] = ...; // barrier elided only after this changeset > > > The logic used to perform this additional barrier elision is a subset of a pre-existing ZGC-specific optimization. This changeset simply reuses the relevant subset (barrier elision for writes to newly-allocated objects) by extracting the core of the optimization logic from `zBarrierSetC2.cpp` into the GC-shared file `barrierSetC2.cpp`. The functions `block_has_safepoint`, `block_index`, `look_through_node`, `is_{undefined|unknown|concrete}`, `get_base_and_offset`, `is_array... 
Roberto Castañeda Lozano has updated the pull request incrementally with two additional commits since the last revision: - Add some more tests to exercise barrier elision for atomic operations - Elide barriers from atomic operations on newly allocated objects as well ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23235/files - new: https://git.openjdk.org/jdk/pull/23235/files/3d154fa8..621a61cf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23235&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23235&range=02-03 Stats: 174 lines in 2 files changed: 167 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/23235.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23235/head:pull/23235 PR: https://git.openjdk.org/jdk/pull/23235 From fyang at openjdk.org Wed Feb 5 12:42:14 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 5 Feb 2025 12:42:14 GMT Subject: RFR: 8349428: RISC-V: "bad alignment" with -XX:-AvoidUnalignedAccesses after JDK-8347489 [v2] In-Reply-To: <3UFzISL6AR_wdZlxWIoBYohTg6Qa0Bgremw2cNHQ6Cg=.a2b7841c-76a0-4029-88e8-7e7095bae8d8@github.com> References: <3UFzISL6AR_wdZlxWIoBYohTg6Qa0Bgremw2cNHQ6Cg=.a2b7841c-76a0-4029-88e8-7e7095bae8d8@github.com> Message-ID: On Wed, 5 Feb 2025 10:36:52 GMT, Gui Cao wrote: >> Hi, please review this small change fixing an assertion error. >> As the alignment of the loading addresses is only ensured under -XX:-AvoidUnalignedAccesses, we should only enable the related assertions about the alignment under this option. >> >> >> ### Testing >> - [x] Sanity tested with -XX:-AvoidUnalignedAccesses using fastdebug build. >> - [ ] Run tier1 tests on SOPHON SG2042 (fastdebug) >> - [ ] Run tier1-3 tests on Milk-V Megrez (release) > > Gui Cao has updated the pull request incrementally with one additional commit since the last revision: > > Fix build Looks good. Thanks! ------------- Marked as reviewed by fyang (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23459#pullrequestreview-2595631970 From rcastanedalo at openjdk.org Wed Feb 5 12:42:15 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 5 Feb 2025 12:42:15 GMT Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v2] In-Reply-To: References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> Message-ID: On Fri, 31 Jan 2025 14:06:16 GMT, Roberto Castañeda Lozano wrote: > > One question about elision for atomics. > > Otherwise it seems good afaict, although a large part was checking that the code movement is/was correct. > > Thanks for reviewing Thomas! Please let me know whether you want me to extend this changeset to elide barriers on atomic operations (happy to do so). @tschatzl I have now extended the changeset to also elide barriers on atomic operations, as discussed offline. Please have a look again. @offamitkumar @TheRealMDoerr @RealFYang @snazarkin you might want to re-test the changeset on your respective platforms. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23235#issuecomment-2636669816 From mablakatov at openjdk.org Wed Feb 5 12:43:11 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Wed, 5 Feb 2025 12:43:11 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 18:48:38 GMT, Emanuel Peter wrote: > This could also be a relevant Benchmark: ./test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java Thank you for pointing this out. I didn't take the effect on auto-vectorization into consideration, though I should have. I'll get back to you on this shortly. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-2636684622 From epeter at openjdk.org Wed Feb 5 13:01:19 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 13:01:19 GMT Subject: RFR: 8348572: C2 compilation asserts due to unexpected irreducible loop [v3] In-Reply-To: References: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> Message-ID: On Wed, 5 Feb 2025 12:14:15 GMT, Tobias Hartmann wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/hotspot/jtreg/compiler/loopopts/TestSplitIfNewIrreducibleLoop.java >> >> Co-authored-by: Tobias Hartmann > > Marked as reviewed by thartmann (Reviewer). @TobiHartmann @vnkozlov thanks for the reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23363#issuecomment-2636790177 From epeter at openjdk.org Wed Feb 5 13:01:21 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 13:01:21 GMT Subject: Integrated: 8348572: C2 compilation asserts due to unexpected irreducible loop In-Reply-To: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> References: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> Message-ID: On Thu, 30 Jan 2025 10:03:49 GMT, Emanuel Peter wrote: > A quick summary: > - In [JDK-8280126](https://bugs.openjdk.org/browse/JDK-8280126), we decided that we are only going to allow irreducible loops that were detected at parsing, and we can thus restrict optimizations to reducible loops which would be difficult to do correct with irreducible loops. That's why we added that assert that checks that no new irreducible loop shows up during compilation. 
> - Problem: we use `split_if` for `IfNode::Ideal_common` to split through a Region that is loop-head, and the splitting of the Region introduces a second loop entry -> irreducible loop. > > Before `split_if`: > ![image](https://github.com/user-attachments/assets/01bc78fa-7fed-4a8f-b6f4-078dac9b5dc4) > > After `split_if`: > ![image](https://github.com/user-attachments/assets/1e3bd08e-b76d-4e7f-813e-27a5a22cb2bd) > > > - We have the `split_if` for `IfNode::Ideal_common` to do split-if on straight-line code. But we currently execute this before loop-opts, and so we don't know if the region we split through is actually a loop head. We guard against LoopNode, but a Region only becomes a LoopNode in loop-opts. > - We also have split-if in loop-opts, which is more careful about splitting through loop-heads. > - Just removing the straight-line split-if probably leads to a regression, as the loop-opts version only executes if there are loops for example. > - We could consider delaying the straight-line split-if until after loop-opts. But I don't know if that could lead to regressions in any way. > > I discussed this temporary solution with @TobiHartmann : > - We would like [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) to be unblocked for @shipilev . > - Convert the assert into a bailout-check, so we are sure we behave correctly in product. Compiling with irreducible loops behaves correctly in almost all cases, but there could be exceptions. > - For now, have the assert behind a Verify flag, so that [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) is unblocked. Later, we can remove the Verify flag and alway enable the assert again. > - This fix also looks easier to backport. > > ----------------------- > > The attached regression test now does **NOT** fail by default, but rather silently bails out of compilation. 
> > With the new debug flag `-XX:+VerifyNoNewIrreducibleLoops`, we still hit the assert, as expected: > > # Internal Error (/oracle-work/jdk-fork0/open/src/hotspot/share/opto/loopnode.cpp:5636), pid=3698055, tid=3698072 > # assert(!VerifyNoNewIrreducibleLoops) failed: A new irreducible lo... This pull request has now been integrated. Changeset: 19399d27 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/19399d271ef00f925232fbbe9087b5772f2fca01 Stats: 115 lines in 5 files changed: 107 ins; 1 del; 7 mod 8348572: C2 compilation asserts due to unexpected irreducible loop Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/23363 From epeter at openjdk.org Wed Feb 5 13:15:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 13:15:15 GMT Subject: RFR: 8346777: Add missing const declarations and rename variables [v2] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 08:26:42 GMT, Christian Hagedorn wrote: >> This simple patch adds some missing `const` and applies variable renamings and parameter reorderings >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > update Looks good :) Nice and simple! ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23434#pullrequestreview-2595744443 From roland at openjdk.org Wed Feb 5 13:57:22 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 5 Feb 2025 13:57:22 GMT Subject: RFR: 8333697: C2: Hit MemLimit in PhaseCFG::global_code_motion [v2] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 12:01:46 GMT, Tobias Hartmann wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: >> >> - review >> - Merge branch 'master' into JDK-8333697 >> - fix > > All clean. @TobiHartmann @vnkozlov thanks for the reviews and testing ------------- PR Comment: https://git.openjdk.org/jdk/pull/23075#issuecomment-2636915041 From roland at openjdk.org Wed Feb 5 13:57:23 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 5 Feb 2025 13:57:23 GMT Subject: Integrated: 8333697: C2: Hit MemLimit in PhaseCFG::global_code_motion In-Reply-To: References: Message-ID: On Mon, 13 Jan 2025 14:49:10 GMT, Roland Westrelin wrote: > I investigated the failure from the `Test.java` that's attached to the > bug. The failure with this test is only reproducible up to 8334060 > (Implementation of Late Barrier Expansion for G1) so the experiments I > describe here are from the source code for the commit right before it. > > Peak malloc memory usage reported by NMT is: 1.3GB > > `PhaseCFG::global_code_motion()`, when `OptoRegScheduling` is true, > creates a `PhaseIFG` that, when initialized, allocates `_adjs`: a > `maxlrg` array of `IndexSet`s that can each contain up to `maxlrg` elements. > > `maxlrg` in this case is 122839. An `IndexSet` is an array of pointers > to a 256 bit bitset: one `IndexSet` array needs: > > > 122839 / 256 * 8 = 3832 > > > and there are 122839 of them: > > > 3832 * 122839 = ~470 MB > > > It turns out the `PhaseIFG` object when used from > `PhaseCFG::global_code_motion()` doesn't even use the `_adjs` > array. So a patch like: > > > diff --git a/src/hotspot/share/opto/chaitin.hpp b/src/hotspot/share/opto/chaitin.hpp > index cf02deb6019..4e5333bf181 100644 > --- a/src/hotspot/share/opto/chaitin.hpp > +++ b/src/hotspot/share/opto/chaitin.hpp > @@ -258,7 +258,7 @@ class PhaseIFG : public Phase { > VectorSet *_yanked; > > PhaseIFG( Arena *arena ); > - void init( uint maxlrg ); > + void init( uint maxlrg, bool no_adjs = false ); > > // Add edge between a and b. Returns true if actually added. 
> int add_edge( uint a, uint b ); > diff --git a/src/hotspot/share/opto/gcm.cpp b/src/hotspot/share/opto/gcm.cpp > index ebdefe597ff..fefd75a88c5 100644 > --- a/src/hotspot/share/opto/gcm.cpp > +++ b/src/hotspot/share/opto/gcm.cpp > @@ -1704,7 +1704,9 @@ void PhaseCFG::global_code_motion() { > rm_live.reset_to_mark(); // Reclaim working storage > IndexSet::reset_memory(C, &live_arena); > uint node_size = regalloc._lrg_map.max_lrg_id(); > - ifg.init(node_size); // Empty IFG > + ifg.init(node_size, true); // Empty IFG > regalloc.set_ifg(ifg); > regalloc.set_live(live); > regalloc.gather_lrg_masks(false); // Collect LRG masks > diff --git a/src/hotspot/share/opto/ifg.cpp b/src/hotspot/share/opto/ifg.cpp > index d12698121b9..e42121c2254 100644 > --- a/src/hotspot/share/opto/ifg.cpp > +++ b/src/hotspot/share/opto/ifg.cpp > @@ -42,18 +42,24 @@ > PhaseIFG::PhaseIFG( Arena *arena ) : Phase(Interference_Graph), _arena(arena) { > } > > -void PhaseIFG::init( uint maxlrg ) { > +void PhaseIFG::init( uint maxlrg, bool no_adjs ) { > _maxlrg = maxlrg; > _yanked = new (_arena) VectorSet(_arena); > _is_square = false; > // Make uninitialized adjacency lists > - ... This pull request has now been integrated. 
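For reference, the `_adjs` footprint estimate quoted above can be reproduced with a few lines of Java. This is only a sketch of the arithmetic as stated in the quoted description: the class, method, and constant names are hypothetical, the 8-byte pointer size and the 256-bit block size are taken from the quoted text, and only the per-`IndexSet` pointer arrays are counted (not the bitsets they point to).

```java
// Back-of-the-envelope model of the _adjs footprint from the quoted description.
// Hypothetical names; the constants mirror the numbers given in the mail.
final class AdjsFootprintSketch {
    static final int BITS_PER_BLOCK = 256; // one bitset block covers 256 indices
    static final int POINTER_BYTES = 8;    // assumed 64-bit pointers

    // Bytes needed for one IndexSet's array of block pointers
    // (integer division, as in the quoted calculation).
    static long bytesPerIndexSet(int maxlrg) {
        return (long) (maxlrg / BITS_PER_BLOCK) * POINTER_BYTES;
    }

    // Bytes needed for maxlrg such arrays.
    static long totalBytes(int maxlrg) {
        return bytesPerIndexSet(maxlrg) * maxlrg;
    }

    public static void main(String[] args) {
        int maxlrg = 122839;
        System.out.println(bytesPerIndexSet(maxlrg) + " bytes per IndexSet");
        System.out.println(totalBytes(maxlrg) / 1_000_000 + " MB in total (decimal)");
    }
}
```

The total grows quadratically with `maxlrg`, which is why skipping the allocation matters for methods with very many live ranges.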
Changeset: 6b994cd8 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/6b994cd8ccba4f5d0199cb2925f0a6b5450ac115 Stats: 61 lines in 3 files changed: 37 ins; 15 del; 9 mod 8333697: C2: Hit MemLimit in PhaseCFG::global_code_motion Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/23075 From roland at openjdk.org Wed Feb 5 13:58:14 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 5 Feb 2025 13:58:14 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v9] In-Reply-To: <4NyjmA6xaOMLUYbMqPXgkxZnhtBVj3feMH2Z4wjum5k=.9f70f7ec-a782-43c5-9350-c9e10bd1d3ea@github.com> References: <4NyjmA6xaOMLUYbMqPXgkxZnhtBVj3feMH2Z4wjum5k=.9f70f7ec-a782-43c5-9350-c9e10bd1d3ea@github.com> Message-ID: On Wed, 5 Feb 2025 06:05:44 GMT, Tobias Hartmann wrote: > Maybe it's (related to) [JDK-8341976](https://bugs.openjdk.org/browse/JDK-8341976)? It should be the same. It showed in a previous round of testing and I wrote the test case `JDK-8341976` once I understood the bug was unrelated to this change. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-2636920197 From roland at openjdk.org Wed Feb 5 14:11:25 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 5 Feb 2025 14:11:25 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v2] In-Reply-To: References: Message-ID: > This change refactors `RShiftI`/`RshiftL` `Ideal`, `Identity` and > `Value` because the `int` and `long` versions are very similar and so > there's no logic duplication. In the process, support for some extra > transformations is added to `RShiftL`. I also added some new test > cases. 
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23438/files - new: https://git.openjdk.org/jdk/pull/23438/files/806eb20f..a1225f74 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=00-01 Stats: 6 lines in 2 files changed: 2 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23438.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23438/head:pull/23438 PR: https://git.openjdk.org/jdk/pull/23438 From epeter at openjdk.org Wed Feb 5 14:35:28 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 14:35:28 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v13] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 12:29:34 GMT, Daniel Lundén wrote: >> When searching for load anti dependences in GCM, it is not always sufficient to just search starting at the direct initial memory input to the load. Specifically, there are cases when we must also search for anti dependences starting at relevant Phi memory nodes in between the load's early block and the initial memory input's block. Here, "in between" refers to blocks in the dominator tree in between the early and initial memory blocks. >> >> #### Example 1 >> >> Consider the ideal graph below. The initial memory for 183 loadI is 107 Phi and there is an important anti dependency for node 64 membar_release. To discover this anti dependency, we must rather search from 119 Phi which contains overlapping memory slices with 107 Phi. Looking at the ideal graph block view, we see that both 107 Phi and 119 Phi are in the initial memory block (B7) and thus dominate the early block (B20).
If we only search from 107 Phi, we fail to add the anti dependency to 64 membar_release and do not force the load to schedule before 64 membar_release as we should. In the block view, we see that the load is actually scheduled in B24 _after_ a number of anti-dependent stores, the first of which is in block B20 (corresponding to the anti dependency on 64 membar_release). The result is the failure we see in this issue (we load the wrong value). >> >> ![failure-graph-1](https://github.com/user-attachments/assets/e5458646-7a5c-40e1-b1d8-e3f101e29b73) >> ![failure-blocks-1](https://github.com/user-attachments/assets/a0b1f724-0809-4b2f-9feb-93e9c59a5d6a) >> >> #### Example 2 >> >> There are also situations when we need to start searching from Phis that are strictly in between the initial memory block and early block. Consider the ideal graph below. The initial memory for 100 loadI is 18 MachProj, but we also need to search from 76 Phi to find that we must raise the LCA to the last block on the path between 76 Phi and 75 Phi: B9 (= the load's early block). If we do not search from 76 Phi, the load is again likely scheduled too late (in B11 in the example) after anti-dependent stores (the first of which corresponds to 58 membar_release in B10). Note that the block B6 for 76 Phi is strictly dominated by the initial memory block B2 and also strictly dominates the early block B9. >> >> ![failure-graph-2](https://github.com/user-attachments/assets/ede0c299-6251-4ff8-8b84-af40a1ee9e8c) >> ![failure-blocks-2](https://github.com/user-attachments/assets/e5a87e43-b6fe-4fa3-8961-54752f63633e) >> >> ### Cha... > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/compiler/codegen/TestGCMLoadPlacement.java > > Co-authored-by: Christian Hagedorn src/hotspot/share/opto/gcm.cpp line 831: > 829: // | 8 membar_release <- 7 | early > 830: // | ...
| > 831: // +-----------------------+ I just discussed this example with @chhagedorn // Patch the existing phi to select an input from the merge: // Phi:AT1(...MergeMem(m0, m1, m2)...) into // Phi:AT1(...m1...) int alias_idx = phase->C->get_alias_index(at); And // Phi(...MergeMem(m0, m1:AT1, m2:AT2)...) into // MergeMem(Phi(...m0...), Phi:AT1(...m1...), Phi:AT2(...m2...)) In `cfgnode.cpp`, we try to move the MergeMem after Phi. Why does this not happen in this example? There are many cases in that code... but it seems to me that here something may be missing. I have not given it more time though. If we knew that MergeMem always happened after the Phi, then we could only search from the `initial_mem`, and would walk through all relevant MergeMem, right? This is just an intuition, but maybe having MergeMem after Phi is a fundamental assumption. Or maybe it just happens in all cases, and yours is the only we found so far where that is not possible. What do you think? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22852#discussion_r1943046842 From epeter at openjdk.org Wed Feb 5 14:35:28 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 14:35:28 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v13] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 14:29:09 GMT, Emanuel Peter wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/hotspot/jtreg/compiler/codegen/TestGCMLoadPlacement.java >> >> Co-authored-by: Christian Hagedorn > > src/hotspot/share/opto/gcm.cpp line 831: > >> 829: // | 8 membar_release <- 7 | early >> 830: // | ... | >> 831: // +-----------------------+ > > I just discussed this example with @chhagedorn > > // Patch the existing phi to select an input from the merge: > // Phi:AT1(...MergeMem(m0, m1, m2)...) into > // Phi:AT1(...m1...)
> int alias_idx = phase->C->get_alias_index(at); > > And > > // Phi(...MergeMem(m0, m1:AT1, m2:AT2)...) into > // MergeMem(Phi(...m0...), Phi:AT1(...m1...), Phi:AT2(...m2...)) > > > In `cfgnode.cpp`, we try to move the MergeMem after Phi. Why does this not happen in this example? > > There are many cases in that code... but it seems to me that here something may be missing. I have not given it more time though. > > If we knew that MergeMem always happened after the Phi, then we could only search from the `initial_mem`, and would walk through all relevant MergeMem, right? > > This is just an intuition, but maybe having MergeMem after Phi is a fundamental assumption. Or maybe it just happens in all cases, and yours is the only we found so far where that is not possible. > > What do you think? @dlunde I really don't want to block you here. I never understood the memory graph above the initial mem. Now that I see the example I'm getting new ideas ? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22852#discussion_r1943052008 From chagedorn at openjdk.org Wed Feb 5 14:35:21 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 5 Feb 2025 14:35:21 GMT Subject: RFR: 8346777: Add missing const declarations and rename variables [v2] In-Reply-To: References: Message-ID: <-zFcMLfsYDwGBn-zWqZ0AxS481mZRpz_wKS6bMDc9JA=.b7ffdfce-b773-42b7-9507-7609053de66b@github.com> On Wed, 5 Feb 2025 08:26:42 GMT, Christian Hagedorn wrote: >> This simple patch adds some missing `const` and applies variable renamings and parameter reorderings >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > update Thanks Emanuel, indeed! 
:-) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23434#issuecomment-2637014955 From mdoerr at openjdk.org Wed Feb 5 15:09:17 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 5 Feb 2025 15:09:17 GMT Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v4] In-Reply-To: <-aHCYC9iVc4eMZ3pMfiDpqaW-wGM_s3zRMiVBWoadCM=.910336cd-3be2-45b5-9874-63b71abf38f8@github.com> References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> <-aHCYC9iVc4eMZ3pMfiDpqaW-wGM_s3zRMiVBWoadCM=.910336cd-3be2-45b5-9874-63b71abf38f8@github.com> Message-ID: On Wed, 5 Feb 2025 12:38:06 GMT, Roberto Castañeda Lozano wrote: >> G1 barriers can be safely elided from writes to newly allocated objects as long as no safepoint is taken between the allocation and the write. This changeset complements early G1 barrier elision (performed by the platform-independent phases of C2, and limited to writes immediately following allocations) with a more general elision pass done at a late stage. >> >> The late elision pass exploits that it runs at a stage where the relative order of memory accesses and safepoints cannot change anymore to elide barriers from initialization writes that do not immediately follow the corresponding allocation, e.g. in conditional initialization writes: >> >> >> o = new MyObject(); >> if (...) { >> o.myField = ...; // barrier elided only after this changeset >> // (assuming no safepoint in the if condition) >> } >> >> >> or in initialization writes placed after exception-throwing checks: >> >> >> o = new MyObject(); >> if (...) { >> throw new Exception(""); >> } >> o.myField = ...; // barrier elided only after this changeset >> // (assuming no safepoint in the above if condition) >> >> >> These patterns are commonly found in Java code, e.g.
in the core libraries: >> >> - [conditional initialization](https://github.com/openjdk/jdk/blob/25fecaaf87400af535c242fe50296f1f89ceeb16/src/java.base/share/classes/java/lang/String.java#L4850), or >> >> - [initialization after exception-throwing checks (in the superclass constructor)](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/nio/X-Buffer.java.template#L324). >> >> The optimization also enhances barrier elision for array initialization writes, for example eliding barriers from small array initialization loops (for which safepoints are not inserted): >> >> >> Object[] a = new Object[...]; >> for (int i = 0; i < a.length; i++) { >> a[i] = ...; // barrier elided only after this changeset >> } >> >> >> or eliding barriers from array initialization writes with unknown array index: >> >> >> Object[] a = new Object[...]; >> a[index] = ...; // barrier elided only after this changeset >> >> >> The logic used to perform this additional barrier elision is a subset of a pre-existing ZGC-specific optimization. This changeset simply reuses the relevant subset (barrier elision for writes to newly-allocated objects) by extracting the core of the optimization logic from `zBarrierSetC2.cpp` into the GC-shared file `barrierSetC2.cpp`. The functions `block_has_safepoint`, `block_inde... > > Roberto Castañeda Lozano has updated the pull request incrementally with two additional commits since the last revision: > > - Add some more tests to exercise barrier elision for atomic operations > - Elide barriers from atomic operations on newly allocated objects as well LGTM. TestG1BarrierGeneration.java has passed on ppc64le. I'll run more tests. Please remember to update the Copyright headers.
------------- PR Review: https://git.openjdk.org/jdk/pull/23235#pullrequestreview-2596076238 From epeter at openjdk.org Wed Feb 5 15:19:10 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 15:19:10 GMT Subject: RFR: 8348570: CTW: Expose the code hidden by uncommon traps [v2] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 19:34:40 GMT, Aleksey Shipilev wrote: >>> This is the fix (band-aid) to unlock us here. #23363 >> >> Ack. Retesting. > >> > This is the fix (band-aid) to unlock us here. #23363 >> >> Ack. Retesting. > > Nevermind. I thought it was integrated. I'll wait some more :) @shipilev So, we got https://github.com/openjdk/jdk/pull/23363 integrated :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23296#issuecomment-2637206292 From shade at openjdk.org Wed Feb 5 15:23:11 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 5 Feb 2025 15:23:11 GMT Subject: RFR: 8348570: CTW: Expose the code hidden by uncommon traps [v3] In-Reply-To: References: Message-ID: <0SH6l39by6p14bBZKtLgoXvMX9aeWZB8IfEYiwxoLe0=.345b828c-0b73-45d2-86d5-c76ed0f432aa@github.com> On Tue, 4 Feb 2025 19:35:12 GMT, Aleksey Shipilev wrote: >> We have been looking at some related compiler behaviors, and realized that in the absence of profiling data, C2 routinely uncommon-traps a lot of code that is presumed to be never executed. This apparently is a norm in CTW tests: CTW runners never execute code, and so only the most basic java.base classes are having any profile. This seems to limit the scope of CTW testing. >> >> I think we need to run CTW in the mode that exposes more code to the compiler optimizations. >> >> Case in point: [JDK-8348572](https://bugs.openjdk.org/browse/JDK-8348572), which reliably fails with more aggressive compilation mode. 
>> >> Additional testing: >> - [ ] Linux x86-64 server fastdebug, `applications/ctw/modules` >> - [ ] Linux AArch64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8348570-ctw-uncommon-traps > - Also do markMethodProfiled for extra scope > - Fix Yay. I have remerged `master` here, and `applications/ctw/modules` seem to pass at least on my desktop. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23296#issuecomment-2637219947 From shade at openjdk.org Wed Feb 5 15:23:07 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 5 Feb 2025 15:23:07 GMT Subject: RFR: 8348570: CTW: Expose the code hidden by uncommon traps [v4] In-Reply-To: References: Message-ID: > We have been looking at some related compiler behaviors, and realized that in the absence of profiling data, C2 routinely uncommon-traps a lot of code that is presumed to be never executed. This apparently is a norm in CTW tests: CTW runners never execute code, and so only the most basic java.base classes are having any profile. This seems to limit the scope of CTW testing. > > I think we need to run CTW in the mode that exposes more code to the compiler optimizations. > > Case in point: [JDK-8348572](https://bugs.openjdk.org/browse/JDK-8348572), which reliably fails with more aggressive compilation mode. > > Additional testing: > - [ ] Linux x86-64 server fastdebug, `applications/ctw/modules` > - [ ] Linux AArch64 server fastdebug, `applications/ctw/modules` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains four additional commits since the last revision: - Merge branch 'master' into JDK-8348570-ctw-uncommon-traps - Merge branch 'master' into JDK-8348570-ctw-uncommon-traps - Also do markMethodProfiled for extra scope - Fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23296/files - new: https://git.openjdk.org/jdk/pull/23296/files/78fa50c4..c6d1ff12 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23296&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23296&range=02-03 Stats: 2931 lines in 289 files changed: 1803 ins; 528 del; 600 mod Patch: https://git.openjdk.org/jdk/pull/23296.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23296/head:pull/23296 PR: https://git.openjdk.org/jdk/pull/23296 From jkarthikeyan at openjdk.org Wed Feb 5 15:25:12 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 5 Feb 2025 15:25:12 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts In-Reply-To: <7YOG7L0dZP0iEJzDPMW62pv8QYaZ8dJbfn3FbK5T-Vw=.02a740fe-31a6-484f-ad82-412cdca80155@github.com> References: <7YOG7L0dZP0iEJzDPMW62pv8QYaZ8dJbfn3FbK5T-Vw=.02a740fe-31a6-484f-ad82-412cdca80155@github.com> Message-ID: On Wed, 5 Feb 2025 09:25:07 GMT, Emanuel Peter wrote: >> Originally I thought that larger type conversions were only available on AVX-512 based on the ad-file, but reading it more carefully now I see that they are indeed supported for AVX! I'll go ahead and generalize the changes further to make casting to larger types supported as well. I think this leaves `char` as the only type not supported, which I can look at in a follow-up RFE. > > @jaskarth @eme64 I took a look last night and it looks like supporting casting to larger types (other than `char`) just involves changing the filtering logic in `matcher_x86.hpp` and adding more unit tests, without modifying any of the core logic. I think it could make sense to keep it in the same patch, what do you think? 
Either way would work with me. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1943166874 From epeter at openjdk.org Wed Feb 5 15:36:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 15:36:12 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts In-Reply-To: References: <7YOG7L0dZP0iEJzDPMW62pv8QYaZ8dJbfn3FbK5T-Vw=.02a740fe-31a6-484f-ad82-412cdca80155@github.com> Message-ID: On Wed, 5 Feb 2025 15:22:05 GMT, Jasmine Karthikeyan wrote: >> @jaskarth > > @eme64 I took a look last night and it looks like supporting casting to larger types (other than `char`) just involves changing the filtering logic in `matcher_x86.hpp` and adding more unit tests, without modifying any of the core logic. I think it could make sense to keep it in the same patch, what do you think? Either way would work with me. I leave it up to you. If it's not much then pack it in the same, else splitting is fine with me too ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1943188169 From roland at openjdk.org Wed Feb 5 15:41:15 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 5 Feb 2025 15:41:15 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 06:44:47 GMT, Tobias Hartmann wrote: > Fails to build on Mac AArch64: > > ``` > [2025-02-05T06:43:04,925Z] * For target hotspot_variant-server_libjvm_objs_mulnode.o: > [2025-02-05T06:43:04,925Z] [...]workspace/open/src/hotspot/share/opto/mulnode.cpp:1400:13: error: use of bitwise '&' with boolean operands [-Werror,-Wbitwise-instead-of-logical] > [2025-02-05T06:43:04,925Z] assert((checked_cast(lo) == lo_verify) & (checked_cast(hi) == hi_verify), "inconsistent"); > ``` Thanks for the report. Should be fixed now. I also took @eme64's comments into account. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23438#issuecomment-2637271864 From roland at openjdk.org Wed Feb 5 15:43:25 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 5 Feb 2025 15:43:25 GMT Subject: RFR: 8341976: C2: use_mem_state != load->find_exact_control(load->in(0)) assert failure Message-ID: The `arraycopy` writes to a non-escaping array so its `ArrayCopy` node is marked as having a narrow memory effect. One of the loads from the destination after the copy is transformed into a load from the source array (the rationale being that if there's no load from the destination of the copy, the `arraycopy` is not needed). The load from the source has the input memory state of the `ArrayCopy` as memory input. That load is then sunk out of the loop and its control is updated to be after the `ArrayCopy`. That's legal because the `ArrayCopy` only has a narrow memory effect and can't modify the source. The `ArrayCopy` can't be eliminated and is expanded. In the process, a `MemBar` that has a wide memory effect is added. The load from the source has control after the membar but memory state before it, and because the membar has a wide memory effect, the load is anti-dependent on the membar: the graph is broken (the load can't be both pinned after the membar and anti-dependent on it). In short, the problem is that the graph is transformed under the assumption that the `ArrayCopy` has a narrow effect but the `ArrayCopy` is expanded to a subgraph that has a wide memory effect. The fix I propose is to not insert a membar with a wide memory effect. We still need a membar when the destination is non-escaping because the expanded `ArrayCopy`, if it writes to a tightly allocated array, writes to raw memory and not to the destination memory slice.
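
For readers who want a concrete picture of the source-level shape discussed above, here is a minimal Java sketch. It is not the test case from the PR: the class name `ArrayCopyNonEscaping` and the method `copyThenLoad` are hypothetical, and whether C2 actually performs the transformations described (narrow memory effect, load redirected to the source) depends on inlining, escape analysis and the surrounding code. The sketch only shows an `arraycopy` into a destination that never leaves the method, followed by a load from that destination.

```java
public class ArrayCopyNonEscaping {
    // Hypothetical illustration: dst is allocated here and never escapes,
    // so the copy only needs to be visible to loads from dst itself.
    static int copyThenLoad(int[] src, int index) {
        int[] dst = new int[src.length];
        System.arraycopy(src, 0, dst, 0, src.length);
        // A load from the destination right after the copy; it observes
        // the same value as a load of src[index] would.
        return dst[index];
    }

    public static void main(String[] args) {
        int[] src = {3, 1, 4, 1, 5};
        long sum = 0;
        // Call the method many times so that a JIT compiler gets a chance
        // to compile it; the result is the same either way.
        for (int iter = 0; iter < 20_000; iter++) {
            sum += copyThenLoad(src, iter % src.length);
        }
        System.out.println(sum);
    }
}
```

The observable result must be the same whether the load reads `dst[index]` or is rewritten to read `src[index]`, which is why the rewrite is legal at the Java level.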
------------- Commit messages: - whitespace - fix & test Changes: https://git.openjdk.org/jdk/pull/23465/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23465&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8341976 Stats: 89 lines in 3 files changed: 82 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/23465.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23465/head:pull/23465 PR: https://git.openjdk.org/jdk/pull/23465 From shade at openjdk.org Wed Feb 5 15:52:14 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 5 Feb 2025 15:52:14 GMT Subject: RFR: 8348570: CTW: Expose the code hidden by uncommon traps [v4] In-Reply-To: References: Message-ID: <6iaDfKTzIXQQuQhyX_sNP-fgbEhzej578ZOfbqLOLDg=.e60c00e8-b92f-4815-aa92-9804201aa843@github.com> On Wed, 5 Feb 2025 15:23:07 GMT, Aleksey Shipilev wrote: >> We have been looking at some related compiler behaviors, and realized that in the absence of profiling data, C2 routinely uncommon-traps a lot of code that is presumed to be never executed. This apparently is a norm in CTW tests: CTW runners never execute code, and so only the most basic java.base classes are having any profile. This seems to limit the scope of CTW testing. >> >> I think we need to run CTW in the mode that exposes more code to the compiler optimizations. >> >> Case in point: [JDK-8348572](https://bugs.openjdk.org/browse/JDK-8348572), which reliably fails with more aggressive compilation mode. >> >> Additional testing: >> - [x] Linux x86-64 server fastdebug, `applications/ctw/modules` >> - [x] Linux AArch64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains four additional commits since the last revision: > > - Merge branch 'master' into JDK-8348570-ctw-uncommon-traps > - Merge branch 'master' into JDK-8348570-ctw-uncommon-traps > - Also do markMethodProfiled for extra scope > - Fix `applications/ctw/modules` passes for me on bigger machines as well, so I think we are ready to go. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23296#issuecomment-2637306557 From epeter at openjdk.org Wed Feb 5 15:58:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 15:58:12 GMT Subject: RFR: 8348570: CTW: Expose the code hidden by uncommon traps [v4] In-Reply-To: References: Message-ID: <_aNRIvjPG9HedRvscXR93qIYere4JWeetbA3KnwOOvo=.7b4ddf03-a7be-4f6a-b283-83f4494e5bdb@github.com> On Tue, 4 Feb 2025 17:59:47 GMT, Vladimir Kozlov wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8348570-ctw-uncommon-traps >> - Merge branch 'master' into JDK-8348570-ctw-uncommon-traps >> - Also do markMethodProfiled for extra scope >> - Fix > > Good. I leave this to @vnkozlov and @TobiHartmann to test and review on our side. But looks exciting.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23296#issuecomment-2637329325 From mli at openjdk.org Wed Feb 5 16:02:13 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 5 Feb 2025 16:02:13 GMT Subject: RFR: 8349428: RISC-V: "bad alignment" with -XX:-AvoidUnalignedAccesses after JDK-8347489 [v2] In-Reply-To: <3UFzISL6AR_wdZlxWIoBYohTg6Qa0Bgremw2cNHQ6Cg=.a2b7841c-76a0-4029-88e8-7e7095bae8d8@github.com> References: <3UFzISL6AR_wdZlxWIoBYohTg6Qa0Bgremw2cNHQ6Cg=.a2b7841c-76a0-4029-88e8-7e7095bae8d8@github.com> Message-ID: On Wed, 5 Feb 2025 10:36:52 GMT, Gui Cao wrote: >> Hi, please review this small change fixing an assertion error. >> As the alignment of the loading addresses is only ensured under -XX:-AvoidUnalignedAccesses, we should only enable the related assertions about the alignment under this option. >> >> >> ### Testing >> - [x] Sanity tested with -XX:-AvoidUnalignedAccesses using fastdebug build. >> - [ ] Run tier1 tests on SOPHON SG2042 (fastdebug) >> - [ ] Run tier1-3 tests on Milk-V Megrez (release) > > Gui Cao has updated the pull request incrementally with one additional commit since the last revision: > > Fix build Nice, I'm trying to send out a PR for it too. ------------- Marked as reviewed by mli (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23459#pullrequestreview-2596261477 From jkarthikeyan at openjdk.org Wed Feb 5 16:03:22 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 5 Feb 2025 16:03:22 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v2] In-Reply-To: References: Message-ID: <1kjZnYmjzNrXuXFsPlpCY_LAPHEQz30i_RpDmr3Xh80=.307d3d09-c4d2-4211-8e0b-5e8beb3b8f3c@github.com> On Wed, 5 Feb 2025 14:11:25 GMT, Roland Westrelin wrote: >> This change refactors `RShiftI`/`RShiftL` `Ideal`, `Identity` and >> `Value` because the `int` and `long` versions are very similar and so >> there's no logic duplication.
In the process, support for some extra >> transformations is added to `RShiftL`. I also added some new test >> cases. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review This is really nice! I'd wondered why there was no `RShiftL::Ideal`, and it's nice to have it handled in a generic way with the integer version. I left mostly code style comments here. src/hotspot/share/opto/mulnode.cpp line 1325: > 1323: // Check for (x & 0xFF000000) >> 24, whose mask can be made smaller. > 1324: // Such expressions arise normally from shift chains like (byte)(x >> 24). > 1325: const Node *mask = in(1); Suggestion: const Node* mask = in(1); src/hotspot/share/opto/mulnode.cpp line 1367: > 1365: const Type* RShiftNode::ValueIL(PhaseGVN* phase, BasicType bt) const { > 1366: const Type *t1 = phase->type(in(1)); > 1367: const Type *t2 = phase->type(in(2)); Suggestion: const Type* t1 = phase->type(in(1)); const Type* t2 = phase->type(in(2)); src/hotspot/share/opto/mulnode.cpp line 1399: > 1397: assert(lo <= hi, "must have valid bounds"); > 1398: #ifdef ASSERT > 1399: if (bt ==T_INT) { Suggestion: if (bt == T_INT) { Could this assert be made generic to handle T_LONG too?
src/hotspot/share/opto/mulnode.cpp line 1462: > 1460: return progress; > 1461: } > 1462: const TypeInt *t3; // type of in(1).in(2) Suggestion: const TypeInt* t3; // type of in(1).in(2) src/hotspot/share/opto/mulnode.cpp line 1517: > 1515: } > 1516: > 1517: Node *RShiftLNode::Ideal(PhaseGVN *phase, bool can_reshape) { Suggestion: Node* RShiftLNode::Ideal(PhaseGVN *phase, bool can_reshape) { src/hotspot/share/opto/mulnode.hpp line 322: > 320: virtual Node* Identity(PhaseGVN* phase); > 321: > 322: virtual Node* Ideal(PhaseGVN *phase, bool can_reshape); Suggestion: virtual Node* Ideal(PhaseGVN* phase, bool can_reshape); src/hotspot/share/opto/type.cpp line 1533: > 1531: } > 1532: > 1533: const TypeInteger *TypeInteger::make(jlong lo, BasicType bt) { Suggestion: const TypeInteger* TypeInteger::make(jlong lo, BasicType bt) { src/hotspot/share/utilities/globalDefinitions.hpp line 799: > 797: return BitsPerJavaInteger; > 798: } > 799: return BitsPerJavaLong; I think it'd be nice to add `assert(bt == T_LONG, "unsupported");` before the last return, like in the helper methods above. 
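
One of the source comments quoted in this review mentions the pattern `(x & 0xFF000000) >> 24`, "whose mask can be made smaller". As a hedged illustration of why the same reasoning carries over from `int` to `long`, the sketch below checks the arithmetic identity `(x & m) >> s == (x >> s) & (m >> s)` for both widths. The class and method names are hypothetical, and the assumption that this identity is what the shared `Ideal` code relies on is mine, not a statement from the PR.

```java
import java.util.Random;

public class ShiftMaskIdentity {
    // Mask first, then arithmetic shift right (the shape named in the comment).
    static int maskThenShift(int x, int mask, int shift) {
        return (x & mask) >> shift;
    }

    // Shift first, then apply the shifted (smaller) mask.
    static int shiftThenMask(int x, int mask, int shift) {
        return (x >> shift) & (mask >> shift);
    }

    // The long versions have exactly the same structure.
    static long maskThenShift(long x, long mask, int shift) {
        return (x & mask) >> shift;
    }

    static long shiftThenMask(long x, long mask, int shift) {
        return (x >> shift) & (mask >> shift);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int n = 0; n < 1_000; n++) {
            int xi = rnd.nextInt();
            long xl = rnd.nextLong();
            if (maskThenShift(xi, 0xFF000000, 24) != shiftThenMask(xi, 0xFF000000, 24)) {
                throw new AssertionError("int mismatch for " + xi);
            }
            if (maskThenShift(xl, 0xFF00000000000000L, 56) != shiftThenMask(xl, 0xFF00000000000000L, 56)) {
                throw new AssertionError("long mismatch for " + xl);
            }
        }
        System.out.println("identity held for all sampled values");
    }
}
```

For the masks used here the shifted mask is `-1`, so the `&` disappears entirely and only the shift remains.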
------------- PR Review: https://git.openjdk.org/jdk/pull/23438#pullrequestreview-2596217801 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1943215091 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1943210959 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1943233567 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1943209867 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1943209437 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1943219249 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1943207959 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1943208909 From dlunden at openjdk.org Wed Feb 5 16:21:11 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Wed, 5 Feb 2025 16:21:11 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v13] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 14:32:11 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/gcm.cpp line 831: >> >>> 829: // | 8 membar_release <- 7 | early >>> 830: // | ... | >>> 831: // +-----------------------+ >> >> I just discussed this example with @chhagedorn >> >> // Patch the existing phi to select an input from the merge: >> // Phi:AT1(...MergeMem(m0, m1, m2)...) into >> // Phi:AT1(...m1...) >> int alias_idx = phase->C->get_alias_index(at); >> >> And >> >> // Phi(...MergeMem(m0, m1:AT1, m2:AT2)...) into >> // MergeMem(Phi(...m0...), Phi:AT1(...m1...), Phi:AT2(...m2...)) >> >> >> In `cfgnode.cpp`, we try to move the MergeMem after Phi. Why does this not happen in this example? >> >> There are many cases in that code... but it seems to me that here something may be missing. I have not given it more time though. 
>> >> If we knew that MergeMem always happened after the Phi, then we could only search from the `initial_mem`, and would walk through all relevant MergeMem, right? >> >> This is just an intuition, but maybe having MergeMem after Phi is a fundamental assumption. Or maybe it just happens in all cases, and yours is the only we found so far where that is not possible. >> >> What do you think? > > @dlunde I really don't want to block you here. I never understood the memory graph above the initial mem. Now that I see the example I'm getting new ideas ? Thanks for the comment @eme64 @chhagedorn! Happy to iterate, never hesitate to provide comments. I do recall we discussed these MergeMem/Phi swap idealizations offline last week. I think this looks very promising. Looking at the two rules you mention and applying them iteratively to our example 7 Phi(3 MergeMem(1:A, 2:L), 5 MergeMem(1:A, 4:L)) I get 7 Phi(3 MergeMem(1:A, 2:L), 5 MergeMem(1:A, 4:L)) into MergeMem(Phi:A(1:A, 5 MergeMem(1:A, 4:L)), Phi:L(2:L, 5 MergeMem(1:A, 4:L))) into MergeMem(MergeMem(Phi:A(1:A, 1:A), Phi:L(1:A, 4:L)), Phi:L(2:L, 5 MergeMem(1:A, 4:L))) into MergeMem(MergeMem(Phi:A(1:A, 1:A), Phi:L(1:A, 4:L)), Phi:L(2:L, 4:L))) Then, after this, we should be able to merge the resulting `Phi:L(2:L, 4:L)` with 6 Phi (`initial_mem`). So, essentially, we have broken out the `L` part of `7 Phi` and realized it is the same as `6 Phi`. I guess this is what you are also saying? For EXAMPLE 2: 4 Phi(1:A, 3 MergeMem(1:A, 2:!L)) into MergeMem(Phi(1:A, 1:A), Phi(1:A, 2:!L)) `Phi(1:A, 1:A)` is `1:A` so then we have a Phi-free path from `1 MachProj` to `5 membar_release` as well! I'll have a look and see if I can figure out why we do not apply such idealizations here. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22852#discussion_r1943240611 From epeter at openjdk.org Wed Feb 5 16:21:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 16:21:12 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v13] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 16:02:59 GMT, Daniel Lundén wrote: >> @dlunde I really don't want to block you here. I never understood the memory graph above the initial mem. Now that I see the example I'm getting new ideas. > > Thanks for the comment @eme64 @chhagedorn! Happy to iterate, never hesitate to provide comments. I do recall we discussed these MergeMem/Phi swap idealizations offline last week. > > I think this looks very promising. Looking at the two rules you mention and applying them iteratively to our example > > 7 Phi(3 MergeMem(1:A, 2:L), 5 MergeMem(1:A, 4:L)) > > I get > > 7 Phi(3 MergeMem(1:A, 2:L), 5 MergeMem(1:A, 4:L)) into > MergeMem(Phi:A(1:A, 5 MergeMem(1:A, 4:L)), > Phi:L(2:L, 5 MergeMem(1:A, 4:L))) into > MergeMem(MergeMem(Phi:A(1:A, 1:A), Phi:L(1:A, 4:L)), > Phi:L(2:L, 5 MergeMem(1:A, 4:L))) into > MergeMem(MergeMem(Phi:A(1:A, 1:A), Phi:L(1:A, 4:L)), > Phi:L(2:L, 4:L)) > > Then, after this, we should be able to merge the resulting `Phi:L(2:L, 4:L)` with 6 Phi (`initial_mem`). So, essentially, we have broken out the `L` part of `7 Phi` and realized it is the same as `6 Phi`. I guess this is what you are also saying? > > For EXAMPLE 2: > > 4 Phi(1:A, 3 MergeMem(1:A, 2:!L)) into > MergeMem(Phi(1:A, 1:A), Phi(1:A, 2:!L)) > > `Phi(1:A, 1:A)` is `1:A` so then we have a Phi-free path from `1 MachProj` to `5 membar_release` as well! > > I'll have a look and see if I can figure out why we do not apply such idealizations here. That sounds about right, yes! Thanks for persisting here. I'm really looking forward to what you find.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22852#discussion_r1943254560 From roland at openjdk.org Wed Feb 5 16:49:26 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 5 Feb 2025 16:49:26 GMT Subject: RFR: 8349479: C2: when a Type node becomes dead, make CFG path that uses it unreachable Message-ID: This is primarily motivated by 8275202 (C2: optimize out more redundant conditions). In the following code snippet: int[] array = new int[arraySize]; if (j <= arraySize) { if (i >= 0) { if (i < j) { int v = array[i]; (`arraySize` is a constant) at the range check, `j` is known to be in `[min, arraySize]` as a consequence, `i` is known to be `[0, arraySize-1]`. The range check can be eliminated. Now, if later, `i` constant folds to some value that's positive but out of range for the array: - if that happens when the new pass runs, then it can prove that: if (i < j) { is never taken. - if that happens during IGVN or CCP however, that condition is not constant folded. And because the range check was removed, there's no guard protecting the range check `CastII`. It becomes `top` and, as a result, the graph can become broken. What I propose here is that when the `CastII` becomes dead, any CFG paths that use the `CastII` node is made unreachable. So in pseudo code: int[] array = new int[arraySize]; if (j <= arraySize) { if (i >= 0) { if (i < j) { halt(); Finding the CFG paths is implemented in the patch by following the uses of the node until a CFG node or a `Phi` is encountered. The patch applies this to all `Type` nodes as with 8275202, I also ran in some rare corner cases with other types of nodes. The exception is `Phi` nodes which may not be as easy to handle (and for which I had no issue with 8275202). Finally, the patch includes a test case that's unrelated to the discussion of 8275202 above. In that test case, a `CastII` becomes top but the test that guards it doesn't constant fold. 
The root cause is a transformation of: `(CastII (AddI` into `(AddI (CastII ) (CastII)` which causes the resulting node to have a wider type. The `CastII` captures a type before the transformation above happens. Once it has happened, the guard for the `CastII` can't be constant folded when an out of bound value occurs. This is likely fixable some other way (even though it doesn't seem straightforward). Given the long history of similar issues (and the test case that shows that more of them are hiding), I think it would make sense to try some other way of approaching them. ------------- Commit messages: - whitespace - fix & test Changes: https://git.openjdk.org/jdk/pull/23468/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23468&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349479 Stats: 192 lines in 9 files changed: 191 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23468.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23468/head:pull/23468 PR: https://git.openjdk.org/jdk/pull/23468 From mablakatov at openjdk.org Wed Feb 5 17:09:16 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Wed, 5 Feb 2025 17:09:16 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v2] In-Reply-To: <2jvFY4hq9FPdk9e4Zg6LRPdRVhDTGgxofL-we8c-mns=.4e6ce509-67a4-4e46-a661-2b0951f88731@github.com> References: <2jvFY4hq9FPdk9e4Zg6LRPdRVhDTGgxofL-we8c-mns=.4e6ce509-67a4-4e46-a661-2b0951f88731@github.com> Message-ID: On Tue, 4 Feb 2025 18:52:55 GMT, Emanuel Peter wrote: >> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: >> >> Use EXT instead of COMPACT to split a vector into two halves >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >>
LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> Fujitsu A64FX (SVE 512-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2139: > >> 2137: // source vector to get to a 128b vector that fits into a SIMD&FP register. After that point ASIMD >> 2138: // instructions are used. >> 2139: void C2_MacroAssembler::reduce_mul_fp_gt128b(FloatRegister dst, BasicType bt, FloatRegister fsrc, > > Drive-by question: > This is recursive folding: halve the vector and add it that way. > > What about the linear reduction, is that also implemented somewhere? We need that for vector reduction when we come from SuperWord, and have a strict order requirement, to avoid rounding divergences. We have strictly-ordered intrinsics for add reduction: https://github.com/openjdk/jdk/blob/19399d271ef00f925232fbbe9087b5772f2fca01/src/hotspot/cpu/aarch64/aarch64_vector.ad#L2903 None of Arm64 Neon/SVE/SVE2 has a dedicated mul reduction instruction, thus it's implemented recursively where strict ordering isn't required (for Vector API). For auto-vectorization we impose `_requires_strict_order` on `MulReductionVFNode`, `MulReductionVDNode`. Although I suspect that we might have missed something as I see a speedup for `VectorReduction2.WithSuperword.doubleMulBig` / `floatMulBig` which I didn't expect to be the case.
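To make the ordering concern above concrete, here is a minimal scalar Java sketch. It is not HotSpot code and not the intrinsic under review; the class and method names are hypothetical, and the "halving" loop only mimics the lower-half-times-upper-half folding described in the quoted comment. It contrasts a strictly ordered linear product with a product built by recursive folding, using inputs chosen so that the linear order overflows to infinity while the folded order stays finite.

```java
public final class MulReductionOrderSketch {

    /** Strictly ordered product: (((1 * v0) * v1) * v2) * ... */
    public static float mulLinear(float[] values) {
        float acc = 1.0f;
        for (float v : values) {
            acc *= v;
        }
        return acc;
    }

    /**
     * Recursive folding: multiply the lower half by the upper half lane by lane,
     * then repeat on the lower half until one lane is left.
     * The length must be a power of two (an assumption of this sketch).
     */
    public static float mulByHalving(float[] values) {
        int n = values.length;
        if (n == 0 || (n & (n - 1)) != 0) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        float[] work = values.clone();
        for (int len = n; len > 1; len /= 2) {
            int half = len / 2;
            for (int i = 0; i < half; i++) {
                work[i] *= work[i + half];
            }
        }
        return work[0];
    }

    public static void main(String[] args) {
        // Powers of two multiply exactly, so both orders agree here.
        float[] exact = {2.0f, 4.0f, 0.5f, 8.0f};
        System.out.println(mulLinear(exact) + " vs " + mulByHalving(exact));

        // 1e30f * 1e30f overflows float, so the linear order hits Infinity early,
        // while the folded order pairs each large value with a small one first.
        float[] extreme = {1e30f, 1e30f, 1e-30f, 1e-30f};
        System.out.println(mulLinear(extreme) + " vs " + mulByHalving(extreme));
    }
}
```

This is the kind of divergence a strict order requirement is meant to rule out for auto-vectorized loops, whereas the Vector API case discussed above does not need a fixed order.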
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1943343335 From duke at openjdk.org Wed Feb 5 17:28:17 2025 From: duke at openjdk.org (Matthias Ernst) Date: Wed, 5 Feb 2025 17:28:17 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v17] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 09:19:01 GMT, Emanuel Peter wrote: > Maybe we can adjust the name of `AndIL_is_zero_element_under_mask` somehow, so that the assumption is made explicit? My math isn't super strong. I do find the name "zero element" more compelling than the previous "is_always_zero", since zero element seems to be a well-known term: https://en.wikipedia.org/wiki/Zero_element . That page also uses "Additive Identity", happy to use that term, or, "is_neutral_wrt_addition_under_mask" . The important part, as you and @merykitty much earlier correctly pointed out, is that `add1&mask==0` is _not_ sufficient, there needs to be an implied "for all" quantifier. I would be happiest if we can * explicitly stress that is_zero_element (or whatever name we choose) can return false negatives, we're not trying to determine this for arbitrary expr/mask combos, * that it is primarily covering the case "mask = 2^X-1" aka "% 2^x" (that's what we get for alignment checks) => it means we're testing whether `expr` is a multiple of `2^X` => is "congruent to zero modulo (mask+1)" ([modular arithmetic](https://en.wikipedia.org/wiki/Modular_arithmetic)). That is the motivating (and maybe the only) case that matters, and we could stop there. * But we can document that this version takes it one step further, namely the same is trivially implied if there are some bits "missing" in mask. Hence the check is whether `expr` is a multiple of `2 ^ log2 mask` (equiv num_trailing_bits >= mask_width).
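As a rough illustration of the check described above, here is a small Java sketch on concrete `long` values. It is only a model: the real code works on C2 nodes and type ranges, and `addendVanishesUnderMask` and `holdsForSamples` are hypothetical names, not HotSpot functions. The mask width is taken as 64 minus the number of leading zeros of the mask, following the pseudo-code used elsewhere in this thread, and negative or zero masks are simply rejected.

```java
public final class MaskCheckSketch {

    /**
     * Returns true if adding `addend` can never change the bits selected by `mask`,
     * i.e. (x + addend) & mask == x & mask for every x.
     * Sketch only: negative masks and mask == 0 are not handled and yield false.
     */
    public static boolean addendVanishesUnderMask(long addend, long mask) {
        if (mask <= 0) {
            return false;
        }
        // Number of low bits the mask can possibly touch.
        int maskWidth = Long.SIZE - Long.numberOfLeadingZeros(mask);
        // All of those low bits of the addend must be zero.
        return Long.numberOfTrailingZeros(addend) >= maskWidth;
    }

    /** Brute-force cross-check of the "for all x" claim on a small sample range. */
    public static boolean holdsForSamples(long addend, long mask) {
        for (long x = -1000; x <= 1000; x++) {
            if (((x + addend) & mask) != (x & mask)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // {addend, mask}: the alignment case, a failing case, and a mask with "missing" bits.
        long[][] cases = {{256, 255}, {128, 255}, {16, 0b1010}};
        for (long[] c : cases) {
            System.out.println("addend=" + c[0] + " mask=" + c[1]
                    + " predicted=" + addendVanishesUnderMask(c[0], c[1])
                    + " sampled=" + holdsForSamples(c[0], c[1]));
        }
    }
}
```

The first case mirrors the `+ 256` under `& 255` pattern from the PR description; the third shows that the same reasoning carries over when some low bits are absent from the mask.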
------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2637569499 From dhanalla at openjdk.org Wed Feb 5 17:34:00 2025 From: dhanalla at openjdk.org (Dhamoder Nalla) Date: Wed, 5 Feb 2025 17:34:00 GMT Subject: RFR: 8341293: Split field loads through Nested Phis [v7] In-Reply-To: References: <18TQt6vxN9KxSVwyeQtAWde-ezaVuUEioAl_5_3sAeE=.e5e76fb6-04a7-4f6f-9377-f1e64837ada6@github.com> Message-ID: <-Fs-Nim4P8TQMnjE9bs2HBY34vQtzhzH2dsU7MDlZrI=.34991658-bed4-46ac-b213-f4988c0f9c8b@github.com> On Tue, 4 Feb 2025 19:03:10 GMT, Emanuel Peter wrote: > @dhanalla Would you like this to be reviewed? We generally don't re-review until we get pinged again. The idea is that you are maybe still working on it, and so there is no point in reviewing half-processed code. So once you are happy, you can let us know ;) Thanks, @eme64 for checking with me. Yes, it's ready for review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21270#issuecomment-2637581366 From dlunden at openjdk.org Wed Feb 5 17:40:24 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Wed, 5 Feb 2025 17:40:24 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v13] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 16:11:15 GMT, Emanuel Peter wrote: >> Thanks for the comment @eme64 @chhagedorn! Happy to iterate, never hesitate to provide comments. I do recall we discussed these MergeMem/Phi swap idealizations offline last week. >> >> I think this looks very promising. 
Looking at the two rules you mention and applying them iteratively to our example >> >> 7 Phi(3 MergeMem(1:A, 2:L), 5 MergeMem(1:A, 4:L)) >> >> I get >> >> 7 Phi(3 MergeMem(1:A, 2:L), 5 MergeMem(1:A, 4:L)) into >> MergeMem(Phi:A(1:A, 5 MergeMem(1:A, 4:L)), >> Phi:L(2:L, 5 MergeMem(1:A, 4:L))) into >> MergeMem(MergeMem(Phi:A(1:A, 1:A), Phi:L(1:A, 4:L)), >> Phi:L(2:L, 5 MergeMem(1:A, 4:L))) into >> MergeMem(MergeMem(Phi:A(1:A, 1:A), Phi:L(1:A, 4:L)), >> Phi:L(2:L, 4:L)) >> >> Then, after this, we should be able to merge the resulting `Phi:L(2:L, 4:L)` with 6 Phi (`initial_mem`). So, essentially, we have broken out the `L` part of `7 Phi` and realized it is the same as `6 Phi`. I guess this is what you are also saying? >> >> For EXAMPLE 2: >> >> 4 Phi(1:A, 3 MergeMem(1:A, 2:!L)) into >> MergeMem(Phi(1:A, 1:A), Phi(1:A, 2:!L)) >> >> `Phi(1:A, 1:A)` is `1:A` so then we have a Phi-free path from `1 MachProj` to `5 membar_release` as well! >> >> I'll have a look and see if I can figure out why we do not apply such idealizations here. > > That sounds about right, yes! Thanks for persisting here. I'm really looking forward to what you find. Interestingly, this is the line in `cfgnode.cpp` that blocks the MergeMem/Phi swap idealization: // This restriction is temporarily necessary to ensure termination: if (!saw_self && adr_type() == TypePtr::BOTTOM) merge_width = 0; If I comment out the line it solves all the failures we have seen. I double-checked that we then perform the exact MergeMem/Phi swap idealizations discussed above. I am wondering what the proper solution is here. I will, of course, investigate if it is possible to loosen the restriction and still ensure termination. On the other hand, it also seems strange that the anti-dependence search is so sensitive to missing idealizations?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22852#discussion_r1943389830 From amitkumar at openjdk.org Wed Feb 5 17:57:19 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Wed, 5 Feb 2025 17:57:19 GMT Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v4] In-Reply-To: <-aHCYC9iVc4eMZ3pMfiDpqaW-wGM_s3zRMiVBWoadCM=.910336cd-3be2-45b5-9874-63b71abf38f8@github.com> References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> <-aHCYC9iVc4eMZ3pMfiDpqaW-wGM_s3zRMiVBWoadCM=.910336cd-3be2-45b5-9874-63b71abf38f8@github.com> Message-ID: On Wed, 5 Feb 2025 12:38:06 GMT, Roberto Castañeda Lozano wrote: >> G1 barriers can be safely elided from writes to newly allocated objects as long as no safepoint is taken between the allocation and the write. This changeset complements early G1 barrier elision (performed by the platform-independent phases of C2, and limited to writes immediately following allocations) with a more general elision pass done at a late stage. >> >> The late elision pass exploits that it runs at a stage where the relative order of memory accesses and safepoints cannot change anymore to elide barriers from initialization writes that do not immediately follow the corresponding allocation, e.g. in conditional initialization writes: >> >> >> o = new MyObject(); >> if (...) { >> o.myField = ...; // barrier elided only after this changeset >> // (assuming no safepoint in the if condition) >> } >> >> >> or in initialization writes placed after exception-throwing checks: >> >> >> o = new MyObject(); >> if (...) { >> throw new Exception(""); >> } >> o.myField = ...; // barrier elided only after this changeset >> // (assuming no safepoint in the above if condition) >> >> >> These patterns are commonly found in Java code, e.g.
in the core libraries: >> >> - [conditional initialization](https://github.com/openjdk/jdk/blob/25fecaaf87400af535c242fe50296f1f89ceeb16/src/java.base/share/classes/java/lang/String.java#L4850), or >> >> - [initialization after exception-throwing checks (in the superclass constructor)](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/nio/X-Buffer.java.template#L324). >> >> The optimization also enhances barrier elision for array initialization writes, for example eliding barriers from small array initialization loops (for which safepoints are not inserted): >> >> >> Object[] a = new Object[...]; >> for (int i = 0; i < a.length; i++) { >> a[i] = ...; // barrier elided only after this changeset >> } >> >> >> or eliding barriers from array initialization writes with unknown array index: >> >> >> Object[] a = new Object[...]; >> a[index] = ...; // barrier elided only after this changeset >> >> >> The logic used to perform this additional barrier elision is a subset of a pre-existing ZGC-specific optimization. This changeset simply reuses the relevant subset (barrier elision for writes to newly-allocated objects) by extracting the core of the optimization logic from `zBarrierSetC2.cpp` into the GC-shared file `barrierSetC2.cpp`. The functions `block_has_safepoint`, `block_inde... 
> > Roberto Castañeda Lozano has updated the pull request incrementally with two additional commits since the last revision: > > - Add some more tests to exercise barrier elision for atomic operations > - Elide barriers from atomic operations on newly allocated objects as well I see a TestG1BarrierGeneration.java failure :( [TestG1BarrierGeneration_jtr.log](https://github.com/user-attachments/files/18676532/TestG1BarrierGeneration_jtr.log) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23235#issuecomment-2637624720 From duke at openjdk.org Wed Feb 5 17:57:21 2025 From: duke at openjdk.org (Matthias Ernst) Date: Wed, 5 Feb 2025 17:57:21 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v17] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 07:15:56 GMT, Emanuel Peter wrote: >> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: >> >> jlong, not long > > src/hotspot/share/opto/mulnode.cpp line 2059: > >> 2057: >> 2058: // Returns a lower bound on the number of trailing zeros in expr, or -1 if the number >> 2059: // cannot be determined. > > Why not just return `0` if we cannot determine it? That would still be a correct lower bound, right? This can run into a problematic corner case depending on the order of checks, see the other comment. > src/hotspot/share/opto/mulnode.cpp line 2104: > >> 2102: static bool AndIL_is_zero_element_under_mask(const PhaseGVN* phase, const Node* expr, const Node* mask, BasicType bt) { >> 2103: jint expr_trailing_zeros = AndIL_min_trailing_zeros(phase, expr, bt); >> 2104: if (expr_trailing_zeros < 0) { > > It feels a little strange that the number of trailing zeros could be negative... > That's why I would return 0 if we can prove nothing. It is still clear that we can do nothing here if it is zero, so we can just compare `<= 0`. > > Or what was the reason for returning `-1`?
Two motivations here, that were too implicit, both hinging on the "0 >= 0" case: * if we just start with "if (mask==0) return true", then we can crash later with a `not monotonic` error in case `expr` is in some form of non-well-formed state (e.g. when shift_t == nullptr). That's what I was trying to distinguish. Not very obvious. * if we return false on trailing_zeros == 0, it appears like we don't handle the (edge case) "(x + 3) & 0". That irks a reader, but I do agree we should punt this case because that will be handled downstream anyway (the whole AndNode goes away). I think it's best to reorder as follows: // we're only trying to cover actual shifts, not << 0 if (mask_lo < 0 || mask_hi == 0) return false // mask == 0 handled in MulNode::Ideal mask_width = 64 - count_leading_zeros(mask_hi) return trailing_zeros(expr) >= mask_width // don't need to worry about the 0>=0 case here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22856#discussion_r1943398162 PR Review Comment: https://git.openjdk.org/jdk/pull/22856#discussion_r1943322101 From aph at openjdk.org Wed Feb 5 17:59:21 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 5 Feb 2025 17:59:21 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v2] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 11:20:59 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. 
>> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks. >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> >> Fujitsu A64FX (SVE 512-bit): >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: > > Use EXT instead of COMPACT to split a vector into two halves > > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms > IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms > LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms > > Fujitsu A64FX (SVE 512-bit) > > Benchmark (size) Mode master PR Units >
ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms > IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms > LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 1915: > 1913: %} > 1914: > 1915: instruct reduce_mulD(vRegD dst, vRegD dsrc, vReg vsrc, vReg tmp) %{ Please consider that `reduce_mulF_gt128b` and `reduce_mulD_gt128b` might be similar enough that they should be combined in the same way as other patterns in this file. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1943420223 From kvn at openjdk.org Wed Feb 5 20:15:10 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 5 Feb 2025 20:15:10 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash [v3] In-Reply-To: References: Message-ID: <1GBWBQfWNIwLEF26VW0tecseBegwuuRUDG-rNg1zdoU=.63aa4380-5c75-45ac-86dd-9c9fe308b9dc@github.com> On Tue, 4 Feb 2025 20:56:53 GMT, Tom Rodriguez wrote: >> This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. > > Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision: > > improve comments Seems fine. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23444#pullrequestreview-2596877352 From duke at openjdk.org Wed Feb 5 20:27:58 2025 From: duke at openjdk.org (Matthias Ernst) Date: Wed, 5 Feb 2025 20:27:58 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v18] In-Reply-To: References: Message-ID: > Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. > > Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: > > > (base + (index + 1) << 8) & 255 > => MulNode > (base + (index << 8 + 256)) & 255 > => AddNode > ((base + index << 8) + 256) & 255 > > > Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: > > > ((base + index << 8) + 256) & 255 > => MulNode (this PR) > (base + index << 8) & 255 > => MulNode (PR #6697) > base & 255 (loop invariant) > > > Implementation notes: > * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. > * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ > * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). 
Let me know if would like to see separate cases for these.~ Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: Apply suggestions from code review Co-authored-by: Emanuel Peter ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22856/files - new: https://git.openjdk.org/jdk/pull/22856/files/58375582..c39d2234 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=17 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=16-17 Stats: 33 lines in 1 file changed: 13 ins; 0 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/22856.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22856/head:pull/22856 PR: https://git.openjdk.org/jdk/pull/22856 From duke at openjdk.org Wed Feb 5 21:19:56 2025 From: duke at openjdk.org (Matthias Ernst) Date: Wed, 5 Feb 2025 21:19:56 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v19] In-Reply-To: References: Message-ID: > Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. > > Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: > > > (base + (index + 1) << 8) & 255 > => MulNode > (base + (index << 8 + 256)) & 255 > => AddNode > ((base + index << 8) + 256) & 255 > > > Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: > > > ((base + index << 8) + 256) & 255 > => MulNode (this PR) > (base + index << 8) & 255 > => MulNode (PR #6697) > base & 255 (loop invariant) > > > Implementation notes: > * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. 
> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ > * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: Comments, "Proof", order of checks. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22856/files - new: https://git.openjdk.org/jdk/pull/22856/files/c39d2234..1d23c1a4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=18 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=17-18 Stats: 52 lines in 1 file changed: 18 ins; 12 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/22856.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22856/head:pull/22856 PR: https://git.openjdk.org/jdk/pull/22856 From duke at openjdk.org Wed Feb 5 21:24:17 2025 From: duke at openjdk.org (Matthias Ernst) Date: Wed, 5 Feb 2025 21:24:17 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v19] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 21:19:56 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. >> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. 
This PR enables further reduction: >> >> >> ((base + index << 8) + 256) & 255 >> => MulNode (this PR) >> (base + index << 8) & 255 >> => MulNode (PR #6697) >> base & 255 (loop invariant) >> >> >> Implementation notes: >> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. >> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ >> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > Comments, "Proof", order of checks. Thanks for the suggestions. Applied your edits, made some of my own, lmk what you think. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2638052823 From duke at openjdk.org Wed Feb 5 21:24:18 2025 From: duke at openjdk.org (Matthias Ernst) Date: Wed, 5 Feb 2025 21:24:18 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v17] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 08:15:42 GMT, Emanuel Peter wrote: >> I would leave any details about `addition` to the use of this function, and any discussion how we find the trailing zeros to `AndIL_min_trailing_zeros`. Otherwise it's a little confusing. >> >> It's nice to have examples, and give the reader an intuition of what you are doing in the logic below. > > Feel free to tweak the description further ;) > I would leave any details about addition to the use of this function I don't think that works well, since the very definition of "what is a zero" is defined in terms of being a neutral element in an addition. So I do think that discussion needs to be here and not in the caller. 
The caller just reduces the addition. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22856#discussion_r1943676182 From dlong at openjdk.org Wed Feb 5 22:06:09 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 5 Feb 2025 22:06:09 GMT Subject: RFR: 8349102: Test compiler/arguments/TestCodeEntryAlignment.java failed: assert(allocates2(pc)) failed: not in CodeBuffer memory In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 13:44:10 GMT, Andrew Dinn wrote: > assert(allocates2(pc)) failed: not in CodeBuffer memory > > The StubGenerator compiler blob runs out of space when TestCodeEntryAlignment is run on macos/x86_64 on an avx2-only CPU. This only happens in the worst case with command line options `-XX:CodeCacheSegmentSize=1024 -XX:CodeEntryAlignment=1024`. > > On linux/x86_64 the test succeeds in that worst case when run on an avx512-enabled CPU but with only 980 bytes of headroom. > > This patch increments the buffer size on x86_64 to ensure both the avx2 and avx3 cases have enough headroom. > > n.b. the increment has deliberately been made x86_64-specific rather than macos-specific, even though this problem manifests when testing MacOS and does not manifest when testing Linux. The disparity in generated stubs size actually relates to the capabilities of the CPU and is independent of OS. This seems fine, but it has always bothered me that we have to set and occasionally adjust these constant sizes at all. If we were multi-threaded when doing the allocations I guess it would make sense, but since we are single-threaded, why not use the entire available space and then give back whatever we don't use? ------------- Marked as reviewed by dlong (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23439#pullrequestreview-2597084542 From duke at openjdk.org Wed Feb 5 23:36:00 2025 From: duke at openjdk.org (Nicole Xu) Date: Wed, 5 Feb 2025 23:36:00 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException Message-ID: Suite MaskedLogicOpts.maskedLogicOperationsLong512() failed on both x86 and AArch64 with the following error: java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249 The variable `long256_arr_idx` is misused when indexing 'LongVector l2, l3, l4, l5' in function `maskedLogicOperationsLongKernel()`. 'long256_arr_idx' increases by 4 every time the benchmark runs and ensures the incremented value remains within the bounds of the array. However, for `LongVector.SPECIES_512`, it loads 8 numbers from the array each time the benchmark runs, resulting in an out-of-range indexing issue. Hence, we revised the index variables from `long256_arr_idx` to `long512_arr_idx`, which has a stride of 8, to ensure that the loaded vector is inside of the array boundary for all vector species. This is also consistent with other kernel functions. Additionally, some defined but unused variables have been removed. 
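The stride mismatch described above can be modelled without the Vector API. The following is only a minimal sketch under stated assumptions: `StrideSketch`, `ARRAY_LEN`, `advance` and `firstOutOfBounds` are hypothetical names, the wrap-around rule is a guess at how such a cursor is typically kept in bounds, and the array length of 256 is assumed (it is consistent with the reported bound if that bound is 256 - 8 + 1 = 249, but it is not taken from the benchmark source).

```java
final class StrideSketch {
    // Assumed array length; not taken from the benchmark source.
    static final int ARRAY_LEN = 256;

    // Hypothetical cursor update: advance by `stride`, wrapping to 0 once a
    // further load of `stride` elements would no longer fit in the array.
    static int advance(int idx, int stride) {
        int next = idx + stride;
        return (next + stride <= ARRAY_LEN) ? next : 0;
    }

    // Returns the first cursor value at which a load of `lanes` elements would
    // run past the end of the array, or -1 if no such value is reached.
    static int firstOutOfBounds(int stride, int lanes) {
        int idx = 0;
        for (int step = 0; step < ARRAY_LEN; step++) {
            if (idx + lanes > ARRAY_LEN) {
                return idx;
            }
            idx = advance(idx, stride);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Cursor sized for 4 lanes, load of 8 lanes: the mismatch.
        System.out.println(firstOutOfBounds(4, 8)); // prints 252
        // Cursor and load both sized for 8 lanes.
        System.out.println(firstOutOfBounds(8, 8)); // prints -1
        // Cursor and load both sized for 4 lanes.
        System.out.println(firstOutOfBounds(4, 4)); // prints -1
    }
}
```

Under these assumptions the mismatched case first fails at index 252, which matches the index in the reported exception, while a cursor whose stride equals the number of loaded lanes never leaves the array.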
------------- Commit messages: - 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException Changes: https://git.openjdk.org/jdk/pull/22963/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22963&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8346954 Stats: 14 lines in 1 file changed: 0 ins; 9 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/22963.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22963/head:pull/22963 PR: https://git.openjdk.org/jdk/pull/22963 From duke at openjdk.org Wed Feb 5 23:36:00 2025 From: duke at openjdk.org (Nicole Xu) Date: Wed, 5 Feb 2025 23:36:00 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: On Wed, 8 Jan 2025 09:04:47 GMT, Nicole Xu wrote: > Suite MaskedLogicOpts.maskedLogicOperationsLong512() failed on both x86 and AArch64 with the following error: > > > java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249 > > > The variable `long256_arr_idx` is misused when indexing 'LongVector l2, l3, l4, l5' in function `maskedLogicOperationsLongKernel()`. 'long256_arr_idx' increases by 4 every time the benchmark runs and ensures the incremented value remains within the bounds of the array. However, for `LongVector.SPECIES_512`, it loads 8 numbers from the array each time the benchmark runs, resulting in an out-of-range indexing issue. > > Hence, we revised the index variables from `long256_arr_idx` to `long512_arr_idx`, which has a stride of 8, to ensure that the loaded vector is inside of the array boundary for all vector species. This is also consistent with other kernel functions. > > Additionally, some defined but unused variables have been removed. I am a member of Nvidia Java compiler team. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2577153564 From duke at openjdk.org Wed Feb 5 23:36:00 2025 From: duke at openjdk.org (Nicole Xu) Date: Wed, 5 Feb 2025 23:36:00 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: On Wed, 8 Jan 2025 09:08:55 GMT, Nicole Xu wrote: > I am a member of Nvidia Java compiler team. BTW, Nvidia has signed the OCA recently. Please help to check. Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2581667116 From epeter at openjdk.org Wed Feb 5 23:36:00 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 23:36:00 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: On Wed, 8 Jan 2025 09:04:47 GMT, Nicole Xu wrote: > Suite MaskedLogicOpts.maskedLogicOperationsLong512() failed on both x86 and AArch64 with the following error: > > > java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249 > > > The variable `long256_arr_idx` is misused when indexing 'LongVector l2, l3, l4, l5' in function `maskedLogicOperationsLongKernel()`. 'long256_arr_idx' increases by 4 every time the benchmark runs and ensures the incremented value remains within the bounds of the array. However, for `LongVector.SPECIES_512`, it loads 8 numbers from the array each time the benchmark runs, resulting in an out-of-range indexing issue. > > Hence, we revised the index variables from `long256_arr_idx` to `long512_arr_idx`, which has a stride of 8, to ensure that the loaded vector is inside of the array boundary for all vector species. This is also consistent with other kernel functions. > > Additionally, some defined but unused variables have been removed. @jatin-bhateja Could you have a look at these changes? You wrote the test originally. Oh, the OCA-verify is still stuck. 
I'm sorry about that. I pinged my manager @TobiHartmann, he will reach out to see what's the issue. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2597602723 PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2634806377 From epeter at openjdk.org Wed Feb 5 23:36:00 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 23:36:00 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: On Fri, 10 Jan 2025 03:25:01 GMT, Nicole Xu wrote: >> I am a member of Nvidia Java compiler team. > >> I am a member of Nvidia Java compiler team. > > BTW, Nvidia has signed the OCA recently. Please help to check. Thanks. @xyyNicole I see this has been in OCA-verify mode for 2 weeks. I reached out internally and hope this can go through soon. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2606527743 From liach at openjdk.org Wed Feb 5 23:45:25 2025 From: liach at openjdk.org (Chen Liang) Date: Wed, 5 Feb 2025 23:45:25 GMT Subject: RFR: 8349503: Consolidate multi-byte access into ByteArray Message-ID: `MethodHandles.byteArrayViewVarHandle` exposes checked multi-byte access to byte arrays via VarHandle. This larger access speeds up many operations, yet it cannot be used in early bootstrap, and as a result, people tend to use `Unsafe` which can threaten memory safety of the Java Platform. To promote the safe use of multi-byte access, I propose to move the checked implementations from VarHandle to ByteArray to allow earlier use and reduce maintenance costs. In addition, ByteArrayLittleEndian is consolidated, and now the access methods are distinguished by BO (byte order) / BE (big endian) / LE (little endian) suffixes to indicate their access features.
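For context, this is roughly what checked multi-byte access through the public `MethodHandles.byteArrayViewVarHandle` API looks like today. It is only a sketch: the class `ByteArrayViewSketch` and the helpers `getIntBE`, `getIntLE` and `setIntBE` are hypothetical names chosen to echo the BE/LE suffix convention mentioned above, and they are not the methods of the internal `ByteArray` class touched by this PR.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class ByteArrayViewSketch {
    // View a byte[] as a sequence of ints with an explicit byte order.
    static final VarHandle INT_BE =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.BIG_ENDIAN);
    static final VarHandle INT_LE =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    // Hypothetical helpers; plain get/set are bounds-checked and may be unaligned.
    static int getIntBE(byte[] array, int offset) {
        return (int) INT_BE.get(array, offset);
    }

    static int getIntLE(byte[] array, int offset) {
        return (int) INT_LE.get(array, offset);
    }

    static void setIntBE(byte[] array, int offset, int value) {
        INT_BE.set(array, offset, value);
    }

    public static void main(String[] args) {
        byte[] bytes = {0x12, 0x34, 0x56, 0x78, 0, 0};
        System.out.println(Integer.toHexString(getIntBE(bytes, 0))); // prints 12345678
        System.out.println(Integer.toHexString(getIntLE(bytes, 0))); // prints 78563412
        try {
            getIntBE(bytes, 3); // needs bytes 3..6, but the array has only 6 elements
        } catch (IndexOutOfBoundsException e) {
            System.out.println("out-of-bounds access was rejected");
        }
    }
}
```

The out-of-bounds read is rejected with an exception rather than touching memory outside the array, which is the safety property that raw `Unsafe` access gives up.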
------------- Commit messages: - Update bug id - copyright years and comments - Consolidate multi-byte io into ByteArray Changes: https://git.openjdk.org/jdk/pull/23478/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23478&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349503 Stats: 2042 lines in 17 files changed: 714 ins; 1134 del; 194 mod Patch: https://git.openjdk.org/jdk/pull/23478.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23478/head:pull/23478 PR: https://git.openjdk.org/jdk/pull/23478 From fyang at openjdk.org Thu Feb 6 02:43:12 2025 From: fyang at openjdk.org (Fei Yang) Date: Thu, 6 Feb 2025 02:43:12 GMT Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v4] In-Reply-To: <-aHCYC9iVc4eMZ3pMfiDpqaW-wGM_s3zRMiVBWoadCM=.910336cd-3be2-45b5-9874-63b71abf38f8@github.com> References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> <-aHCYC9iVc4eMZ3pMfiDpqaW-wGM_s3zRMiVBWoadCM=.910336cd-3be2-45b5-9874-63b71abf38f8@github.com> Message-ID: On Wed, 5 Feb 2025 12:38:06 GMT, Roberto Casta?eda Lozano wrote: >> G1 barriers can be safely elided from writes to newly allocated objects as long as no safepoint is taken between the allocation and the write. This changeset complements early G1 barrier elision (performed by the platform-independent phases of C2, and limited to writes immediately following allocations) with a more general elision pass done at a late stage. >> >> The late elision pass exploits that it runs at a stage where the relative order of memory accesses and safepoints cannot change anymore to elide barriers from initialization writes that do not immediately follow the corresponding allocation, e.g. in conditional initialization writes: >> >> >> o = new MyObject(); >> if (...) 
{ >> o.myField = ...; // barrier elided only after this changeset >> // (assuming no safepoint in the if condition) >> } >> >> >> or in initialization writes placed after exception-throwing checks: >> >> >> o = new MyObject(); >> if (...) { >> throw new Exception(""); >> } >> o.myField = ...; // barrier elided only after this changeset >> // (assuming no safepoint in the above if condition) >> >> >> These patterns are commonly found in Java code, e.g. in the core libraries: >> >> - [conditional initialization](https://github.com/openjdk/jdk/blob/25fecaaf87400af535c242fe50296f1f89ceeb16/src/java.base/share/classes/java/lang/String.java#L4850), or >> >> - [initialization after exception-throwing checks (in the superclass constructor)](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/nio/X-Buffer.java.template#L324). >> >> The optimization also enhances barrier elision for array initialization writes, for example eliding barriers from small array initialization loops (for which safepoints are not inserted): >> >> >> Object[] a = new Object[...]; >> for (int i = 0; i < a.length; i++) { >> a[i] = ...; // barrier elided only after this changeset >> } >> >> >> or eliding barriers from array initialization writes with unknown array index: >> >> >> Object[] a = new Object[...]; >> a[index] = ...; // barrier elided only after this changeset >> >> >> The logic used to perform this additional barrier elision is a subset of a pre-existing ZGC-specific optimization. This changeset simply reuses the relevant subset (barrier elision for writes to newly-allocated objects) by extracting the core of the optimization logic from `zBarrierSetC2.cpp` into the GC-shared file `barrierSetC2.cpp`. The functions `block_has_safepoint`, `block_inde... 
> > Roberto Casta?eda Lozano has updated the pull request incrementally with two additional commits since the last revision: > > - Add some more tests to exercise barrier elision for atomic operations > - Elide barriers from atomic operations on newly allocated objects as well FYI: hs-tier1 still test good on linux-riscv64 with fastdebug build. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23235#issuecomment-2638686602 From thartmann at openjdk.org Thu Feb 6 06:43:16 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 6 Feb 2025 06:43:16 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v9] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 10:11:36 GMT, Roland Westrelin wrote: >> To optimize a long counted loop and long range checks in a long or int >> counted loop, the loop is turned into a loop nest. When the loop has >> few iterations, the overhead of having an outer loop whose backedge is >> never taken, has a measurable cost. Furthermore, creating the loop >> nest usually causes one iteration of the loop to be peeled so >> predicates can be set up. If the loop is short running, then it's an >> extra iteration that's run with range checks (compared to an int >> counted loop with int range checks). >> >> This change doesn't create a loop nest when: >> >> 1- it can be determined statically at loop nest creation time that the >> loop runs for a short enough number of iterations >> >> 2- profiling reports that the loop runs for no more than ShortLoopIter >> iterations (1000 by default). >> >> For 2-, a guard is added which is implemented as yet another predicate. >> >> While this change is in principle simple, I ran into a few >> implementation issues: >> >> - while c2 has a way to compute the number of iterations of an int >> counted loop, it doesn't have that for long counted loop. 
The >> existing logic for int counted loops promotes values to long to >> avoid overflows. I reworked it so it now works for both long and int >> counted loops. >> >> - I added a new deoptimization reason (Reason_short_running_loop) for >> the new predicate. Given the number of iterations is narrowed down >> by the predicate, the limit of the loop after transformation is a >> cast node that's control dependent on the short running loop >> predicate. Because once the counted loop is transformed, it is >> likely that range check predicates will be inserted and they will >> depend on the limit, the short running loop predicate has to be the >> one that's further away from the loop entry. Now it is also possible >> that the limit before transformation depends on a predicate >> (TestShortRunningLongCountedLoopPredicatesClone is an example), we >> can have: new predicates inserted after the transformation that >> depend on the casted limit that itself depends on old predicates >> added before the transformation. To solve this circular dependency, >> parse and assert predicates are cloned between the old predicates >> and the loop head. The cloned short running loop parse predicate is >> the one that's used to insert the short running loop predicate. >> >> - In the case of a long counted loop, the loop is transformed into a >> regular loop with a ... > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 32 commits: > > - TestMemorySegment test fix > - test wip > - Merge branch 'master' into JDK-8342692 > - refactor > - Merge branch 'master' into JDK-8342692 > - Merge branch 'master' into JDK-8342692 > - Merge branch 'master' into JDK-8342692 > - Merge branch 'master' into JDK-8342692 > - review > - reviews > - ... and 22 more: https://git.openjdk.org/jdk/compare/3f1d9b57...7dd6fde9 Ah, right, I see that you already mentioned that above. Should we then problem list the test with this change?
Testing looks clean otherwise. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-2638963241 From thartmann at openjdk.org Thu Feb 6 07:22:23 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 6 Feb 2025 07:22:23 GMT Subject: RFR: 8349102: Test compiler/arguments/TestCodeEntryAlignment.java failed: assert(allocates2(pc)) failed: not in CodeBuffer memory In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 13:44:10 GMT, Andrew Dinn wrote: > assert(allocates2(pc)) failed: not in CodeBuffer memory > > The StubGenerator compiler blob runs out of space when TestCodeEntryAlignment is run on macos/x86_64 on an avx2-only CPU. This only happens in the worst case with command line options `-XX:CodeCacheSegmentSize=1024 -XX:CodeEntryAlignment=1024`. > > On linux/x86_64 the test succeeds in that worst case when run on an avx512-enabled CPU but with only 980 bytes of headroom. > > This patch increments the buffer size on x86_64 to ensure both the avx2 and avx3 cases have enough headroom. > > n.b. the increment has deliberately been made x86_64-specific rather than macos-specific, even though this problem manifests when testing MacOS and does not manifest when testing Linux. The disparity in generated stubs size actually relates to the capabilities of the CPU and is independent of OS. All tests passed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23439#issuecomment-2639021241 From chagedorn at openjdk.org Thu Feb 6 07:47:13 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 6 Feb 2025 07:47:13 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v13] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 17:37:34 GMT, Daniel Lundén wrote: >> That sounds about right, yes! Thanks for persisting here. I'm really looking forward to what you find
> > Interestingly, this is the line in `cfgnode.cpp` that blocks the MergeMem/Phi swap idealization: > > // This restriction is temporarily necessary to ensure termination: > if (!saw_self && adr_type() == TypePtr::BOTTOM) merge_width = 0; > > If I comment out the line it solves all the failures we have seen. I double-checked that we then perform the exact MergeMem/Phi swap idealizations discussed above. > > I am wondering what the proper solution is here. I will, of course, investigate if it is possible to loosen the restriction and still ensure termination. On the other hand, it also seems strange that the anti-dependence search is so sensitive to missing idealizations? Thanks for having yet another look at this! > If I comment out the line it solves all the failures we have seen. I double-checked that we then perform the exact MergeMem/Phi swap idealizations discussed above. That sounds promising! Looks like this temporary restriction became quite permanent - it's from initial load. I'm wondering if that is still necessary and if so if we have tests to catch that (we would probably hit the "infinite loop in IGVN" in that case). > I am wondering what the proper solution is here. I will, of course, investigate if it is possible to loosen the restriction and still ensure termination. On the other hand, it also seems strange that the anti-dependence search is so sensitive to missing idealizations? That would be great if we can get around this termination issue somehow - if it's still a problem. I think that is very unfortunate that we might be relying on this Ideal transformation to be applied to ensure correctness later on. If it's really required, we should at least make sure to add some verification code to catch this in debug builds. You could, for example, just turn what you have now into verification code, i.e. check that we cannot find another anti dependency edge with another search root. 
And/Or re-apply this particular transformation for each Phi node again in the end to see if we missed some swaps. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22852#discussion_r1944246463 From gcao at openjdk.org Thu Feb 6 08:52:09 2025 From: gcao at openjdk.org (Gui Cao) Date: Thu, 6 Feb 2025 08:52:09 GMT Subject: RFR: 8349428: RISC-V: "bad alignment" with -XX:-AvoidUnalignedAccesses after JDK-8347489 [v2] In-Reply-To: <3UFzISL6AR_wdZlxWIoBYohTg6Qa0Bgremw2cNHQ6Cg=.a2b7841c-76a0-4029-88e8-7e7095bae8d8@github.com> References: <3UFzISL6AR_wdZlxWIoBYohTg6Qa0Bgremw2cNHQ6Cg=.a2b7841c-76a0-4029-88e8-7e7095bae8d8@github.com> Message-ID: <_s9cGQeTjetjblLvzSRn7QFd088JOc-SQ9h_s-JYzHk=.fd42abbf-6e3c-4f64-84e8-b8e8e1e88486@github.com> On Wed, 5 Feb 2025 10:36:52 GMT, Gui Cao wrote: >> Hi, please review this small change fixing an assertion error. >> As the alignment of the loading addresses is only ensured under -XX:-AvoidUnalignedAccesses, we should only enable the related assertions about the alignment under this option. >> >> >> ### Testing >> - [x] Sanity tested with -XX:-AvoidUnalignedAccesses using fastdebug build. >> - [ ] Run tier1 tests on SOPHON SG2042 (fastdebug) > > Gui Cao has updated the pull request incrementally with one additional commit since the last revision: > > Fix build Thanks all for the review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23459#issuecomment-2639182678 From rcastanedalo at openjdk.org Thu Feb 6 08:58:40 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 6 Feb 2025 08:58:40 GMT Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v4] In-Reply-To: References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> <-aHCYC9iVc4eMZ3pMfiDpqaW-wGM_s3zRMiVBWoadCM=.910336cd-3be2-45b5-9874-63b71abf38f8@github.com> Message-ID: On Wed, 5 Feb 2025 15:06:39 GMT, Martin Doerr wrote: > LGTM.
TestG1BarrierGeneration.java has passed on ppc64le. I'll run more tests. Please remember updating the Copyright headers. Thanks for the reminder, updated in commit 3671f474. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23235#issuecomment-2639197888 From rcastanedalo at openjdk.org Thu Feb 6 08:49:28 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 6 Feb 2025 08:49:28 GMT Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v5] In-Reply-To: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> Message-ID: > G1 barriers can be safely elided from writes to newly allocated objects as long as no safepoint is taken between the allocation and the write. This changeset complements early G1 barrier elision (performed by the platform-independent phases of C2, and limited to writes immediately following allocations) with a more general elision pass done at a late stage. > > The late elision pass exploits that it runs at a stage where the relative order of memory accesses and safepoints cannot change anymore to elide barriers from initialization writes that do not immediately follow the corresponding allocation, e.g. in conditional initialization writes: > > > o = new MyObject(); > if (...) { > o.myField = ...; // barrier elided only after this changeset > // (assuming no safepoint in the if condition) > } > > > or in initialization writes placed after exception-throwing checks: > > > o = new MyObject(); > if (...) { > throw new Exception(""); > } > o.myField = ...; // barrier elided only after this changeset > // (assuming no safepoint in the above if condition) > > > These patterns are commonly found in Java code, e.g. 
in the core libraries: > > - [conditional initialization](https://github.com/openjdk/jdk/blob/25fecaaf87400af535c242fe50296f1f89ceeb16/src/java.base/share/classes/java/lang/String.java#L4850), or > > - [initialization after exception-throwing checks (in the superclass constructor)](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/nio/X-Buffer.java.template#L324). > > The optimization also enhances barrier elision for array initialization writes, for example eliding barriers from small array initialization loops (for which safepoints are not inserted): > > > Object[] a = new Object[...]; > for (int i = 0; i < a.length; i++) { > a[i] = ...; // barrier elided only after this changeset > } > > > or eliding barriers from array initialization writes with unknown array index: > > > Object[] a = new Object[...]; > a[index] = ...; // barrier elided only after this changeset > > > The logic used to perform this additional barrier elision is a subset of a pre-existing ZGC-specific optimization. This changeset simply reuses the relevant subset (barrier elision for writes to newly-allocated objects) by extracting the core of the optimization logic from `zBarrierSetC2.cpp` into the GC-shared file `barrierSetC2.cpp`. The functions `block_has_safepoint`, `block_index`, `look_through_node`, `is_{undefined|unknown|concrete}`, `get_base_and_offset`, `is_array... 
Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: Update copyright headers ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23235/files - new: https://git.openjdk.org/jdk/pull/23235/files/621a61cf..3671f474 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23235&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23235&range=03-04 Stats: 4 lines in 4 files changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23235.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23235/head:pull/23235 PR: https://git.openjdk.org/jdk/pull/23235 From duke at openjdk.org Thu Feb 6 08:52:10 2025 From: duke at openjdk.org (duke) Date: Thu, 6 Feb 2025 08:52:10 GMT Subject: RFR: 8349428: RISC-V: "bad alignment" with -XX:-AvoidUnalignedAccesses after JDK-8347489 [v2] In-Reply-To: <3UFzISL6AR_wdZlxWIoBYohTg6Qa0Bgremw2cNHQ6Cg=.a2b7841c-76a0-4029-88e8-7e7095bae8d8@github.com> References: <3UFzISL6AR_wdZlxWIoBYohTg6Qa0Bgremw2cNHQ6Cg=.a2b7841c-76a0-4029-88e8-7e7095bae8d8@github.com> Message-ID: On Wed, 5 Feb 2025 10:36:52 GMT, Gui Cao wrote: >> Hi, please review this small change fixing an assertion error. >> As the alignment of the loading addresses is only ensured under -XX:-AvoidUnalignedAccesses, we should only enable the related assertions about the alignment under this option. >> >> >> ### Testing >> - [x] Sanity tested with -XX:-AvoidUnalignedAccesses using fastdebug build. >> - [ ] Run tier1 tests on SOPHON SG2042 (fastdebug) > > Gui Cao has updated the pull request incrementally with one additional commit since the last revision: > > Fix build @zifeihan Your change (at version 71b6ecc8c59f0f1d6876da2236a6c83ce3d113bc) is now ready to be sponsored by a Committer.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23459#issuecomment-2639184792 From adinn at openjdk.org Thu Feb 6 09:14:06 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Thu, 6 Feb 2025 09:14:06 GMT Subject: RFR: 8349102: Test compiler/arguments/TestCodeEntryAlignment.java failed: assert(allocates2(pc)) failed: not in CodeBuffer memory In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 22:03:37 GMT, Dean Long wrote: >> assert(allocates2(pc)) failed: not in CodeBuffer memory >> >> The StubGenerator compiler blob runs out of space when TestCodeEntryAlignment is run on macos/x86_64 on an avx2-only CPU. This only happens in the worst case with command line options `-XX:CodeCacheSegmentSize=1024 -XX:CodeEntryAlignment=1024`. >> >> On linux/x86_64 the test succeeds in that worst case when run on an avx512-enabled CPU but with only 980 bytes of headroom. >> >> This patch increments the buffer size on x86_64 to ensure both the avx2 and avx3 cases have enough headroom. >> >> n.b. the increment has deliberately been made x86_64-specific rather than macos-specific, even though this problem manifests when testing MacOS and does not manifest when testing Linux. The disparity in generated stubs size actually relates to the capabilities of the CPU and is independent of OS. > > This seems fine, but it has always bothered me that we have to set and occasionally adjust these constant sizes at all. If we were multi-threaded when doing the allocations I guess it would make sense, but since we are single-threaded, why not use the entire available space and then give back whatever we don't use? @dean-long "why not use the entire available space and then give back whatever we don't use?" We could probably do something along those lines. I'll look into it. Meanwhile I'll integrate this fix.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23439#issuecomment-2639230735 From adinn at openjdk.org Thu Feb 6 09:18:16 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Thu, 6 Feb 2025 09:18:16 GMT Subject: Integrated: 8349102: Test compiler/arguments/TestCodeEntryAlignment.java failed: assert(allocates2(pc)) failed: not in CodeBuffer memory In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 13:44:10 GMT, Andrew Dinn wrote: > assert(allocates2(pc)) failed: not in CodeBuffer memory > > The StubGenerator compiler blob runs out of space when TestCodeEntryAlignment is run on macos/x86_64 on an avx2-only CPU. This only happens in the worst case with command line options `-XX:CodeCacheSegmentSize=1024 -XX:CodeEntryAlignment=1024`. > > On linux/x86_64 the test succeeds in that worst case when run on an avx512-enabled CPU but with only 980 bytes of headroom. > > This patch increments the buffer size on x86_64 to ensure both the avx2 and avx3 cases have enough headroom. > > n.b. the increment has deliberately been made x86_64-specific rather than macos-specific, even though this problem manifests when testing MacOS and does not manifest when testing Linux. The disparity in generated stubs size actually relates to the capabilities of the CPU and is independent of OS. This pull request has now been integrated.
Changeset: 7e307916 Author: Andrew Dinn URL: https://git.openjdk.org/jdk/commit/7e307916ecbf1ae9795e42e5b5a8347daad4af8c Stats: 3 lines in 2 files changed: 0 ins; 2 del; 1 mod 8349102: Test compiler/arguments/TestCodeEntryAlignment.java failed: assert(allocates2(pc)) failed: not in CodeBuffer memory Reviewed-by: dlong ------------- PR: https://git.openjdk.org/jdk/pull/23439 From gcao at openjdk.org Thu Feb 6 09:34:17 2025 From: gcao at openjdk.org (Gui Cao) Date: Thu, 6 Feb 2025 09:34:17 GMT Subject: Integrated: 8349428: RISC-V: "bad alignment" with -XX:-AvoidUnalignedAccesses after JDK-8347489 In-Reply-To: References: Message-ID: <1262ghBKQzqH4Ejv1vkSzHiJodFMNi1OeYuF1tlhC6c=.5f67d802-3590-4dfe-93e5-4f049c43eaca@github.com> On Wed, 5 Feb 2025 09:04:10 GMT, Gui Cao wrote: > Hi, please review this small change fixing an assertion error. > As the alignment of the loading addresses is only ensured under -XX:-AvoidUnalignedAccesses, we should only enable the related assertions about the alignment under this option. > > > ### Testing > - [x] Sanity tested with -XX:-AvoidUnalignedAccesses using fastdebug build. > - [ ] Run tier1 tests on SOPHON SG2042 (fastdebug) This pull request has now been integrated.
Changeset: d85f6514 Author: Gui Cao Committer: Hamlin Li URL: https://git.openjdk.org/jdk/commit/d85f65147aeb4009742bfe401c6070d920b71b3e Stats: 35 lines in 2 files changed: 14 ins; 0 del; 21 mod 8349428: RISC-V: "bad alignment" with -XX:-AvoidUnalignedAccesses after JDK-8347489 Reviewed-by: fyang, mli ------------- PR: https://git.openjdk.org/jdk/pull/23459 From mdoerr at openjdk.org Thu Feb 6 10:13:28 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 6 Feb 2025 10:13:28 GMT Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v5] In-Reply-To: References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> Message-ID: On Thu, 6 Feb 2025 08:49:28 GMT, Roberto Casta?eda Lozano wrote: >> G1 barriers can be safely elided from writes to newly allocated objects as long as no safepoint is taken between the allocation and the write. This changeset complements early G1 barrier elision (performed by the platform-independent phases of C2, and limited to writes immediately following allocations) with a more general elision pass done at a late stage. >> >> The late elision pass exploits that it runs at a stage where the relative order of memory accesses and safepoints cannot change anymore to elide barriers from initialization writes that do not immediately follow the corresponding allocation, e.g. in conditional initialization writes: >> >> >> o = new MyObject(); >> if (...) { >> o.myField = ...; // barrier elided only after this changeset >> // (assuming no safepoint in the if condition) >> } >> >> >> or in initialization writes placed after exception-throwing checks: >> >> >> o = new MyObject(); >> if (...) { >> throw new Exception(""); >> } >> o.myField = ...; // barrier elided only after this changeset >> // (assuming no safepoint in the above if condition) >> >> >> These patterns are commonly found in Java code, e.g. 
in the core libraries: >> >> - [conditional initialization](https://github.com/openjdk/jdk/blob/25fecaaf87400af535c242fe50296f1f89ceeb16/src/java.base/share/classes/java/lang/String.java#L4850), or >> >> - [initialization after exception-throwing checks (in the superclass constructor)](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/nio/X-Buffer.java.template#L324). >> >> The optimization also enhances barrier elision for array initialization writes, for example eliding barriers from small array initialization loops (for which safepoints are not inserted): >> >> >> Object[] a = new Object[...]; >> for (int i = 0; i < a.length; i++) { >> a[i] = ...; // barrier elided only after this changeset >> } >> >> >> or eliding barriers from array initialization writes with unknown array index: >> >> >> Object[] a = new Object[...]; >> a[index] = ...; // barrier elided only after this changeset >> >> >> The logic used to perform this additional barrier elision is a subset of a pre-existing ZGC-specific optimization. This changeset simply reuses the relevant subset (barrier elision for writes to newly-allocated objects) by extracting the core of the optimization logic from `zBarrierSetC2.cpp` into the GC-shared file `barrierSetC2.cpp`. The functions `block_has_safepoint`, `block_inde... > > Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Update copyright headers Code and test results look good. ------------- Marked as reviewed by mdoerr (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23235#pullrequestreview-2598222122 From galder at openjdk.org Thu Feb 6 11:33:14 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Thu, 6 Feb 2025 11:33:14 GMT Subject: RFR: 8341976: C2: use_mem_state != load->find_exact_control(load->in(0)) assert failure In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 15:37:23 GMT, Roland Westrelin wrote: > The `arraycopy` writes to a non escaping array so its `ArrayCopy` node > is marked as having a narrow memory effect. One of the loads from the > destination after the copy is transformed into a load from the source > array (the rationale being that if there's no load from the > destination of the copy, the `arraycopy` is not needed). The load from > the source has the input memory state of the `ArrayCopy` as memory > input. That load is then sunk out of the loop and its control is > updated to be after the `ArrayCopy`. That's legal because the > `ArrayCopy` only has a narrow memory effect and can't modify the > source. The `ArrayCopy` can't be eliminated and is expanded. In the > process, a `MemBar` that has a wide memory effect is added. The load > from the source has control after the membar but memory state before > and because the membar has a wide memory effect, the load is anti > dependent on the membar: the graph is broken (the load can't be pinned > after the membar and anti dependent on it). > > In short, the problem is that the graph is transformed under the > assumption that the `ArrayCopy` has a narrow effect but the > `ArrayCopy` is expanded to a subgraph that has a wide memory > effect. The fix I propose is to not insert a membar with a wide memory > effect. We still need a membar when the destination is non escaping > because the expanded `ArrayCopy`, if it writes to a tightly allocated > array, writes to raw memory and not to the destination memory slice.
src/hotspot/share/opto/macroArrayCopy.cpp line 831: > 829: insert_mem_bar(ctrl, &out_mem, Op_MemBarStoreStore, Compile::AliasIdxBot); > 830: } else { > 831: int alias_idx = Compile::AliasIdxBot; Minor thing, `alias_idx` is already defined in the method with a different type. Would it make sense to use a different name here? Earlier definition: uint alias_idx = C->get_alias_index(adr_type); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23465#discussion_r1944562243 From galder at openjdk.org Thu Feb 6 11:52:11 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Thu, 6 Feb 2025 11:52:11 GMT Subject: RFR: 8349479: C2: when a Type node becomes dead, make CFG path that uses it unreachable In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 16:42:02 GMT, Roland Westrelin wrote: > This is primarily motivated by 8275202 (C2: optimize out more > redundant conditions). In the following code snippet: > > > int[] array = new int[arraySize]; > if (j <= arraySize) { > if (i >= 0) { > if (i < j) { > int v = array[i]; > > > (`arraySize` is a constant) > > at the range check, `j` is known to be in `[min, arraySize]`; as a > consequence, `i` is known to be in `[0, arraySize-1]`. The range check > can be eliminated. > > Now, if later, `i` constant folds to some value that's positive but > out of range for the array: > > - if that happens when the new pass runs, then it can prove that: > > if (i < j) { > > is never taken. > > - if that happens during IGVN or CCP however, that condition is not > constant folded. And because the range check was removed, there's no > guard protecting the range check `CastII`. It becomes `top` and, as > a result, the graph can become broken. > > What I propose here is that when the `CastII` becomes dead, any CFG > path that uses the `CastII` node is made unreachable.
So in pseudo code: > > > int[] array = new int[arraySize]; > if (j <= arraySize) { > if (i >= 0) { > if (i < j) { > halt(); > > > Finding the CFG paths is implemented in the patch by following the > uses of the node until a CFG node or a `Phi` is encountered. > > The patch applies this to all `Type` nodes as, with 8275202, I also ran > into some rare corner cases with other types of nodes. The exception is > `Phi` nodes which may not be as easy to handle (and for which I had no > issue with 8275202). > > Finally, the patch includes a test case that's unrelated to the > discussion of 8275202 above. In that test case, a `CastII` becomes top > but the test that guards it doesn't constant fold. The root cause is a > transformation of: > > > (CastII (AddI > > > into > > > (AddI (CastII ) (CastII) > > > which causes the resulting node to have a wider type. The `CastII` > captures a type before the transformation above happens. Once it has > happened, the guard for the `CastII` can't be constant folded when an > out of bound value occurs. > > This is likely fixable some other way (even though it doesn't seem > straightforward). Given the long history of similar issues (and the > test case that shows that there are more hiding), I think it would > make sense to try some other way of approaching them. src/hotspot/share/opto/node.cpp line 3076: > 3074: assert(r->is_Region() || r->is_top(), "unexpected Phi's control"); > 3075: if (r->is_Region()) { > 3076: for (uint k = 1; k < u->req(); ++k) { `k` already defined as `DUIterator_Fast` earlier, can we choose a different name?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23468#discussion_r1944583185 From bulasevich at openjdk.org Thu Feb 6 12:14:19 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 6 Feb 2025 12:14:19 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: On Mon, 3 Feb 2025 12:43:35 GMT, Stefan Karlsson wrote: >> Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: >> >> Force the use of movk in combination with adrp and ldr instructions to address scenarios >> where os::malloc allocates buffers beyond the typical ±4GB range accessible with adrp > > src/hotspot/share/code/nmethod.cpp line 2162: > >> 2160: return nullptr; >> 2161: } >> 2162: return RawAccess<>::oop_load(oop_addr_at(index)); > > This change is removing the GC barriers and is likely the cause of the ZGC crash that Tobias listed. > > However, the fix is not as simple as to just reinstate the NMethodAccess call. The ZGC code uses the `oop*` to find the associated `nmethod` in the code cache. We need another way to fetch the nmethod now. So, I'm experimenting with a small change to switch out the Access API call to a direct GC barrier set call and then I pass down the `this` pointer from this function. With that you should be able to skip this change. > > With that said, what was the motivation for changing this? Right, I remember I encountered a "did not find nmethod" assertion in ZNMethod::load_oop. With my change the method fails to find nmethod by oop* which is supposed to be within the boundaries of nmethod. Based on naming semantics, I thought that if the oop wasn't in the nmethod, switching from NMethodAccess to RawAccess was acceptable. In reality, this bypasses the essential GC barrier.
Given that the nmethod is known in a call stack above, I think we should just pass it down instead of attempting a lookup. call stack: - ZNMethod::load_oop(oop*, unsigned long) - AccessInternal::PostRuntimeDispatch, ., .>::oop_access_barrier(void*) - AccessInternal::RuntimeDispatch<1122372ul, oop, (AccessInternal::BarrierType)2>::load_init(void*) - nmethod::oop_at_phantom(int) oop ZNMethod::load_oop(oop* p, DecoratorSet decorators) { nmethod* const nm = CodeCache::find_nmethod((void*)p); assert(nm != nullptr, "did not find nmethod"); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1944617160 From rgiulietti at openjdk.org Thu Feb 6 12:16:11 2025 From: rgiulietti at openjdk.org (Raffaello Giulietti) Date: Thu, 6 Feb 2025 12:16:11 GMT Subject: RFR: 8349503: Consolidate multi-byte access into ByteArray In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 23:41:19 GMT, Chen Liang wrote: > `MethodHandles.byteArrayViewVarHandle` exposes checked multi-byte access to byte arrays via VarHandle. This larger access speeds up many operations, yet it cannot be used in early bootstrap, and as a result, people tend to use `Unsafe` which can threaten memory safety of the Java Platform. > > To promote the safe use of multi-byte access, I propose to move the checked implementations from VarHandle to ByteArray to allow earlier use and reduce maintenance costs. In addition, ByteArrayLittleEndian is consolidated, and now the access methods are distinguished by BO (byte order) / BE (big endian) / LE (little endian) suffixes to indicate their access features. What about dropping "BE" from all big-endian method names? This would reduce the number of files to review in `java.io` to 0 (admittedly, it's a rather mechanical review). I know this would be less symmetrical, but...
src/java.base/share/classes/jdk/internal/util/ByteArray.java line 53: > 51: > 52: public static char getCharBO(byte[] array, int index, boolean big) { > 53: Preconditions.checkIndex(index, array.length - Character.BYTES + 1, Preconditions.AIOOBE_FORMATTER); Suggestion: Preconditions.checkIndex(index, array.length - (Character.BYTES - 1), Preconditions.AIOOBE_FORMATTER); Similarly for all cases below. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23478#issuecomment-2639657316 PR Review Comment: https://git.openjdk.org/jdk/pull/23478#discussion_r1944571743 From amitkumar at openjdk.org Thu Feb 6 13:18:15 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 6 Feb 2025 13:18:15 GMT Subject: RFR: 8348520: [s390x] Problemlist TestVectorReinterpret.java In-Reply-To: References: Message-ID: On Fri, 24 Jan 2025 03:58:12 GMT, Amit Kumar wrote: > Problem listing TestVectorReinterpret.java on s390x. Thanks Martin for approval. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23288#issuecomment-2639797588 From amitkumar at openjdk.org Thu Feb 6 13:18:16 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 6 Feb 2025 13:18:16 GMT Subject: Integrated: 8348520: [s390x] Problemlist TestVectorReinterpret.java In-Reply-To: References: Message-ID: On Fri, 24 Jan 2025 03:58:12 GMT, Amit Kumar wrote: > Problem listing TestVectorReinterpret.java on s390x. This pull request has now been integrated. 
Changeset: dd8720e9 Author: Amit Kumar URL: https://git.openjdk.org/jdk/commit/dd8720e90dc5475afd4ccc7321bb5cd97282e101 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8348520: [s390x] Problemlist TestVectorReinterpret.java Reviewed-by: mdoerr ------------- PR: https://git.openjdk.org/jdk/pull/23288 From stefank at openjdk.org Thu Feb 6 13:29:15 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Thu, 6 Feb 2025 13:29:15 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: On Thu, 6 Feb 2025 12:12:02 GMT, Boris Ulasevich wrote: >> src/hotspot/share/code/nmethod.cpp line 2162: >> >>> 2160: return nullptr; >>> 2161: } >>> 2162: return RawAccess<>::oop_load(oop_addr_at(index)); >> >> This change is removing the GC barriers and is likely the cause of the ZGC crash that Tobias listed. >> >> However, the fix is not as simple as to just reinstate the NMethodAccess call. The ZGC code uses the `oop*` to find the associated `nmethod` in the code cache. We need another way to fetch the nmethod now. So, I'm experimenting with a small change to switch out the Access API call to a direct GC barrier set call and then I pass down the `this` pointer from this function. With that you should be able skip this change. >> >> With that said, what was the motivation for changing this? > > Right, I remember I encountered a "did not find nmethod" assertion in ZNMethod::load_oop. With my change the method fails to find nmethod by oop* which is supposed to be within the boundaries of nmethod. Based on naming semantics, I thought that if the oop wasn't in the nmethod, switching from NMethodAccess to RawAccess was acceptable. In reality, this bypasses the essential GC barrier. > > Given that the nmethod is known in a call stack above, I think we should just pass it down instead of attempting a lookup. 
> > > call stack: > - ZNMethod::load_oop(oop*, unsigned long) > - AccessInternal::PostRuntimeDispatch, ., .>::oop_access_barrier(void*) > - AccessInternal::RuntimeDispatch<1122372ul, oop, (AccessInternal::BarrierType)2>::load_init(void*) > - nmethod::oop_at_phantom(int) > > oop ZNMethod::load_oop(oop* p, DecoratorSet decorators) { > nmethod* const nm = CodeCache::find_nmethod((void*)p); > assert(nm != nullptr, "did not find nmethod"); Right, I created a patch to do exactly this and based it upon your PR branch. Take a look at my commit in: https://github.com/stefank/jdk/tree/rewire_nmethod_oop_loads I can try to get this upstreamed after I've done enough testing. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1944723712 From rgiulietti at openjdk.org Thu Feb 6 14:09:13 2025 From: rgiulietti at openjdk.org (Raffaello Giulietti) Date: Thu, 6 Feb 2025 14:09:13 GMT Subject: RFR: 8349503: Consolidate multi-byte access into ByteArray In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 23:41:19 GMT, Chen Liang wrote: > `MethodHandles.byteArrayViewVarHandle` exposes checked multi-byte access to byte arrays via VarHandle. This larger access speeds up many operations, yet it cannot be used in early bootstrap, and as a result, people tend to use `Unsafe` which can threaten memory safety of the Java Platform. > > To promote the safe use of multi-byte access, I propose to move the checked implementations from VarHandle to ByteArray to allow earlier use and reduce maintenance costs. In addition, ByteArrayLittleEndian is consolidated, and now the access methods are distinguished by BO (byte order) / BE (big endian) / LE (little endian) suffixes to indicate their access features. 
test/jdk/jdk/internal/util/ByteArray/Types.java line 27: > 25: * @test > 26: * @bug 8349503 > 27: * @library /test/lib Suggestion: * @library /test/lib * @key randomness test/jdk/jdk/internal/util/ByteArray/Types.java line 64: > 62: new ReadCase<>("u2", ByteArray::getUnsignedShortBO, ByteArray::getUnsignedShortBE, ByteArray::getUnsignedShortLE, 2, u2 -> ((u2 >> Byte.SIZE) & 0xFF) | ((u2 << Byte.SIZE) & 0xFF00), Comparator.naturalOrder()), > 63: new ReadCase<>("int", ByteArray::getIntBO, ByteArray::getIntBE, ByteArray::getIntLE, Integer.BYTES, Integer::reverseBytes, Comparator.naturalOrder()), > 64: new ReadCase<>("float", ByteArray::getFloatBO, ByteArray::getFloatBE, ByteArray::getFloatLE, Float.BYTES, null, Comparator.comparing(Float::floatToRawIntBits)), Would it be possible to have a local `reverseBytes` for `float` and `double` as well? test/jdk/jdk/internal/util/ByteArray/Types.java line 124: > 122: new WriteCase<>("int", ByteArray::setIntBO, ByteArray::setIntBE, ByteArray::setIntLE, Integer.BYTES, List.of(42)), > 123: new WriteCase<>("float", ByteArray::setFloatBO, ByteArray::setFloatBE, ByteArray::setFloatLE, Float.BYTES, List.of(Float.NaN, Float.intBitsToFloat(0x7FF23847))), > 124: new WriteCase<>("float raw", ByteArray::setFloatRawBO, ByteArray::setFloatRawBE, ByteArray::setFloatRawLE, Float.BYTES, List.of(1.0F)), Raw seems to be exercised only on unproblematic cases, not on NaNs. Similarly for "double raw". test/jdk/jdk/internal/util/ByteArray/Types.java line 152: > 150: assertThrows(IndexOutOfBoundsException.class, () -> leWriter.set(arr, arrayLen - size + 1, value)); > 151: > 152: int index = 0; This is always 0. test/jdk/jdk/internal/util/ByteArray/Types.java line 173: > 171: var arrBe1 = arr.clone(); > 172: beWriter.set(arrBe1, index, v1); > 173: assertArrayEquals(arrBe, arrBe1); What about the little-endian case? 
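The review points above about the index check and about a byte-reversal helper for `float` and `double` can be illustrated with a minimal sketch. It is not the JDK's internal `ByteArray`: the class name `ByteOrderSketch` and its methods are hypothetical, and the public `Objects.checkIndex` stands in for the internal `Preconditions.checkIndex`.

```java
import java.util.Objects;

/** Hypothetical stand-in for checked multi-byte access; not jdk.internal.util.ByteArray. */
final class ByteOrderSketch {
    private ByteOrderSketch() {}

    /**
     * Reads four bytes starting at index as an int.
     * bigEndian selects the byte order, mirroring the "BO" flavour discussed in the review.
     */
    static int getInt(byte[] array, int index, boolean bigEndian) {
        // Valid start indices are 0 .. array.length - Integer.BYTES, so the
        // exclusive upper bound is array.length - (Integer.BYTES - 1).
        Objects.checkIndex(index, array.length - (Integer.BYTES - 1));
        int bigEndianValue = ((array[index] & 0xFF) << 24)
                | ((array[index + 1] & 0xFF) << 16)
                | ((array[index + 2] & 0xFF) << 8)
                | (array[index + 3] & 0xFF);
        return bigEndian ? bigEndianValue : Integer.reverseBytes(bigEndianValue);
    }

    /** Reverses the byte order of a float by going through its raw bit pattern. */
    static float reverseBytes(float value) {
        return Float.intBitsToFloat(Integer.reverseBytes(Float.floatToRawIntBits(value)));
    }

    /** Reverses the byte order of a double by going through its raw bit pattern. */
    static double reverseBytes(double value) {
        return Double.longBitsToDouble(Long.reverseBytes(Double.doubleToRawLongBits(value)));
    }
}
```

Reversing the bytes of an ordinary value can yield a NaN bit pattern, so results are best compared through `Float.floatToRawIntBits` or `Double.doubleToRawLongBits` rather than with `==`.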
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23478#discussion_r1944742266 PR Review Comment: https://git.openjdk.org/jdk/pull/23478#discussion_r1944786100 PR Review Comment: https://git.openjdk.org/jdk/pull/23478#discussion_r1944783068 PR Review Comment: https://git.openjdk.org/jdk/pull/23478#discussion_r1944769063 PR Review Comment: https://git.openjdk.org/jdk/pull/23478#discussion_r1944772926 From thartmann at openjdk.org Thu Feb 6 14:26:11 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 6 Feb 2025 14:26:11 GMT Subject: RFR: 8346777: Add missing const declarations and rename variables [v2] In-Reply-To: References: Message-ID: <7uE_c_7deeXzcB1m5q6-Kf1j_wJvJRLBjSsQfTRtbE4=.93b2e3d0-7347-448b-944f-f990c8db1245@github.com> On Wed, 5 Feb 2025 08:26:42 GMT, Christian Hagedorn wrote: >> This simple patch adds some missing `const` and applies variable renamings and parameter reorderings >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > update Marked as reviewed by thartmann (Reviewer). Looks good. ------------- PR Review: https://git.openjdk.org/jdk/pull/23434#pullrequestreview-2598832317 PR Comment: https://git.openjdk.org/jdk/pull/23434#issuecomment-2639968867 From pminborg at openjdk.org Thu Feb 6 14:28:11 2025 From: pminborg at openjdk.org (Per Minborg) Date: Thu, 6 Feb 2025 14:28:11 GMT Subject: RFR: 8349503: Consolidate multi-byte access into ByteArray In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 23:41:19 GMT, Chen Liang wrote: > `MethodHandles.byteArrayViewVarHandle` exposes checked multi-byte access to byte arrays via VarHandle. This larger access speeds up many operations, yet it cannot be used in early bootstrap, and as a result, people tend to use `Unsafe` which can threaten memory safety of the Java Platform. 
> > To promote the safe use of multi-byte access, I propose to move the checked implementations from VarHandle to ByteArray to allow earlier use and reduce maintenance costs. In addition, ByteArrayLittleEndian is consolidated, and now the access methods are distinguished by BO (byte order) / BE (big endian) / LE (little endian) suffixes to indicate their access features. src/java.base/share/classes/jdk/internal/util/ByteArray.java line 57: > 55: } > 56: > 57: public static short getShortBO(byte[] array, int index, boolean big) { If we have methods `getShortBE` and `getShortLE` then perhaps this method should just be called `getShort`. src/java.base/share/classes/jdk/internal/util/ByteArray.java line 62: > 60: } > 61: > 62: public static int getIntBO(byte[] array, int index, boolean big) { I suggest to rename `big` to `bigEndian` to use the same naming conventions as in `Unsafe`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23478#discussion_r1944819768 PR Review Comment: https://git.openjdk.org/jdk/pull/23478#discussion_r1944817837 From psandoz at openjdk.org Thu Feb 6 14:33:11 2025 From: psandoz at openjdk.org (Paul Sandoz) Date: Thu, 6 Feb 2025 14:33:11 GMT Subject: RFR: 8349503: Consolidate multi-byte access into ByteArray In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 23:41:19 GMT, Chen Liang wrote: > `MethodHandles.byteArrayViewVarHandle` exposes checked multi-byte access to byte arrays via VarHandle. This larger access speeds up many operations, yet it cannot be used in early bootstrap, and as a result, people tend to use `Unsafe` which can threaten memory safety of the Java Platform. > > To promote the safe use of multi-byte access, I propose to move the checked implementations from VarHandle to ByteArray to allow earlier use and reduce maintenance costs. 
In addition, ByteArrayLittleEndian is consolidated, and now the access methods are distinguished by BO (byte order) / BE (big endian) / LE (little endian) suffixes to indicate their access features. This looks reasonable, though I have not looked at all of it in detail. Although I understand the motivation, can you please revert the changes to VarHandle? VarHandle is intended to be as thin as possible a safe wrapper around unsafe. This PR changes how this is for one particular kind of handle, thus differing from the others, and it also creates an indirection with a potential independently moving part. Also, some guidance in Java on ByteArray would be useful on when to use it, e.g., enumerating the cases such as early execution of the JVM etc. ------------- PR Review: https://git.openjdk.org/jdk/pull/23478#pullrequestreview-2598852813 From pminborg at openjdk.org Thu Feb 6 14:33:12 2025 From: pminborg at openjdk.org (Per Minborg) Date: Thu, 6 Feb 2025 14:33:12 GMT Subject: RFR: 8349503: Consolidate multi-byte access into ByteArray In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 23:41:19 GMT, Chen Liang wrote: > `MethodHandles.byteArrayViewVarHandle` exposes checked multi-byte access to byte arrays via VarHandle. This larger access speeds up many operations, yet it cannot be used in early bootstrap, and as a result, people tend to use `Unsafe` which can threaten memory safety of the Java Platform. > > To promote the safe use of multi-byte access, I propose to move the checked implementations from VarHandle to ByteArray to allow earlier use and reduce maintenance costs.
src/java.base/share/classes/jdk/internal/util/ByteArray.java line 131: > 129: // BE methods > 130: > 131: public static char getCharBE(byte[] array, int offset) { I think it is worth making the effort to document these methods as they are used across the JDK. We could just take the docs from the old class and modify it slightly. src/java.desktop/share/classes/javax/imageio/stream/ImageInputStreamImpl.java line 245: > 243: throw new EOFException(); > 244: } > 245: return (byteOrder == ByteOrder.BIG_ENDIAN) This could just be `ByteArray.getShortBO(byteBuff, 0, byteOrder == ByteOrder.BIG_ENDIAN)`. Same for the others. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23478#discussion_r1944825706 PR Review Comment: https://git.openjdk.org/jdk/pull/23478#discussion_r1944828604 From pminborg at openjdk.org Thu Feb 6 14:36:14 2025 From: pminborg at openjdk.org (Per Minborg) Date: Thu, 6 Feb 2025 14:36:14 GMT Subject: RFR: 8349503: Consolidate multi-byte access into ByteArray In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 23:41:19 GMT, Chen Liang wrote: > `MethodHandles.byteArrayViewVarHandle` exposes checked multi-byte access to byte arrays via VarHandle. This larger access speeds up many operations, yet it cannot be used in early bootstrap, and as a result, people tend to use `Unsafe` which can threaten memory safety of the Java Platform. > > To promote the safe use of multi-byte access, I propose to move the checked implementations from VarHandle to ByteArray to allow earlier use and reduce maintenance costs. In addition, ByteArrayLittleEndian is consolidated, and now the access methods are distinguished by BO (byte order) / BE (big endian) / LE (little endian) suffixes to indicate their access features. 
test/jdk/jdk/internal/util/ByteArray/Types.java line 82: > 80: byte[] arr = new byte[arrayLen]; > 81: > 82: assertThrows(NullPointerException.class, () -> orderedReader.get(null, 0, true)); I suggest breaking out the invariant tests in a separate test. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23478#discussion_r1944834882 From rriggs at openjdk.org Thu Feb 6 14:39:11 2025 From: rriggs at openjdk.org (Roger Riggs) Date: Thu, 6 Feb 2025 14:39:11 GMT Subject: RFR: 8349503: Consolidate multi-byte access into ByteArray In-Reply-To: References: Message-ID: <0DK9LJKJvNPIUaWuV8y8U6t6W0YmmVG7VqyZlpr_q2Y=.b7e801fa-070a-40ce-ac66-a266544b6425@github.com> On Wed, 5 Feb 2025 23:41:19 GMT, Chen Liang wrote: > `MethodHandles.byteArrayViewVarHandle` exposes checked multi-byte access to byte arrays via VarHandle. This larger access speeds up many operations, yet it cannot be used in early bootstrap, and as a result, people tend to use `Unsafe` which can threaten memory safety of the Java Platform. > > To promote the safe use of multi-byte access, I propose to move the checked implementations from VarHandle to ByteArray to allow earlier use and reduce maintenance costs. In addition, ByteArrayLittleEndian is consolidated, and now the access methods are distinguished by BO (byte order) / BE (big endian) / LE (little endian) suffixes to indicate their access features. It would have been useful to get agreement on the concept and the naming before committing to the implementation. The BE/BO/LE are noise in the API. The little endian cases are a minority and should attract more attention in the API. The network byte-order/big-endian cases should keep the simple names. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23478#issuecomment-2640004192 From chagedorn at openjdk.org Thu Feb 6 14:52:14 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 6 Feb 2025 14:52:14 GMT Subject: RFR: 8346777: Add missing const declarations and rename variables [v2] In-Reply-To: References: Message-ID: <9sO_PVVYIJMPtAjJ4ey2rbku5N9Vr6jwxGlH-LRV9h0=.cd57877a-c22f-47bf-8f1c-b4af5fc86ca4@github.com> On Wed, 5 Feb 2025 08:26:42 GMT, Christian Hagedorn wrote: >> This simple patch adds some missing `const` and applies variable renamings and parameter reorderings >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > update Thanks Tobias! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23434#issuecomment-2640038059 From chagedorn at openjdk.org Thu Feb 6 14:52:15 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 6 Feb 2025 14:52:15 GMT Subject: Integrated: 8346777: Add missing const declarations and rename variables In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 09:43:12 GMT, Christian Hagedorn wrote: > This simple patch adds some missing `const` and applies variable renamings and parameter reorderings > > Thanks, > Christian This pull request has now been integrated. Changeset: e0487c7c Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/e0487c7cbc16fdfe26d22f2b6e65bca7d4398252 Stats: 65 lines in 2 files changed: 1 ins; 0 del; 64 mod 8346777: Add missing const declarations and rename variables Reviewed-by: epeter, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/23434 From mli at openjdk.org Thu Feb 6 16:03:21 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 6 Feb 2025 16:03:21 GMT Subject: RFR: 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison Message-ID: Hi, Can you help to review the patch? 
It tries to improve string comparison when AvoidUnalignedAccesses == false and the encoding is LU or UL (i.e. the two strings have different encodings). The jmh test shows that with `-CompactObjectHeaders` (i.e. -COH) and `-AvoidUnalignedAccesses`, the patch brings much better performance, and in the other cases it does not bring an obvious regression. And currently the default is -COH.

Thanks

### Performance

-COH-AvoidUnalignedAccesses

"-COH-Avoid" | (delta) | (size) | (utf16) | Mode | Cnt | Score - master | Score - patch | Error | Units | Improvement
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 24 | N/A | avgt | 10 | 6438443.073 | 6383881.891 | 36912.539 | ns/op | 0.009
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 36 | N/A | avgt | 10 | 9421176.34 | 9390907.1 | 21034.266 | ns/op | 0.003
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 72 | N/A | avgt | 10 | 18592342.33 | 16871350.38 | 15550.827 | ns/op | 0.102
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 128 | N/A | avgt | 10 | 30916157.05 | 28646961.11 | 9263.556 | ns/op | 0.079
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 256 | N/A | avgt | 10 | 58945069.17 | 55505097.77 | 8803.847 | ns/op | 0.062
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 512 | N/A | avgt | 10 | 115520355.5 | 110233842.8 | 35056.972 | ns/op | 0.048
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 24 | N/A | avgt | 10 | 7541299.83 | 7481385.995 | 43240.713 | ns/op | 0.008
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 36 | N/A | avgt | 10 | 10295051.77 | 10264978.04 | 38938.956 | ns/op | 0.003
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 72 | N/A | avgt | 10 | 19652419.64 | 17953481.41 | 10987.17 | ns/op | 0.095
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 128 | N/A | avgt | 10 | 32078969.29 | 29532674.32 | 26556.277 | ns/op | 0.086
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 256 | N/A | avgt | 10 | 60265969.7 | 56586871.09 | 137156.033 | ns/op | 0.065
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 512 | N/A | avgt | 10 | 116892685.5 | 111266086.7 | 106542.956 | ns/op | 0.051

+COH-AvoidUnalignedAccesses

"+COH-Avoid" | (delta) | (size) | (utf16) | Mode | Cnt | Score - master | Score - patch | Error | Units | Improvement
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 24 | N/A | avgt | 10 | 6384145.1 | 6386249.278 | 1018.737 | ns/op | 0
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 36 | N/A | avgt | 10 | 9484748.389 | 9437921.504 | 25184.962 | ns/op | 0.005
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 72 | N/A | avgt | 10 | 18648808.89 | 17610658.36 | 3613.01 | ns/op | 0.059
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 128 | N/A | avgt | 10 | 31130870.72 | 30542644.4 | 136527.281 | ns/op | 0.019
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 256 | N/A | avgt | 10 | 58943258.88 | 59678929.13 | 9247.37 | ns/op | -0.012
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 512 | N/A | avgt | 10 | 115515196.6 | 118941406.1 | 30104.94 | ns/op | -0.029
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 24 | N/A | avgt | 10 | 7640264.164 | 7691120.134 | 8941.255 | ns/op | -0.007
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 36 | N/A | avgt | 10 | 10568117.17 | 10540680.16 | 11225.744 | ns/op | 0.003
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 72 | N/A | avgt | 10 | 19727550.52 | 18982109.53 | 8609.764 | ns/op | 0.039
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 128 | N/A | avgt | 10 | 32125836.11 | 31006763.12 | 13915.357 | ns/op | 0.036
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 256 | N/A | avgt | 10 | 60202330.54 | 59561746.17 | 15817.383 | ns/op | 0.011
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 512 | N/A | avgt | 10 | 116636087.8 | 117264697.5 | 19632.92 | ns/op | -0.005

-COH+AvoidUnalignedAccesses

"-COH+Avoid" | (delta) | (size) | (utf16) | Mode | Cnt | Score - master | Score - patch | Error | Units | Improvement
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 24 | N/A | avgt | 10 | 7885819.391 | 7886391.597 | 1238.502 | ns/op | 0
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 36 | N/A | avgt | 10 | 10889126.89 | 10891175.03 | 1399.179 | ns/op | 0
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 72 | N/A | avgt | 10 | 16897052.9 | 16774972.05 | 2609.977 | ns/op | 0.007
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 128 | N/A | avgt | 10 | 28922269.1 | 28478100.28 | 66442.357 | ns/op | 0.016
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 256 | N/A | avgt | 10 | 56180868.95 | 55517886.54 | 48061.692 | ns/op | 0.012
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 512 | N/A | avgt | 10 | 110907025.6 | 110201411.8 | 65541.488 | ns/op | 0.006
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 24 | N/A | avgt | 10 | 9160226.336 | 9053922.422 | 30162.696 | ns/op | 0.012
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 36 | N/A | avgt | 10 | 11858619.51 | 11875124.68 | 30883.529 | ns/op | -0.001
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 72 | N/A | avgt | 10 | 18145983.4 | 17975971.55 | 42989.862 | ns/op | 0.009
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 128 | N/A | avgt | 10 | 30239261.15 | 29550149 | 93379.637 | ns/op | 0.023
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 256 | N/A | avgt | 10 | 57340601.37 | 56598661.49 | 63235.889 | ns/op | 0.013
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 512 | N/A | avgt | 10 | 111847228.7 | 111826244.4 | 30457.378 | ns/op | 0

+COH+AvoidUnalignedAccesses

"+COH+Avoid" | (delta) | (size) | (utf16) | Mode | Cnt | Score - master | Score - patch | Error | Units | Improvement
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 24 | N/A | avgt | 10 | 8824614.771 | 8879669.558 | 1136.78 | ns/op | -0.006
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 36 | N/A | avgt | 10 | 12453700.16 | 12452977.94 | 2412.16 | ns/op | 0
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 72 | N/A | avgt | 10 | 24742973.85 | 24723425.63 | 46396.337 | ns/op | 0.001
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 128 | N/A | avgt | 10 | 39673249.81 | 39681763.05 | 10340.791 | ns/op | 0
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 256 | N/A | avgt | 10 | 74004446.14 | 73731397.09 | 29337.751 | ns/op | 0.004
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 512 | N/A | avgt | 10 | 142806141.5 | 142275494.2 | 52159.737 | ns/op | 0.004
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 24 | N/A | avgt | 10 | 9935969.372 | 10045152.69 | 70155.123 | ns/op | -0.011
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 36 | N/A | avgt |
10 | 13255244.36 | 13280685.46 | 68940.937 | ns/op | -0.002 com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 72 | N/A | avgt | 10 | 25848901.97 | 25887363.77 | 16178.629 | ns/op | -0.001 com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 128 | N/A | avgt | 10 | 40859613.95 | 40908567.84 | 54621.931 | ns/op | -0.001 com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 256 | N/A | avgt | 10 | 74853691.65 | 74878469.62 | 31639.587 | ns/op | 0 com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 512 | N/A | avgt | 10 | 143431358.1 | 143495718.8 | 60271.214 | ns/op | 0 ------------- Commit messages: - initial commit Changes: https://git.openjdk.org/jdk/pull/23495/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23495&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349556 Stats: 20 lines in 2 files changed: 4 ins; 15 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23495.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23495/head:pull/23495 PR: https://git.openjdk.org/jdk/pull/23495 From kvn at openjdk.org Thu Feb 6 16:39:11 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 6 Feb 2025 16:39:11 GMT Subject: RFR: 8348570: CTW: Expose the code hidden by uncommon traps [v4] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 15:23:07 GMT, Aleksey Shipilev wrote: >> We have been looking at some related compiler behaviors, and realized that in the absence of profiling data, C2 routinely uncommon-traps a lot of code that is presumed to be never executed. This apparently is a norm in CTW tests: CTW runners never execute code, and so only the most basic java.base classes are having any profile. This seems to limit the scope of CTW testing. >> >> I think we need to run CTW in the mode that exposes more code to the compiler optimizations. 
>> >> Case in point: [JDK-8348572](https://bugs.openjdk.org/browse/JDK-8348572), which reliably fails with more aggressive compilation mode. >> >> Additional testing: >> - [x] Linux x86-64 server fastdebug, `applications/ctw/modules` >> - [x] Linux AArch64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge branch 'master' into JDK-8348570-ctw-uncommon-traps > - Merge branch 'master' into JDK-8348570-ctw-uncommon-traps > - Also do markMethodProfiled for extra scope > - Fix I submitted testing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23296#issuecomment-2640361059 From coleenp at openjdk.org Thu Feb 6 17:19:23 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 17:19:23 GMT Subject: RFR: 8349559: ci doesn't need to store protection domain Message-ID: <6rvXKu5uOti_qzs2SZ35wJy1Qq7XM7DY9lotXynoF9Y=.3ca45b44-d85b-4eff-a38f-83ff88a363b9@github.com> The compiler interface has a protection_domain field that it uses for matching in its version of not-yet loaded classes, but class loading only uses (class, class-loader) as an identifier for loaded classes so the compiler interface should do the same. From the code, I can't see any situation where the protection_domain wouldn't match if name and class loader match. Tested with tier1-7. 
-------------

Commit messages:
 - 8349559: ci doesn't need to store protection domain

Changes: https://git.openjdk.org/jdk/pull/23496/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23496&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8349559
  Stats: 41 lines in 5 files changed: 0 ins; 33 del; 8 mod
  Patch: https://git.openjdk.org/jdk/pull/23496.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23496/head:pull/23496

PR: https://git.openjdk.org/jdk/pull/23496

From jkarthikeyan at openjdk.org Thu Feb 6 17:35:57 2025
From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan)
Date: Thu, 6 Feb 2025 17:35:57 GMT
Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v2]
In-Reply-To:
References:
Message-ID:

> Hi all,
> This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine:
>
>
>                                               Baseline                   Patch
> Benchmark                  (SIZE)  Mode  Cnt    Score    Error  Units   Score   Error  Units  Improvement
> VectorSubword.intToByte      1024  avgt   12  200.049 ± 19.787  ns/op  56.228 ± 3.535  ns/op  (3.56x)
> VectorSubword.intToShort     1024  avgt   12  179.826 ±  1.539  ns/op  43.332 ± 1.166  ns/op  (4.15x)
> VectorSubword.shortToByte    1024  avgt   12  245.580 ±  6.150  ns/op  29.757 ± 1.055  ns/op  (8.25x)
>
>
> I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated!
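For readers without the attached benchmark at hand: judging only by the benchmark names in the table above, the loops in question presumably have the shape sketched below. The class and method names are illustrative assumptions, not the JMH source from the PR, and whether a given loop is actually vectorized depends on the backend reporting support for the cast.

```java
// Illustrative sketch only: loop shapes with a narrowing cast between
// arrays of different element types, as suggested by the benchmark names.
// Class and method names are hypothetical and not taken from the PR.
// Each method assumes dst is at least as long as src.
class SubwordCastSketch {
    // int -> byte: each element keeps only its low 8 bits.
    static void intToByte(int[] src, byte[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
    }

    // int -> short: each element keeps only its low 16 bits.
    static void intToShort(int[] src, short[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (short) src[i];
        }
    }

    // short -> byte: each element keeps only its low 8 bits.
    static void shortToByte(short[] src, byte[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
    }

    public static void main(String[] args) {
        int[] ints = {1, 300, -129, 65543};
        byte[] bytes = new byte[ints.length];
        intToByte(ints, bytes);
        System.out.println(java.util.Arrays.toString(bytes));
    }
}
```

The load and the store in each loop use different element types, which is the situation the description says used to make superword discard the packs.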
Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Implement widening and address comments from review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23413/files - new: https://git.openjdk.org/jdk/pull/23413/files/6108c5d1..3b5447f6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23413&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23413&range=00-01 Stats: 149 lines in 6 files changed: 122 ins; 3 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/23413.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23413/head:pull/23413 PR: https://git.openjdk.org/jdk/pull/23413 From jkarthikeyan at openjdk.org Thu Feb 6 17:35:57 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 6 Feb 2025 17:35:57 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v2] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 18:30:41 GMT, Emanuel Peter wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Implement widening and address comments from review > > Great work, looks generally amazing ? > > I left a few comments below. @eme64 I've pushed a new version that implements widening, as well as modifies the unit tests to use the new `Generators` framework. I've also filed [JDK-8349562](https://bugs.openjdk.org/browse/JDK-8349562) as followup work for `char` types. Let me know what you think! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23413#issuecomment-2640547848 From jbhateja at openjdk.org Thu Feb 6 17:49:54 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 6 Feb 2025 17:49:54 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets [v3] In-Reply-To: References: Message-ID: > Math.copySign is only intrinsified on x86 targets supporting the AVX512 feature. 
> Intel E-core Xeons support only the AVX2 feature set and still compile Java implementation which is composed of logical operations. > > Since there is a 3-cycle penalty for copying incoming float/double values to GPRs before being operated upon by logical operation there is an opportunity to optimize this using an efficient instruction sequence. > > Patch uses ANDPS and ANDPD logical instruction to generate efficient instruction sequences to absorb domain copy over penalty. Also, performs minor tuning for existing AVX512 instruction sequence based on VPTERNLOG instruction. > > Following are the performance numbers of the following existing microbenchmark > https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/vm/compiler/Signum.java > > Patch passes following validation test > [test/jdk/java/lang/Math/IeeeRecommendedTests.java > ](https://github.com/openjdk/jdk/blob/master/test/jdk/java/lang/Math/IeeeRecommendedTests.java) > > > Granite Rapids-AP (P-core Xeon) > Baseline AVX512: > Benchmark Mode Cnt Score Error Units > Signum._5_copySignFloatTest thrpt 2 1296.141 ops/ns > Signum._7_copySignDoubleTest thrpt 2 838.954 ops/ns > > Withopt : > Benchmark Mode Cnt Score Error Units > Signum._5_copySignFloatTest thrpt 2 940.240 ops/ns > Signum._7_copySignDoubleTest thrpt 2 967.370 ops/ns > > Baseline AVX2: > Benchmark Mode Cnt Score Error Units > Signum._5_copySignFloatTest thrpt 2 63.673 ops/ns > Signum._7_copySignDoubleTest thrpt 2 26.898 ops/ns > > Withopt : > Benchmark Mode Cnt Score Error Units > Signum._5_copySignFloatTest thrpt 2 785.801 ops/ns > Signum._7_copySignDoubleTest thrpt 2 558.710 ops/ns > > Sierra Forest (E-core Xeon) > Baseline: > Benchmark (seed) Mode Cnt Score Error Units > o.o.b.vm.compiler.Signum._5_copySignFloatTest N/A thrpt 2 40.528 ops/ns > o.o.b.vm.compiler.Signum._7_copySignDoubleTest N/A thrpt 2 25.101 ops/ns > > Withopt: > Benchmark (seed) Mode Cnt Score Error Units > o.o.b.vm.compiler.Signum._5_copySignFloatTest N/A thrpt 
2 676.101 ops/ns > o.o.b.vm.compiler.Signum._7_copySignDoubleTest N/A thrpt 2 ... Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Adding vector support along with some refactoring. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23386/files - new: https://git.openjdk.org/jdk/pull/23386/files/2181850d..a2548732 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23386&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23386&range=01-02 Stats: 216 lines in 10 files changed: 160 ins; 37 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/23386.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23386/head:pull/23386 PR: https://git.openjdk.org/jdk/pull/23386 From qamai at openjdk.org Thu Feb 6 19:11:58 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 6 Feb 2025 19:11:58 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: References: Message-ID: > Hi, > > This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: > > // We are allowed to use the constant type only if cast succeeded > > But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. > > Please take a look and leave your reviews, thanks a lot. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision:

 - Merge branch 'master' into loadklassctrl
 - format
 - clearer intention, revert formatting, add assert
 - remove always_see_exact_class
 - remove control input of LoadKlassNode

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/23274/files
  - new: https://git.openjdk.org/jdk/pull/23274/files/175232a6..7c2b595b

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=23274&range=04
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23274&range=03-04

  Stats: 34650 lines in 1350 files changed: 16246 ins; 10055 del; 8349 mod
  Patch: https://git.openjdk.org/jdk/pull/23274.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23274/head:pull/23274

PR: https://git.openjdk.org/jdk/pull/23274

From qamai at openjdk.org Thu Feb 6 19:15:13 2025
From: qamai at openjdk.org (Quan Anh Mai)
Date: Thu, 6 Feb 2025 19:15:13 GMT
Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5]
In-Reply-To:
References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com>
Message-ID: <38AZvEN6jWtzUKAm6eRqJwarn31L2bZYw4-MTClOMaQ=.828db7e7-115f-4856-afbf-6c00bbc34224@github.com>

On Wed, 5 Feb 2025 09:32:27 GMT, Emanuel Peter wrote:

>> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision:
>>
>> - Merge branch 'master' into loadklassctrl
>> - format
>> - clearer intention, revert formatting, add assert
>> - remove always_see_exact_class
>> - remove control input of LoadKlassNode
>
> Looks good, thanks for the explanations!
>
> I see we did not yet run internal tests for the last commit, though it is only formatting, so most likely ok.
> > But the state of the code is also 2 weeks old, so it would be good if you merged and launched testing again before integration, just in case ;) @eme64 I have merged the change with master, could you help me initiate the testing process, please? Thanks very much. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23274#issuecomment-2640769617 From qamai at openjdk.org Thu Feb 6 19:17:03 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 6 Feb 2025 19:17:03 GMT Subject: RFR: 8346836: C2: Introduce a way to verify the correctness of ConstraintCastNodes at runtime [v5] In-Reply-To: References: Message-ID: > Hi, > > This patch adds a develop flag `VerifyConstraintCasts`, which will verify the correctness of `CastIINode`s and `CastLLNode`s at runtime and crash the VM if the dynamic value lies outside the type value range. > > Please take a look, thanks a lot. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains seven additional commits since the last revision: - Merge branch 'master' into verifycast - better comments - move test to a new file, add block_comment - add tests - make VerifyConstraintCast uint, better debug info - Merge branch 'master' into verifycast - Introduce VerifyConstraintCasts ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22880/files - new: https://git.openjdk.org/jdk/pull/22880/files/7f2af65b..da854c1f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22880&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22880&range=03-04 Stats: 84692 lines in 4143 files changed: 34307 ins; 31257 del; 19128 mod Patch: https://git.openjdk.org/jdk/pull/22880.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22880/head:pull/22880 PR: https://git.openjdk.org/jdk/pull/22880 From qamai at openjdk.org Thu Feb 6 19:17:03 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 6 Feb 2025 19:17:03 GMT Subject: RFR: 8346836: C2: Introduce a way to verify the correctness of ConstraintCastNodes at runtime In-Reply-To: References: Message-ID: On Fri, 17 Jan 2025 16:41:00 GMT, Vladimir Kozlov wrote: >> Hi, >> >> This patch adds a develop flag `VerifyConstraintCasts`, which will verify the correctness of `CastIINode`s and `CastLLNode`s at runtime and crash the VM if the dynamic value lies outside the type value range. >> >> Please take a look, thanks a lot. > > We can add this flag to our stress testing sets of flags to make sure we run with it during our regular testing. @vnkozlov I don't have an AArch64 machine so I feel less confident writing one. We can add an AArch64 implementation later, though. What do you think? 
-------------

PR Comment: https://git.openjdk.org/jdk/pull/22880#issuecomment-2640767413

From jkarthikeyan at openjdk.org Thu Feb 6 19:50:55 2025
From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan)
Date: Thu, 6 Feb 2025 19:50:55 GMT
Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v3]
In-Reply-To:
References:
Message-ID:

> Hi all,
> This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine:
>
>
>                                               Baseline                   Patch
> Benchmark                  (SIZE)  Mode  Cnt    Score    Error  Units   Score   Error  Units  Improvement
> VectorSubword.intToByte      1024  avgt   12  200.049 ± 19.787  ns/op  56.228 ± 3.535  ns/op  (3.56x)
> VectorSubword.intToShort     1024  avgt   12  179.826 ±  1.539  ns/op  43.332 ± 1.166  ns/op  (4.15x)
> VectorSubword.shortToByte    1024  avgt   12  245.580 ±  6.150  ns/op  29.757 ± 1.055  ns/op  (8.25x)
>
>
> I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated!
Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Fix some tests that now vectorize ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23413/files - new: https://git.openjdk.org/jdk/pull/23413/files/3b5447f6..cf75b269 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23413&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23413&range=01-02 Stats: 9 lines in 2 files changed: 1 ins; 2 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/23413.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23413/head:pull/23413 PR: https://git.openjdk.org/jdk/pull/23413 From kvn at openjdk.org Thu Feb 6 20:05:11 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 6 Feb 2025 20:05:11 GMT Subject: RFR: 8348570: CTW: Expose the code hidden by uncommon traps [v4] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 15:23:07 GMT, Aleksey Shipilev wrote: >> We have been looking at some related compiler behaviors, and realized that in the absence of profiling data, C2 routinely uncommon-traps a lot of code that is presumed to be never executed. This apparently is a norm in CTW tests: CTW runners never execute code, and so only the most basic java.base classes are having any profile. This seems to limit the scope of CTW testing. >> >> I think we need to run CTW in the mode that exposes more code to the compiler optimizations. >> >> Case in point: [JDK-8348572](https://bugs.openjdk.org/browse/JDK-8348572), which reliably fails with more aggressive compilation mode. >> >> Additional testing: >> - [x] Linux x86-64 server fastdebug, `applications/ctw/modules` >> - [x] Linux AArch64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains four additional commits since the last revision: > > - Merge branch 'master' into JDK-8348570-ctw-uncommon-traps > - Merge branch 'master' into JDK-8348570-ctw-uncommon-traps > - Also do markMethodProfiled for extra scope > - Fix My tier1-4,xcomp,stress testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23296#pullrequestreview-2599769188 From shade at openjdk.org Thu Feb 6 20:23:18 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 6 Feb 2025 20:23:18 GMT Subject: Integrated: 8348570: CTW: Expose the code hidden by uncommon traps In-Reply-To: References: Message-ID: On Fri, 24 Jan 2025 11:15:33 GMT, Aleksey Shipilev wrote: > We have been looking at some related compiler behaviors, and realized that in the absence of profiling data, C2 routinely uncommon-traps a lot of code that is presumed to be never executed. This apparently is a norm in CTW tests: CTW runners never execute code, and so only the most basic java.base classes are having any profile. This seems to limit the scope of CTW testing. > > I think we need to run CTW in the mode that exposes more code to the compiler optimizations. > > Case in point: [JDK-8348572](https://bugs.openjdk.org/browse/JDK-8348572), which reliably fails with more aggressive compilation mode. > > Additional testing: > - [x] Linux x86-64 server fastdebug, `applications/ctw/modules` > - [x] Linux AArch64 server fastdebug, `applications/ctw/modules` This pull request has now been integrated. 
Changeset: 10791477
Author:    Aleksey Shipilev
URL:       https://git.openjdk.org/jdk/commit/10791477cf0a0a31d2703fc718a7a649d494d534
Stats:     22 lines in 2 files changed: 20 ins; 0 del; 2 mod

8348570: CTW: Expose the code hidden by uncommon traps

Reviewed-by: kvn, thartmann

-------------

PR: https://git.openjdk.org/jdk/pull/23296

From shade at openjdk.org Thu Feb 6 20:23:17 2025
From: shade at openjdk.org (Aleksey Shipilev)
Date: Thu, 6 Feb 2025 20:23:17 GMT
Subject: RFR: 8348570: CTW: Expose the code hidden by uncommon traps [v4]
In-Reply-To:
References:
Message-ID:

On Wed, 5 Feb 2025 15:23:07 GMT, Aleksey Shipilev wrote:

>> We have been looking at some related compiler behaviors, and realized that in the absence of profiling data, C2 routinely uncommon-traps a lot of code that is presumed to be never executed. This apparently is a norm in CTW tests: CTW runners never execute code, and so only the most basic java.base classes are having any profile. This seems to limit the scope of CTW testing.
>>
>> I think we need to run CTW in the mode that exposes more code to the compiler optimizations.
>>
>> Case in point: [JDK-8348572](https://bugs.openjdk.org/browse/JDK-8348572), which reliably fails with more aggressive compilation mode.
>>
>> Additional testing:
>> - [x] Linux x86-64 server fastdebug, `applications/ctw/modules`
>> - [x] Linux AArch64 server fastdebug, `applications/ctw/modules`
>
> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
>
> - Merge branch 'master' into JDK-8348570-ctw-uncommon-traps
> - Merge branch 'master' into JDK-8348570-ctw-uncommon-traps
> - Also do markMethodProfiled for extra scope
> - Fix

About half of my CTW pipeline passed without problems. So it does not look like there are major breakages with it.
I'll integrate and see what happens in weekly testing. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23296#issuecomment-2640911603 From sparasa at openjdk.org Thu Feb 6 20:32:01 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 6 Feb 2025 20:32:01 GMT Subject: RFR: 8349582: APX NDD code generation for OpenJDK Message-ID: The goal of this PR is to generate code using APX NDD instructions. ------------- Commit messages: - 8349582: APX NDD code generation for OpenJDK Changes: https://git.openjdk.org/jdk/pull/23501/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23501&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349582 Stats: 3815 lines in 5 files changed: 2181 ins; 58 del; 1576 mod Patch: https://git.openjdk.org/jdk/pull/23501.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23501/head:pull/23501 PR: https://git.openjdk.org/jdk/pull/23501 From vlivanov at openjdk.org Thu Feb 6 21:53:09 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 6 Feb 2025 21:53:09 GMT Subject: RFR: 8349559: ci doesn't need to store protection domain In-Reply-To: <6rvXKu5uOti_qzs2SZ35wJy1Qq7XM7DY9lotXynoF9Y=.3ca45b44-d85b-4eff-a38f-83ff88a363b9@github.com> References: <6rvXKu5uOti_qzs2SZ35wJy1Qq7XM7DY9lotXynoF9Y=.3ca45b44-d85b-4eff-a38f-83ff88a363b9@github.com> Message-ID: On Thu, 6 Feb 2025 17:14:44 GMT, Coleen Phillimore wrote: > The compiler interface has a protection_domain field that it uses for matching in its version of not-yet loaded classes, but class loading only uses (class, class-loader) as an identifier for loaded classes so the compiler interface should do the same. From the code, I can't see any situation where the protection_domain wouldn't match if name and class loader match. > Tested with tier1-7. Nice cleanup! Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23496#pullrequestreview-2600079707

From kxu at openjdk.org Thu Feb 6 23:34:46 2025
From: kxu at openjdk.org (Kangcheng Xu)
Date: Thu, 6 Feb 2025 23:34:46 GMT
Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value
Message-ID:

[JDK-8347555](https://bugs.openjdk.org/browse/JDK-8347555) is a redo of [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495), which was [first merged](https://git.openjdk.org/jdk/pull/20754) and then backed out due to a regression. This patch redoes the feature and fixes the bit shift overflow problem. For more information, please refer to the previous PR.

When constantizing multiplications (possibly in the form of `lshifts`), the multiplier is upgraded to long and then later narrowed to int if needed. However, when a `lshift` operand is exactly `32`, overflowing an int, using long has an unexpected result. (i.e., `(1 << 32) = 1` and `(int) (1L << 32) = 0`)

The following was implemented to address this issue.

    if (UseNewCode2) {
      *multiplier = bt == T_INT
          ? (jlong) (1 << con->get_int()) // loss of precision is expected for int as it overflows
          : ((jlong) 1) << con->get_int();
    } else {
      *multiplier = ((jlong) 1 << con->get_int());
    }

Two new bitshift overflow tests were added.

-------------

Commit messages:
 - remove UseNewCode
 - Merge branch 'master' into arithmetic-canonicalization
 - fix serial addition regression
 - remove trailing empty comments
 - fix comment grammar
 - remove matching power-of-2 subtractions since it's already handled by Identity()
 - verify results with custom test methods
 - update comments to be more descriptive, remove unused can_reshape argument
 - update comments, use explicit opcode comparisons for LShift nodes
 - extract pattern matching to separate functions
 - ...
and 35 more: https://git.openjdk.org/jdk/compare/a0c7f661...92100991

Changes: https://git.openjdk.org/jdk/pull/23506/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23506&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8347555
  Stats: 455 lines in 3 files changed: 455 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/23506.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23506/head:pull/23506

PR: https://git.openjdk.org/jdk/pull/23506

From dlong at openjdk.org Fri Feb 7 00:20:10 2025
From: dlong at openjdk.org (Dean Long)
Date: Fri, 7 Feb 2025 00:20:10 GMT
Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value
In-Reply-To:
References:
Message-ID:

On Thu, 6 Feb 2025 23:29:51 GMT, Kangcheng Xu wrote:

> [JDK-8347555](https://bugs.openjdk.org/browse/JDK-8347555) is a redo of [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495), which was [first merged](https://git.openjdk.org/jdk/pull/20754) and then backed out due to a regression. This patch redoes the feature and fixes the bit shift overflow problem. For more information, please refer to the previous PR.
>
> When constantizing multiplications (possibly in the form of `lshifts`), the multiplier is upgraded to long and then later narrowed to int if needed. However, when a `lshift` operand is exactly `32`, overflowing an int, using long has an unexpected result. (i.e., `(1 << 32) = 1` and `(int) (1L << 32) = 0`)
>
> The following was implemented to address this issue.
>
>     if (UseNewCode2) {
>       *multiplier = bt == T_INT
>           ? (jlong) (1 << con->get_int()) // loss of precision is expected for int as it overflows
>           : ((jlong) 1) << con->get_int();
>     } else {
>       *multiplier = ((jlong) 1 << con->get_int());
>     }
>
>
> Two new bitshift overflow tests were added.

It's not clear what result you are expecting from the new shift code when it overflows.
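The two Java-level results quoted in the PR description, `(1 << 32) = 1` and `(int) (1L << 32) = 0`, can be reproduced with a few lines of Java. This is only a sketch of the language semantics (the shift distance is masked to 5 bits for an `int` operand and to 6 bits for a `long` operand); the class and method names are made up for illustration, and it says nothing about how the C++ code in C2 should compute the multiplier.

```java
// Illustrative sketch of Java shift semantics; names are hypothetical.
class ShiftOverflowSketch {
    // int shift: only the low 5 bits of the distance are used,
    // so a distance of 32 behaves like a distance of 0.
    static int intShift(int distance) {
        return 1 << distance;
    }

    // long shift followed by narrowing: only the low 6 bits of the distance
    // are used, then the result is truncated to its low 32 bits.
    static int narrowedLongShift(int distance) {
        return (int) (1L << distance);
    }

    public static void main(String[] args) {
        System.out.println(intShift(32));          // prints 1
        System.out.println(narrowedLongShift(32)); // prints 0
        // For distances 0..31 the two forms agree, for example:
        System.out.println(intShift(31) == narrowedLongShift(31)); // prints true
    }
}
```

The two forms only diverge once the distance reaches 32, which matches the case singled out in the description.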
It looks like signed overflow undefined behavior (UB) to me, which would ideally be caught by UBSAN, so you probably want to make sure your changes are ubsan-clean. If the desired result for overflow is 0, then I think java_shift_left() should be used. ------------- Changes requested by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23506#pullrequestreview-2600434103 From epeter at openjdk.org Fri Feb 7 07:07:14 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Feb 2025 07:07:14 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 19:11:58 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: >> >> // We are allowed to use the constant type only if cast succeeded >> >> But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. >> >> Please take a look and leave your reviews, thanks a lot. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into loadklassctrl > - format > - clearer intention, revert formatting, add assert > - remove always_see_exact_class > - remove control input of LoadKlassNode Marked as reviewed by epeter (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/23274#pullrequestreview-2600950916 From epeter at openjdk.org Fri Feb 7 07:07:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Feb 2025 07:07:15 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: <38AZvEN6jWtzUKAm6eRqJwarn31L2bZYw4-MTClOMaQ=.828db7e7-115f-4856-afbf-6c00bbc34224@github.com> References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> <38AZvEN6jWtzUKAm6eRqJwarn31L2bZYw4-MTClOMaQ=.828db7e7-115f-4856-afbf-6c00bbc34224@github.com> Message-ID: On Thu, 6 Feb 2025 19:12:26 GMT, Quan Anh Mai wrote: >> Looks good, thanks for the explanations! >> >> I see we did not yet run internal tests for the last commit, though it is only formatting, so most most likely ok. >> >> But the state of the code is also 2 weeks old, so it would be good if you merged and launched testing again before integration, just in case ;) > > @eme64 I have merged the change with master, could you help me initiate the testing process, please? Thanks very much. @merykitty Testing launched! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23274#issuecomment-2642097456 From epeter at openjdk.org Fri Feb 7 07:23:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Feb 2025 07:23:17 GMT Subject: RFR: 8341293: Split field loads through Nested Phis [v7] In-Reply-To: <18TQt6vxN9KxSVwyeQtAWde-ezaVuUEioAl_5_3sAeE=.e5e76fb6-04a7-4f6f-9377-f1e64837ada6@github.com> References: <18TQt6vxN9KxSVwyeQtAWde-ezaVuUEioAl_5_3sAeE=.e5e76fb6-04a7-4f6f-9377-f1e64837ada6@github.com> Message-ID: On Fri, 24 Jan 2025 19:13:13 GMT, Dhamoder Nalla wrote: >> As an extension of the work done as part of https://github.com/openjdk/jdk/pull/12897, split the field loads (AddP -> Load*) with nested phi parent nodes to enable more scalar replacements, thereby reducing memory allocation. 
>>
>>
>> Here is the sequence of Ideal graph transformations for nested phis:
>>
>>
>> ![image](https://github.com/user-attachments/assets/c18e5ca0-c554-475c-814a-7cb288d96569)
>>
>> ![image](https://github.com/user-attachments/assets/b279b5f2-9ec6-4d9b-a627-506451f1cf81)
>>
>> ![image](https://github.com/user-attachments/assets/f506b918-2dd0-4dbe-a440-ff253afa3961)
>>
>> JMH results:
>> with disabled RAM
>>
>> Benchmark Mode Cnt Score Error Units
>> NestedPhiAndRematerialize.NopRAM.testBailOut_runner avgt 15 13.969 ± 0.248 ms/op
>> NestedPhiAndRematerialize.NopRAM.testFieldEscapeWithMerge_runner avgt 15 80.300 ± 4.306 ms/op
>> NestedPhiAndRematerialize.NopRAM.testMerge_TryCatchFinally_runner avgt 15 72.182 ± 1.781 ms/op
>> NestedPhiAndRematerialize.NopRAM.testMultiParentPhi_runner avgt 15 2.983 ± 0.001 ms/op
>> NestedPhiAndRematerialize.NopRAM.testNestedPhiPolymorphic_runner avgt 15 18.342 ± 0.731 ms/op
>> NestedPhiAndRematerialize.NopRAM.testNestedPhiProcessOrder_runner avgt 15 14.315 ± 0.443 ms/op
>> NestedPhiAndRematerialize.NopRAM.testNestedPhiWithLambda_runner avgt 15 18.511 ± 1.212 ms/op
>> NestedPhiAndRematerialize.NopRAM.testNestedPhiWithTrap_runner avgt 15 66.277 ± 1.478 ms/op
>> NestedPhiAndRematerialize.NopRAM.testNestedPhi_FieldLoad_runner avgt 15 17.968 ± 0.306 ms/op
>> NestedPhiAndRematerialize.NopRAM.testNestedPhi_TryCatch_runner avgt 15 14.186 ± 0.247 ms/op
>> NestedPhiAndRematerialize.NopRAM.testRematerialize_MultiObj_runner avgt 15 88.435 ± 4.869 ms/op
>> NestedPhiAndRematerialize.NopRAM.testRematerialize_SingleObj_runner avgt 15 29560.130 ± 48.797 ms/op
>> NestedPhiAndRematerialize.NopRAM.testRematerialize_TryCatch_runner avgt 15 49.150 ± 2.307 ms/op
>> NestedPhiAndRematerialize.NopRAM.testThreeLevelNestedPhi_runner avgt 15 18.236 ± 0.308 ms/op
>>
>> with enabled RAM
>> Benchmark Mode Cnt Score Error Units
>> NestedPhiAndRematerialize.YesRAM.testBailOut_runner avgt 15 3.257 ± 0.423 ms/op
>> NestedPhiAndRematerialize.YesRAM.testFieldEscapeWithMerge_runner avgt 15 79.916 ± 3.477 ms/op
>> NestedPhiAndRematerialize.YesRAM.testMerge_TryCatchFinally_runner avgt 15 72.053 ± 1.916 ms/op
>> NestedPhiAndRematerialize.YesRAM.testMultiParentPhi_runner avgt 15 2.984 ± 0.001 ms/op
>> NestedPhiAndRematerialize.YesRAM.testNestedPhiPolymorphic_runner avgt ...
>
> Dhamoder Nalla has updated the pull request incrementally with one additional commit since the last revision:
>
> Modify IR rules

We're a little limited on reviewers, so sorry this is taking a while to review.

I quickly scanned the code, and left some code-style comments. I also launched some testing.

Can you please update your PR description, and add some more background info about the code you are modifying? That would greatly help us review, since we would have to spend less time reading your code and the existing code.

Can you please also format the benchmark results in a code block, so it is easier to read?

src/hotspot/share/opto/escape.cpp line 1313:

> 1311: // Ensure that the splits are applied to the load fields of child phi nodes before the parent phi nodes take place.
> 1312: for (uint i = 0; i < nested_phis.size(); i++) {
> 1313: Node *nested_phi = nested_phis.at(i);

Suggestion:

Node* nested_phi = nested_phis.at(i);

Drive-by comment

src/hotspot/share/opto/memnode.cpp line 1661:

> 1659: if (base->in(i)->is_Phi()) {
> 1660: // base->in(i) is the parent phi node for base node.
> 1661: Node *mem_node_for_load_after_opt = get_memory_node_for_nestedphi_after_split(phase, base, i);

Suggestion:

Node* mem_node_for_load_after_opt = get_memory_node_for_nestedphi_after_split(phase, base, i);

Drive-by style comment

src/hotspot/share/opto/memnode.cpp line 1676:

> 1674: // Note that this function doesn't actually perform the split.
> 1675: // If a split is impossible, it returns nullptr.
> 1676: Node* LoadNode::get_memory_node_for_nestedphi_after_split(PhaseGVN* phase, Node *base, uint parent_idx) {

Suggestion:

Node* LoadNode::get_memory_node_for_nested_phi_after_split(PhaseGVN* phase, Node* base, uint parent_idx) {

Drive-by style comment

test/hotspot/jtreg/compiler/c2/irTests/scalarReplacement/AllocationMergesNestedPhiTests.java line 81:

> 79: "-XX:CompileCommand=exclude,*::dummy*");
> 80:
> 81: framework.addScenarios(scenario0, scenario1, scenario2).start();

Would it make sense to add a scenario without any flags, or at least with only the CompileCommand flags that restrict compilation / inlining?

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/21270#pullrequestreview-2600962969
PR Review Comment: https://git.openjdk.org/jdk/pull/21270#discussion_r1946054121
PR Review Comment: https://git.openjdk.org/jdk/pull/21270#discussion_r1946054632
PR Review Comment: https://git.openjdk.org/jdk/pull/21270#discussion_r1946055102
PR Review Comment: https://git.openjdk.org/jdk/pull/21270#discussion_r1946056790

From tschatzl at openjdk.org Fri Feb 7 08:42:16 2025
From: tschatzl at openjdk.org (Thomas Schatzl)
Date: Fri, 7 Feb 2025 08:42:16 GMT
Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v5]
In-Reply-To:
References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com>
Message-ID:

On Thu, 6 Feb 2025 08:49:28 GMT, Roberto Castañeda Lozano wrote:

>> G1 barriers can be safely elided from writes to newly allocated objects as long as no safepoint is taken between the allocation and the write. This changeset complements early G1 barrier elision (performed by the platform-independent phases of C2, and limited to writes immediately following allocations) with a more general elision pass done at a late stage.
>>
>> The late elision pass exploits that it runs at a stage where the relative order of memory accesses and safepoints cannot change anymore to elide barriers from initialization writes that do not immediately follow the corresponding allocation, e.g. in conditional initialization writes:
>>
>>
>>     o = new MyObject();
>>     if (...) {
>>       o.myField = ...; // barrier elided only after this changeset
>>                        // (assuming no safepoint in the if condition)
>>     }
>>
>>
>> or in initialization writes placed after exception-throwing checks:
>>
>>
>>     o = new MyObject();
>>     if (...) {
>>       throw new Exception("");
>>     }
>>     o.myField = ...; // barrier elided only after this changeset
>>                      // (assuming no safepoint in the above if condition)
>>
>>
>> These patterns are commonly found in Java code, e.g. in the core libraries:
>>
>> - [conditional initialization](https://github.com/openjdk/jdk/blob/25fecaaf87400af535c242fe50296f1f89ceeb16/src/java.base/share/classes/java/lang/String.java#L4850), or
>>
>> - [initialization after exception-throwing checks (in the superclass constructor)](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/nio/X-Buffer.java.template#L324).
>>
>> The optimization also enhances barrier elision for array initialization writes, for example eliding barriers from small array initialization loops (for which safepoints are not inserted):
>>
>>
>>     Object[] a = new Object[...];
>>     for (int i = 0; i < a.length; i++) {
>>       a[i] = ...; // barrier elided only after this changeset
>>     }
>>
>>
>> or eliding barriers from array initialization writes with unknown array index:
>>
>>
>>     Object[] a = new Object[...];
>>     a[index] = ...; // barrier elided only after this changeset
>>
>>
>> The logic used to perform this additional barrier elision is a subset of a pre-existing ZGC-specific optimization.
This changeset simply reuses the relevant subset (barrier elision for writes to newly-allocated objects) by extracting the core of the optimization logic from `zBarrierSetC2.cpp` into the GC-shared file `barrierSetC2.cpp`. The functions `block_has_safepoint`, `block_inde...
>
> Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision:
>
> Update copyright headers

Afaict this is good.

-------------

Marked as reviewed by tschatzl (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/23235#pullrequestreview-2601121948

From roland at openjdk.org Fri Feb 7 08:45:34 2025
From: roland at openjdk.org (Roland Westrelin)
Date: Fri, 7 Feb 2025 08:45:34 GMT
Subject: RFR: 8341976: C2: use_mem_state != load->find_exact_control(load->in(0)) assert failure [v2]
In-Reply-To:
References:
Message-ID:

> The `arraycopy` writes to a non escaping array so its `ArrayCopy` node
> is marked as having a narrow memory effect. One of the loads from the
> destination after the copy is transformed into a load from the source
> array (the rationale being that if there's no load from the
> destination of the copy, the `arraycopy` is not needed). The load from
> the source has the input memory state of the `ArrayCopy` as memory
> input. That load is then sunk out of the loop and its control is
> updated to be after the `ArrayCopy`. That's legal because the
> `ArrayCopy` only has a narrow memory effect and can't modify the
> source. The `ArrayCopy` can't be eliminated and is expanded. In the
> process, a `MemBar` that has a wide memory effect is added. The load
> from the source has control after the membar but memory state before
> and because the membar has a wide memory effect, the load is anti
> dependent on the membar: the graph is broken (the load can't be pinned
> after the membar and anti dependent on it).
>
> In short, the problem is that the graph is transformed under the
> assumption that the `ArrayCopy` has a narrow effect but the
> `ArrayCopy` is expanded to a subgraph that has a wide memory
> effect. The fix I propose is to not insert a membar with a wide memory
> effect. We still need a membar when the destination is non escaping
> because the expanded `ArrayCopy`, if it writes to a tightly allocated
> array, writes to raw memory and not to the destination memory slice.

Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision:

review

-------------

Changes:
- all: https://git.openjdk.org/jdk/pull/23465/files
- new: https://git.openjdk.org/jdk/pull/23465/files/20f0fb62..06ee02de

Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=23465&range=01
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=23465&range=00-01

Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod
Patch: https://git.openjdk.org/jdk/pull/23465.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23465/head:pull/23465

PR: https://git.openjdk.org/jdk/pull/23465

From roland at openjdk.org Fri Feb 7 08:45:35 2025
From: roland at openjdk.org (Roland Westrelin)
Date: Fri, 7 Feb 2025 08:45:35 GMT
Subject: RFR: 8341976: C2: use_mem_state != load->find_exact_control(load->in(0)) assert failure [v2]
In-Reply-To:
References:
Message-ID:

On Thu, 6 Feb 2025 11:28:05 GMT, Galder Zamarreño wrote:

>> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision:
>>
>> review
>
> src/hotspot/share/opto/macroArrayCopy.cpp line 831:
>
>> 829: insert_mem_bar(ctrl, &out_mem, Op_MemBarStoreStore, Compile::AliasIdxBot);
>> 830: } else {
>> 831: int alias_idx = Compile::AliasIdxBot;
>
> Minor thing, `alias_idx` is already defined in the method with a different type. Would it make sense to use a different name here?
>
> Earlier definition:
>
> uint alias_idx = C->get_alias_index(adr_type);

Good catch.
I renamed `alias_idx`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23465#discussion_r1946160388 From roland at openjdk.org Fri Feb 7 08:47:50 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 7 Feb 2025 08:47:50 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v3] In-Reply-To: References: Message-ID: > This change refactors `RShiftI`/`RshiftL` `Ideal`, `Identity` and > `Value` because the `int` and `long` versions are very similar and so > there's no logic duplication. In the process, support for some extra > transformations is added to `RShiftL`. I also added some new test > cases. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/type.cpp Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23438/files - new: https://git.openjdk.org/jdk/pull/23438/files/a1225f74..f50d46ab Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23438.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23438/head:pull/23438 PR: https://git.openjdk.org/jdk/pull/23438 From roland at openjdk.org Fri Feb 7 08:52:47 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 7 Feb 2025 08:52:47 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v4] In-Reply-To: References: Message-ID: > This change refactors `RShiftI`/`RshiftL` `Ideal`, `Identity` and > `Value` because the `int` and `long` versions are very similar and so > there's no logic duplication. In the process, support for some extra > transformations is added to `RShiftL`. I also added some new test > cases. 
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision:

Update src/hotspot/share/opto/mulnode.cpp

Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com>

-------------

Changes:
- all: https://git.openjdk.org/jdk/pull/23438/files
- new: https://git.openjdk.org/jdk/pull/23438/files/f50d46ab..2281946e

Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=03
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=02-03

Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/23438.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23438/head:pull/23438

PR: https://git.openjdk.org/jdk/pull/23438

From rcastanedalo at openjdk.org Fri Feb 7 09:19:13 2025
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Fri, 7 Feb 2025 09:19:13 GMT
Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v5]
In-Reply-To:
References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com>
Message-ID: <6WMofkASYawj1iolPRb1_3GIgpjJ_5ggK-nnnMXdYII=.58aff13a-49b3-430f-a37e-c2dea123bd97@github.com>

On Fri, 7 Feb 2025 08:40:03 GMT, Thomas Schatzl wrote:

> Afaict this is good.

Thanks for reviewing, Thomas!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23235#issuecomment-2642378742

From jbhateja at openjdk.org Fri Feb 7 09:23:13 2025
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Fri, 7 Feb 2025 09:23:13 GMT
Subject: RFR: 8349582: APX NDD code generation for OpenJDK
In-Reply-To:
References:
Message-ID:

On Thu, 6 Feb 2025 20:26:49 GMT, Srinivas Vamsi Parasa wrote:

> The goal of this PR is to generate code using APX NDD instructions.

Some initial comments.

src/hotspot/cpu/x86/x86_64.ad line 1555:

> 1553: } else {
> 1554: return (offset < 0x80) ? 5 : 8; // REX
> 1555: }

Please move this out to a separate patch.
src/hotspot/cpu/x86/x86_64.ad line 5804:

> 5802: %}
> 5803:
> 5804: instruct countTrailingZerosL_mem_nf(rRegI dst, memory src, rFlagsReg cr) %{

Why are we using an rFlagsReg operand for non-flags-affecting patterns?

src/hotspot/cpu/x86/x86_64.ad line 5809:

> 5807: effect(KILL cr);
> 5808:
> 5809: ins_cost(175);

Could you please let me know why you're using ins_cost here and not in the pattern above?

src/hotspot/cpu/x86/x86_64.ad line 7322:

> 7320: ins_encode %{
> 7321: __ eaddq($dst$$Register, $src1$$Register, $src2$$Register, false);
> 7322: %}

The current scheme favors emission of NDD instructions on APX-enabled targets, even if the destination and source registers belong to the legacy GPR set. We should extend the assembler layer to demote EEVEX to REX encoding if dst matches one of the source operands. FYI, the processor backend also uses move elimination to prevent dispatch of GPR-to-GPR moves to execution ports. This can be used to further break up NDD patterns whose legacy register operands are all different: a GPR-to-GPR move should consume no more than 3 bytes of encoding (1 byte for the REX prefix, 1 for the opcode and 1 for the ModRM byte), still less than the 4-byte prefix.

src/hotspot/cpu/x86/x86_64.ad line 7359:

> 7357: effect(KILL cr);
> 7358:
> 7359: format %{ "eaddq $dst, $src1, $src2\t# long ndd" %}

Can you kindly share the impact of adding all these new patterns on the libjvm.so size? I am curious to know the amount of ADLC-generated code for all these new patterns.
-------------

PR Review: https://git.openjdk.org/jdk/pull/23501#pullrequestreview-2601192411
PR Review Comment: https://git.openjdk.org/jdk/pull/23501#discussion_r1946208661
PR Review Comment: https://git.openjdk.org/jdk/pull/23501#discussion_r1946212191
PR Review Comment: https://git.openjdk.org/jdk/pull/23501#discussion_r1946216069
PR Review Comment: https://git.openjdk.org/jdk/pull/23501#discussion_r1946202682
PR Review Comment: https://git.openjdk.org/jdk/pull/23501#discussion_r1946203337

From rcastanedalo at openjdk.org Fri Feb 7 09:24:13 2025
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Fri, 7 Feb 2025 09:24:13 GMT
Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v4]
In-Reply-To:
References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> <-aHCYC9iVc4eMZ3pMfiDpqaW-wGM_s3zRMiVBWoadCM=.910336cd-3be2-45b5-9874-63b71abf38f8@github.com>
Message-ID:

On Wed, 5 Feb 2025 17:51:36 GMT, Amit Kumar wrote:

> I see TestG1BarrierGeneration.java failure :(
>
> [TestG1BarrierGeneration_jtr.log](https://github.com/user-attachments/files/18676532/TestG1BarrierGeneration_jtr.log)

@offamitkumar thanks for the report! Most likely the test failures are only due to missing optimizations (because of limitations in the barrier elision pattern matching analysis), but if you want me to confirm please send the entire jtreg log, without truncation. You can disable output truncation running the test like this:

`make run-test TEST="compiler/gcbarriers/TestG1BarrierGeneration.java" JTREG="MAX_OUTPUT=999999999"`

Please double-check that the output log file does not contain any `Output overflow` message.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23235#issuecomment-2642388571 From roland at openjdk.org Fri Feb 7 09:30:25 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 7 Feb 2025 09:30:25 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v5] In-Reply-To: References: Message-ID: > This change refactors `RShiftI`/`RshiftL` `Ideal`, `Identity` and > `Value` because the `int` and `long` versions are very similar and so > there's no logic duplication. In the process, support for some extra > transformations is added to `RShiftL`. I also added some new test > cases. Roland Westrelin has updated the pull request incrementally with two additional commits since the last revision: - Update src/hotspot/share/opto/mulnode.cpp Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> - Update src/hotspot/share/opto/mulnode.cpp Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23438/files - new: https://git.openjdk.org/jdk/pull/23438/files/2281946e..b8f1cf6d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=03-04 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23438.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23438/head:pull/23438 PR: https://git.openjdk.org/jdk/pull/23438 From fyang at openjdk.org Fri Feb 7 09:34:10 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 7 Feb 2025 09:34:10 GMT Subject: RFR: 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison In-Reply-To: References: Message-ID: <5flniWfQCgZhzTfYjUcwL27w2JZnj1h-IUldnvk0y_w=.fe2e5770-7ba7-406f-b5dd-38dfdceec1c5@github.com> On Thu, 6 Feb 2025 15:59:16 GMT, Hamlin Li wrote: > Hi, > > Can you help to review the 
patch?
>
> It tries to improve string comparison when AvoidUnalignedAccesses == false && the encoding is LU or UL (i.e. the two strings' encodings differ from each other).
> The JMH test shows that with `-CompactObjectHeaders` (i.e. -COH) && `-AvoidUnalignedAccesses`, the patch brings much better performance, and in other cases it does not bring any obvious regression. Currently -COH is the default.
>
> Thanks
>
> ### Performance
>
> -COH-AvoidUnalignedAccesses
>
> -COH-Avoid | (delta) | (size) | (utf16) | Mode | Cnt | Score - master | Score - patch | Error | Units | Improvement
> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 24 | N/A | avgt | 10 | 6438443.073 | 6383881.891 | 36912.539 | ns/op | 0.009
> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 36 | N/A | avgt | 10 | 9421176.34 | 9390907.1 | 21034.266 | ns/op | 0.003
> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 72 | N/A | avgt | 10 | 18592342.33 | 16871350.38 | 15550.827 | ns/op | 0.102
> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 128 | N/A | avgt | 10 | 30916157.05 | 28646961.11 | 9263.556 | ns/op | 0.079
> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 256 | N/A | avgt | 10 | 58945069.17 | 55505097.77 | 8803.847 | ns/op | 0.062
> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 512 | N/A | avgt | 10 | 115520355.5 | 110233842.8 | 35056.972 | ns/op | 0.048
> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 24 | N/A | avgt | 10 | 7541299.83 | 7481385.995 | 43240.713 | ns/op | 0.008
> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 36 | N/A | avgt | 10 | 10295051.77 | 10264978.04 | 38938.956 | ns/op | 0.003
> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 72 | N/A | avgt | 10 | 19652419.64 | 17953481.41 | 10987.17 | ns/op | 0.095
> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 128 | N/A | avgt ...

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 2528:

> 2526: tmpL = isLU ? tmp1 : tmp2; // where to keep L for comparison
> 2527:
> 2528: if (AvoidUnalignedAccesses && (base_offset1 % 8) != 0) {

I find that a similar check is in `C2_MacroAssembler::string_compare` for the UU/LL cases [1].
Seems more consistent if we move it into the counterpart `generate_compare_long_string_same_encoding`.

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1443

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23495#discussion_r1946234489

From roland at openjdk.org Fri Feb 7 09:37:49 2025
From: roland at openjdk.org (Roland Westrelin)
Date: Fri, 7 Feb 2025 09:37:49 GMT
Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v6]
In-Reply-To:
References:
Message-ID:

> This change refactors `RShiftI`/`RShiftL` `Ideal`, `Identity` and
> `Value` because the `int` and `long` versions are very similar and so
> there's no logic duplication. In the process, support for some extra
> transformations is added to `RShiftL`. I also added some new test
> cases.
Roland Westrelin has updated the pull request incrementally with two additional commits since the last revision: - Update src/hotspot/share/opto/mulnode.hpp Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> - Update src/hotspot/share/opto/mulnode.cpp Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23438/files - new: https://git.openjdk.org/jdk/pull/23438/files/b8f1cf6d..8bac2cab Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=04-05 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23438.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23438/head:pull/23438 PR: https://git.openjdk.org/jdk/pull/23438 From roland at openjdk.org Fri Feb 7 09:52:51 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 7 Feb 2025 09:52:51 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v7] In-Reply-To: References: Message-ID: <3cT_HJ9dj5J4NFrLzmvYUdUy4uee6Ltcm6d20YP3jm0=.aa20c25e-c097-4e59-9d82-12aa2c3b4422@github.com> > This change refactors `RShiftI`/`RshiftL` `Ideal`, `Identity` and > `Value` because the `int` and `long` versions are very similar and so > there's no logic duplication. In the process, support for some extra > transformations is added to `RShiftL`. I also added some new test > cases. 
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23438/files - new: https://git.openjdk.org/jdk/pull/23438/files/8bac2cab..e4053783 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=05-06 Stats: 2 lines in 2 files changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23438.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23438/head:pull/23438 PR: https://git.openjdk.org/jdk/pull/23438 From roland at openjdk.org Fri Feb 7 09:57:10 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 7 Feb 2025 09:57:10 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v2] In-Reply-To: <1kjZnYmjzNrXuXFsPlpCY_LAPHEQz30i_RpDmr3Xh80=.307d3d09-c4d2-4211-8e0b-5e8beb3b8f3c@github.com> References: <1kjZnYmjzNrXuXFsPlpCY_LAPHEQz30i_RpDmr3Xh80=.307d3d09-c4d2-4211-8e0b-5e8beb3b8f3c@github.com> Message-ID: On Wed, 5 Feb 2025 15:45:08 GMT, Jasmine Karthikeyan wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> review > > src/hotspot/share/utilities/globalDefinitions.hpp line 799: > >> 797: return BitsPerJavaInteger; >> 798: } >> 799: return BitsPerJavaLong; > > I think it'd be nice to add `assert(bt == T_LONG, "unsupported");` before the last return, like in the helper methods above. Indeed. Added in new commit. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1946266019 From roland at openjdk.org Fri Feb 7 10:06:10 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 7 Feb 2025 10:06:10 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v2] In-Reply-To: <1kjZnYmjzNrXuXFsPlpCY_LAPHEQz30i_RpDmr3Xh80=.307d3d09-c4d2-4211-8e0b-5e8beb3b8f3c@github.com> References: <1kjZnYmjzNrXuXFsPlpCY_LAPHEQz30i_RpDmr3Xh80=.307d3d09-c4d2-4211-8e0b-5e8beb3b8f3c@github.com> Message-ID: <2I5BKquTjwv3_pAmR-YQs-N3KmMbJ0MgszHuYO6AUsk=.5df8d0a3-aa19-4433-a276-473411b7c5a2@github.com> On Wed, 5 Feb 2025 16:00:08 GMT, Jasmine Karthikeyan wrote: > This is really nice! I'd wondered why there was no `RShiftL::Ideal`, and it's nice to have it handled it in a generic way with the integer version. I left mostly code style comments here. Thanks for reviewing this. I pushed a new commit that takes your comments into account. > src/hotspot/share/opto/mulnode.cpp line 1399: > >> 1397: assert(lo <= hi, "must have valid bounds"); >> 1398: #ifdef ASSERT >> 1399: if (bt ==T_INT) { > > Suggestion: > > if (bt == T_INT) { > > Could this assert be generic to also handle T_LONG too? 
The assert checks that, for the int case:

long lo;
assert((int)(lo >> shift) == (((int)lo) >> shift), "");

For long, it would be:

long lo;
assert((long)(lo >> shift) == (((long)lo) >> shift), "");

Given everything is already a long, that's:

long lo;
assert(lo >> shift == lo >> shift, "");

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23438#issuecomment-2642478546
PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1946277125

From mli at openjdk.org Fri Feb 7 10:40:02 2025
From: mli at openjdk.org (Hamlin Li)
Date: Fri, 7 Feb 2025 10:40:02 GMT
Subject: RFR: 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison
In-Reply-To: <5flniWfQCgZhzTfYjUcwL27w2JZnj1h-IUldnvk0y_w=.fe2e5770-7ba7-406f-b5dd-38dfdceec1c5@github.com>
References: <5flniWfQCgZhzTfYjUcwL27w2JZnj1h-IUldnvk0y_w=.fe2e5770-7ba7-406f-b5dd-38dfdceec1c5@github.com>
Message-ID: <2Kl4-7sAM7HLA7wj90Vh0gl_FIzZTqN8n8YR9VR5SLI=.fc68f783-b760-44de-af81-4e494dcc06ba@github.com>

On Fri, 7 Feb 2025 09:31:26 GMT, Fei Yang wrote:

>> Hi,
>>
>> Can you help to review the patch?
>>
>> It tries to improve string comparison when AvoidUnalignedAccesses == false && the encoding is LU or UL (i.e. the two strings' encodings differ from each other).
>> The JMH test shows that with `-CompactObjectHeaders` (i.e. -COH) && `-AvoidUnalignedAccesses`, the patch brings much better performance, and in other cases it does not bring any obvious regression. Currently -COH is the default.
>>
>> Thanks
>>
>> ### Performance
>>
>> -COH-AvoidUnalignedAccesses
>>
>> -COH-Avoid | (delta) | (size) | (utf16) | Mode | Cnt | Score - master | Score - patch | Error | Units | Improvement
>> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
>> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 24 | N/A | avgt | 10 | 6438443.073 | 6383881.891 | 36912.539 | ns/op | 0.009
>> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 36 | N/A | avgt | 10 | 9421176.34 | 9390907.1 | 21034.266 | ns/op | 0.003
>> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 72 | N/A | avgt | 10 | 18592342.33 | 16871350.38 | 15550.827 | ns/op | 0.102
>> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 128 | N/A | avgt | 10 | 30916157.05 | 28646961.11 | 9263.556 | ns/op | 0.079
>> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 256 | N/A | avgt | 10 | 58945069.17 | 55505097.77 | 8803.847 | ns/op | 0.062
>> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 512 | N/A | avgt | 10 | 115520355.5 | 110233842.8 | 35056.972 | ns/op | 0.048
>> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 24 | N/A | avgt | 10 | 7541299.83 | 7481385.995 | 43240.713 | ns/op | 0.008
>> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 36 | N/A | avgt | 10 | 10295051.77 | 10264978.04 | 38938.956 | ns/op | 0.003
>> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 72 | N/A | avgt | 10 | 19652419.64 | 17953481.41 | 10987.17 | ns/op | 0.095
>> com.arm.benchmarks.intrinsics.StringCompareToD...
>
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 2528:
>
>> 2526: tmpL = isLU ? tmp1 : tmp2; // where to keep L for comparison
>> 2527:
>> 2528: if (AvoidUnalignedAccesses && (base_offset1 % 8) != 0) {
>
> I find that a similar check is in `C2_MacroAssembler::string_compare` for the UU/LL cases [1].
> Seems more consistent if we move it into the counterpart `generate_compare_long_string_same_encoding`. > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1443 Agree, we should refactor the code a bit to make it more readable. As it seems just a refactor, so I can do it in another pr, how do you think about it? At the same time I can also clean the invocation of `compare_string_16_bytes_same` from `generate_compare_long_string_same_encoding`, I don't like the implicit registers passing between them. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23495#discussion_r1946313359 From fyang at openjdk.org Fri Feb 7 11:11:10 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 7 Feb 2025 11:11:10 GMT Subject: RFR: 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 15:59:16 GMT, Hamlin Li wrote: > Hi, > > Can you help to review the patch? > > It tries to improve the string compare when AvoidUnalignedAccesses == false && encoding is LU or UL (i.e. 2 strings encodings are different with each other). > The jmh test shows when `-CompactObjectHeaders` (i.e. -COH) && `-AvoidUnalignedAccesses`, the patch bring much better performance, and in other cases, it does not bring obvious regression. And currently by default it's -COH. > > Thanks > > ### Performance > > it's run on bananapi. > > -COH-AvoidUnalignedAccesses > > ?-COH-Avoid? 
| (delta) | (size) | (utf16) | Mode | Cnt | Score - master | Score - patch | Error | Units | Improvement > -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 24 | N/A | avgt | 10 | 6438443.073 | 6383881.891 | 36912.539 | ns/op | 0.009 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 36 | N/A | avgt | 10 | 9421176.34 | 9390907.1 | 21034.266 | ns/op | 0.003 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 72 | N/A | avgt | 10 | 18592342.33 | 16871350.38 | 15550.827 | ns/op | 0.102 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 128 | N/A | avgt | 10 | 30916157.05 | 28646961.11 | 9263.556 | ns/op | 0.079 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 256 | N/A | avgt | 10 | 58945069.17 | 55505097.77 | 8803.847 | ns/op | 0.062 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 512 | N/A | avgt | 10 | 115520355.5 | 110233842.8 | 35056.972 | ns/op | 0.048 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 24 | N/A | avgt | 10 | 7541299.83 | 7481385.995 | 43240.713 | ns/op | 0.008 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 36 | N/A | avgt | 10 | 10295051.77 | 10264978.04 | 38938.956 | ns/op | 0.003 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 72 | N/A | avgt | 10 | 19652419.64 | 17953481.41 | 10987.17 | ns/op | 0.095 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL ... Marked as reviewed by fyang (Reviewer). src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 2529: > 2527: > 2528: if (AvoidUnalignedAccesses && (base_offset1 % 8) != 0) { > 2529: // Load another 4 bytes from strL to make sure main loop is 8-byte aligned Nit: You might want to update this code comment removing the word `another`. 
`// Load another 4 bytes from strL to make sure main loop is 8-byte aligned` => `// Load 4 bytes from strL to make sure main loop is 8-byte aligned` ------------- PR Review: https://git.openjdk.org/jdk/pull/23495#pullrequestreview-2601459963 PR Review Comment: https://git.openjdk.org/jdk/pull/23495#discussion_r1946363699 From fyang at openjdk.org Fri Feb 7 11:11:11 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 7 Feb 2025 11:11:11 GMT Subject: RFR: 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison In-Reply-To: <2Kl4-7sAM7HLA7wj90Vh0gl_FIzZTqN8n8YR9VR5SLI=.fc68f783-b760-44de-af81-4e494dcc06ba@github.com> References: <5flniWfQCgZhzTfYjUcwL27w2JZnj1h-IUldnvk0y_w=.fe2e5770-7ba7-406f-b5dd-38dfdceec1c5@github.com> <2Kl4-7sAM7HLA7wj90Vh0gl_FIzZTqN8n8YR9VR5SLI=.fc68f783-b760-44de-af81-4e494dcc06ba@github.com> Message-ID: On Fri, 7 Feb 2025 10:29:24 GMT, Hamlin Li wrote: >> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 2528: >> >>> 2526: tmpL = isLU ? tmp1 : tmp2; // where to keep L for comparison >>> 2527: >>> 2528: if (AvoidUnalignedAccesses && (base_offset1 % 8) != 0) { >> >> I find that a similar check is in `C2_MacroAssembler::string_compare` for the UU/LL cases [1]. >> Seems more consistent if we move it into the counterpart `generate_compare_long_string_same_encoding`. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1443 > > Agree, we should refactor the code a bit to make it more readable. > As it seems just a refactor, so I can do it in another pr, how do you think about it? > At the same time I can also clean the invocation of `compare_string_16_bytes_same` from `generate_compare_long_string_same_encoding`, I don't like the implicit registers passing between them. Sure. Seems the LL/UU cases are kind of different as they already emit direct 8-byte loads before the stub. So not sure if it's doable to move the check. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23495#discussion_r1946363295 From amitkumar at openjdk.org Fri Feb 7 12:02:22 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Fri, 7 Feb 2025 12:02:22 GMT Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v4] In-Reply-To: References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> <-aHCYC9iVc4eMZ3pMfiDpqaW-wGM_s3zRMiVBWoadCM=.910336cd-3be2-45b5-9874-63b71abf38f8@github.com> Message-ID: On Fri, 7 Feb 2025 09:21:39 GMT, Roberto Castañeda Lozano wrote: >> I see TestG1BarrierGeneration.java failure :( >> >> [TestG1BarrierGeneration_jtr.log](https://github.com/user-attachments/files/18676532/TestG1BarrierGeneration_jtr.log) > >> I see TestG1BarrierGeneration.java failure :( >> >> [TestG1BarrierGeneration_jtr.log](https://github.com/user-attachments/files/18676532/TestG1BarrierGeneration_jtr.log) > > @offamitkumar thanks for the report! Most likely the test failures are only due to missing optimizations (because of limitations in the barrier elision pattern matching analysis), but if you want me to confirm please send the entire jtreg log, without truncation. You can disable output truncation running the test like this: > `make run-test TEST="compiler/gcbarriers/TestG1BarrierGeneration.java" JTREG="MAX_OUTPUT=999999999"` > Please double-check that the output log file does not contain any `Output overflow` message. @robcasloz Sure, I can spend time on it, maybe on the weekend; for now I am overloaded with some other tasks. 
[TestG1BarrierGeneration_jtr_no_overflow.log](https://github.com/user-attachments/files/18706090/TestG1BarrierGeneration_jtr_no_overflow.log) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23235#issuecomment-2642733177 From mli at openjdk.org Fri Feb 7 12:29:33 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 7 Feb 2025 12:29:33 GMT Subject: RFR: 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison [v2] In-Reply-To: References: Message-ID: > Hi, > > Can you help to review the patch? > > It tries to improve the string compare when AvoidUnalignedAccesses == false && encoding is LU or UL (i.e. 2 strings encodings are different with each other). > The jmh test shows when `-CompactObjectHeaders` (i.e. -COH) && `-AvoidUnalignedAccesses`, the patch bring much better performance, and in other cases, it does not bring obvious regression. And currently by default it's -COH. > > Thanks > > ### Performance > > it's run on bananapi. > > -COH-AvoidUnalignedAccesses > > ?-COH-Avoid? 
| (delta) | (size) | (utf16) | Mode | Cnt | Score - master | Score - patch | Error | Units | Improvement > -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 24 | N/A | avgt | 10 | 6438443.073 | 6383881.891 | 36912.539 | ns/op | 0.009 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 36 | N/A | avgt | 10 | 9421176.34 | 9390907.1 | 21034.266 | ns/op | 0.003 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 72 | N/A | avgt | 10 | 18592342.33 | 16871350.38 | 15550.827 | ns/op | 0.102 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 128 | N/A | avgt | 10 | 30916157.05 | 28646961.11 | 9263.556 | ns/op | 0.079 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 256 | N/A | avgt | 10 | 58945069.17 | 55505097.77 | 8803.847 | ns/op | 0.062 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 512 | N/A | avgt | 10 | 115520355.5 | 110233842.8 | 35056.972 | ns/op | 0.048 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 24 | N/A | avgt | 10 | 7541299.83 | 7481385.995 | 43240.713 | ns/op | 0.008 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 36 | N/A | avgt | 10 | 10295051.77 | 10264978.04 | 38938.956 | ns/op | 0.003 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 72 | N/A | avgt | 10 | 19652419.64 | 17953481.41 | 10987.17 | ns/op | 0.095 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL ... 
Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: refine comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23495/files - new: https://git.openjdk.org/jdk/pull/23495/files/964cec90..82fd643b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23495&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23495&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23495.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23495/head:pull/23495 PR: https://git.openjdk.org/jdk/pull/23495 From mli at openjdk.org Fri Feb 7 12:29:34 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 7 Feb 2025 12:29:34 GMT Subject: RFR: 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison [v2] In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 11:08:41 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> refine comments > > src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 2529: > >> 2527: >> 2528: if (AvoidUnalignedAccesses && (base_offset1 % 8) != 0) { >> 2529: // Load another 4 bytes from strL to make sure main loop is 8-byte aligned > > Nit: You might want to update this code comment removing the word `another`. > `// Load another 4 bytes from strL to make sure main loop is 8-byte aligned` > => > `// Load 4 bytes from strL to make sure main loop is 8-byte aligned` Yes, fixed. 
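The alignment condition quoted above, `(base_offset1 % 8) != 0`, can be illustrated with a small Java sketch. It is only an illustration under stated assumptions: the class `AlignmentPrologue`, the method `bytesToAlign`, and the example offsets 12 and 16 are hypothetical and not taken from the patch, and the sketch assumes the object address itself is 8-byte aligned so that only the base offset decides the alignment of the first element.

```java
// Hypothetical illustration; not code from the patch under review.
class AlignmentPrologue {
    // Number of bytes to consume before (object address + baseOffset)
    // reaches an 8-byte boundary, assuming the object address is 8-byte aligned.
    static int bytesToAlign(int baseOffset) {
        return (8 - (baseOffset % 8)) % 8;
    }

    public static void main(String[] args) {
        // 12 and 16 are example offsets chosen for illustration only.
        for (int off : new int[] {12, 16}) {
            System.out.println("base offset " + off + ": prologue of "
                    + bytesToAlign(off) + " bytes");
        }
    }
}
```

Under these assumptions, a base offset with remainder 4 modulo 8 needs a 4-byte prologue load before the 8-byte main loop, which is the shape the quoted code comment describes.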
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23495#discussion_r1946444809 From fyang at openjdk.org Fri Feb 7 12:46:10 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 7 Feb 2025 12:46:10 GMT Subject: RFR: 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison [v2] In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 12:29:33 GMT, Hamlin Li wrote: >> Hi, >> >> Can you help to review the patch? >> >> It tries to improve the string compare when AvoidUnalignedAccesses == false && encoding is LU or UL (i.e. 2 strings encodings are different with each other). >> The jmh test shows when `-CompactObjectHeaders` (i.e. -COH) && `-AvoidUnalignedAccesses`, the patch bring much better performance, and in other cases, it does not bring obvious regression. And currently by default it's -COH. >> >> Thanks >> >> ### Performance >> >> it's run on bananapi. >> >> -COH-AvoidUnalignedAccesses >> >> ?-COH-Avoid? | (delta) | (size) | (utf16) | Mode | Cnt | Score - master | Score - patch | Error | Units | Improvement >> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 24 | N/A | avgt | 10 | 6438443.073 | 6383881.891 | 36912.539 | ns/op | 0.009 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 36 | N/A | avgt | 10 | 9421176.34 | 9390907.1 | 21034.266 | ns/op | 0.003 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 72 | N/A | avgt | 10 | 18592342.33 | 16871350.38 | 15550.827 | ns/op | 0.102 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 128 | N/A | avgt | 10 | 30916157.05 | 28646961.11 | 9263.556 | ns/op | 0.079 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 256 | N/A | avgt | 10 | 58945069.17 | 55505097.77 | 8803.847 | ns/op | 0.062 >> 
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 512 | N/A | avgt | 10 | 115520355.5 | 110233842.8 | 35056.972 | ns/op | 0.048 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 24 | N/A | avgt | 10 | 7541299.83 | 7481385.995 | 43240.713 | ns/op | 0.008 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 36 | N/A | avgt | 10 | 10295051.77 | 10264978.04 | 38938.956 | ns/op | 0.003 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 72 | N/A | avgt | 10 | 19652419.64 | 17953481.41 | 10987.17 | ns/op | 0.095 >> com.arm.benchmarks.... > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > refine comments Thanks for the update! ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23495#pullrequestreview-2601638866 From roland at openjdk.org Fri Feb 7 13:27:00 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 7 Feb 2025 13:27:00 GMT Subject: RFR: 8349479: C2: when a Type node becomes dead, make CFG path that uses it unreachable [v2] In-Reply-To: References: Message-ID: > This is primarily motivated by 8275202 (C2: optimize out more > redundant conditions). In the following code snippet: > > > int[] array = new int[arraySize]; > if (j <= arraySize) { > if (i >= 0) { > if (i < j) { > int v = array[i]; > > > (`arraySize` is a constant) > > at the range check, `j` is known to be in `[min, arraySize]` as a > consequence, `i` is known to be `[0, arraySize-1]`. The range check > can be eliminated. > > Now, if later, `i` constant folds to some value that's positive but > out of range for the array: > > - if that happens when the new pass runs, then it can prove that: > > if (i < j) { > > is never taken. > > - if that happens during IGVN or CCP however, that condition is not > constant folded. 
And because the range check was removed, there's no > guard protecting the range check `CastII`. It becomes `top` and, as > a result, the graph can become broken. > > What I propose here is that when the `CastII` becomes dead, any CFG > paths that use the `CastII` node are made unreachable. So in pseudo code: > > > int[] array = new int[arraySize]; > if (j <= arraySize) { > if (i >= 0) { > if (i < j) { > halt(); > > > Finding the CFG paths is implemented in the patch by following the > uses of the node until a CFG node or a `Phi` is encountered. > > The patch applies this to all `Type` nodes because, with 8275202, I also ran > into some rare corner cases with other types of nodes. The exception is > `Phi` nodes which may not be as easy to handle (and for which I had no > issue with 8275202). > > Finally, the patch includes a test case that's unrelated to the > discussion of 8275202 above. In that test case, a `CastII` becomes top > but the test that guards it doesn't constant fold. The root cause is a > transformation of: > > > (CastII (AddI > > > into > > > (AddI (CastII ) (CastII) > > > which causes the resulting node to have a wider type. The `CastII` > captures a type before the transformation above happens. Once it has > happened, the guard for the `CastII` can't be constant folded when an > out of bound value occurs. > > This is likely fixable some other way (even though it doesn't seem > straightforward). Given the long history of similar issues (and the > test case that shows that more are hiding), I think it would > make sense to try some other way of approaching them. 
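The source-level shape discussed in the description can be written out as a small runnable Java sketch. This is an illustration only: the class `RangeCheckShape`, the method `read`, and the constant value 8 are hypothetical names and values, and the sketch shows only why the three guards imply that the array access is in bounds, not how C2 represents or transforms the code.

```java
// Hypothetical illustration of the guarded array access; not a C2 test case.
class RangeCheckShape {
    static final int ARRAY_SIZE = 8; // assumed constant, chosen for illustration

    static int read(int i, int j) {
        int[] array = new int[ARRAY_SIZE];
        if (j <= ARRAY_SIZE) {       // j is at most ARRAY_SIZE
            if (i >= 0) {            // i is non-negative
                if (i < j) {         // hence 0 <= i <= ARRAY_SIZE - 1
                    return array[i]; // implicit range check is redundant here
                }
            }
        }
        return -1; // some guard failed, the array is not accessed
    }

    public static void main(String[] args) {
        System.out.println(read(3, 5));
        System.out.println(read(9, 5));
    }
}
```

If `i` is later found to be a constant such as 9, the access is unreachable at the source level because `i < j` cannot hold together with `j <= ARRAY_SIZE`; the description is about what happens when the compiler narrows the type of `i` without also folding that guard.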
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23468/files - new: https://git.openjdk.org/jdk/pull/23468/files/ad734844..73dd6d84 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23468&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23468&range=00-01 Stats: 22 lines in 2 files changed: 10 ins; 8 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23468.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23468/head:pull/23468 PR: https://git.openjdk.org/jdk/pull/23468 From roland at openjdk.org Fri Feb 7 13:27:00 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 7 Feb 2025 13:27:00 GMT Subject: RFR: 8349479: C2: when a Type node becomes dead, make CFG path that uses it unreachable [v2] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 11:44:35 GMT, Galder Zamarreño wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> review > > src/hotspot/share/opto/node.cpp line 3076: > >> 3074: assert(r->is_Region() || r->is_top(), "unexpected Phi's control"); >> 3075: if (r->is_Region()) { >> 3076: for (uint k = 1; k < u->req(); ++k) { > > `k` already defined as `DUIterator_Fast` earlier, can we choose a different name? Good catch. Fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23468#discussion_r1946518884 From roland at openjdk.org Fri Feb 7 13:28:15 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 7 Feb 2025 13:28:15 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v9] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 06:39:33 GMT, Tobias Hartmann wrote: > Ah, right, I see that you already mentioned that above. Should we then problem list the test with this change? Testing looks clean otherwise. 
https://github.com/openjdk/jdk/pull/23465 is a fix for JDK-8341976 and given it's much simpler than this change, I suppose it will get in first. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-2642917708 From mli at openjdk.org Fri Feb 7 13:30:11 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 7 Feb 2025 13:30:11 GMT Subject: RFR: 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison [v2] In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 12:43:17 GMT, Fei Yang wrote: > Thanks for the update! Thank you! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23495#issuecomment-2642927745 From jbhateja at openjdk.org Fri Feb 7 14:09:31 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 7 Feb 2025 14:09:31 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v24] In-Reply-To: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> Message-ID: > Patch promotes the sharing of commutative vector IR with the same inputs but different input ordering. > Similar to scalar IR where we perform edge swapping by [sorting inputs](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L122) based on node indices during IR idealization. > > Following are the performance stats for JMH micro included with the patch. 
> > > Granite Rapids (P-core Xeon Server) > Baseline : > Benchmark (size) Mode Cnt Score Error Units > VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 8982.549 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 6072.773 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2368.856 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 15215.087 ops/ms > > Withopt: > Benchmark (size) Mode Cnt Score Error Units > VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 11963.554 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 7036.088 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2906.731 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 17148.131 ops/ms > > Sierra Forest (E-core Xeon Server) > Baseline: > Benchmark (size) Mode Cnt Score Error Units > VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 2444.359 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 1710.256 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 308.766 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 3902.179 ops/ms > > Withopt: > Benchmark (size) Mode Cnt Score Error Units > VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 3352.839 ... 
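The idea of sharing commutative nodes by ordering their inputs can be sketched with a toy value-numbering table in Java. This is a minimal sketch under stated assumptions, not the C2 implementation: the class `CommutativeSharing`, the method `valueNumber`, the string keys, and the integer node ids are all hypothetical stand-ins for C2's node hashing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of value numbering; not HotSpot code.
class CommutativeSharing {
    private final Map<String, Integer> table = new HashMap<>();
    private int nextId = 100; // arbitrary starting id for newly created nodes

    // Returns the id of an existing equivalent node, or creates a new one.
    // For commutative operations the inputs are sorted by id first, so that
    // op(a, b) and op(b, a) map to the same table entry.
    int valueNumber(String op, int in1, int in2, boolean commutative) {
        if (commutative && in1 > in2) {
            int tmp = in1;
            in1 = in2;
            in2 = tmp;
        }
        String key = op + "(" + in1 + "," + in2 + ")";
        Integer existing = table.get(key);
        if (existing != null) {
            return existing;
        }
        int id = nextId++;
        table.put(key, id);
        return id;
    }

    public static void main(String[] args) {
        CommutativeSharing vn = new CommutativeSharing();
        int a = vn.valueNumber("AddVI", 1, 2, true);
        int b = vn.valueNumber("AddVI", 2, 1, true);
        System.out.println("shared: " + (a == b));
    }
}
```

Non-commutative operations keep their input order in this model, so swapping their inputs yields a distinct entry.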
Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review resolutions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22863/files - new: https://git.openjdk.org/jdk/pull/22863/files/f629a6f0..fd39a429 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22863&range=23 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22863&range=22-23 Stats: 8 lines in 2 files changed: 0 ins; 5 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/22863.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22863/head:pull/22863 PR: https://git.openjdk.org/jdk/pull/22863 From jbhateja at openjdk.org Fri Feb 7 14:13:15 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 7 Feb 2025 14:13:15 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v23] In-Reply-To: References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> Message-ID: On Wed, 5 Feb 2025 09:48:09 GMT, Emanuel Peter wrote: >> Jatin Bhateja has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: >> >> Lowering feature check to IR annotation level > > src/hotspot/share/opto/vectornode.cpp line 1058: > >> 1056: if (is_predicated_vector()) { >> 1057: return false; >> 1058: } > > Hmm. Can you give me a concrete example of a masked operation that would be filtered out? > > Can it for example be a `AddVI`? But that only has 2 inputs for `VEC1` and `VEC2`. Where would the mask be located - and why does that not get us to `req() > 3`? > > Ah, I see it can be added in `VectorNode::try_to_gen_masked_vector`, with `add_req`, but then we should have `req() > 3`. > > Ok, this looks a bit complicated, but it looks like we are doing this. 
> > // Generate a vector mask for vector operation whose vector length is lower than the > // hardware supported max vector length. > > > Ok, fine. > > It could be good to add a comment here though, explaining why the operation seemingly has 3 inputs, but we don't exit at `req() != 3` above. Hi @eme64, Since we are selectively enabling commoning based on opcode checks, skipping over their predicated counterparts should suffice. Please find below the code reference that adds mask operands if the target supports predicated instructions. https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L495 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1946586266 From dlunden at openjdk.org Fri Feb 7 14:38:15 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Fri, 7 Feb 2025 14:38:15 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v13] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 07:44:22 GMT, Christian Hagedorn wrote: >> Interestingly, this is the line in `cfgnode.cpp` that blocks the MergeMem/Phi swap idealization: >> >> // This restriction is temporarily necessary to ensure termination: >> if (!saw_self && adr_type() == TypePtr::BOTTOM) merge_width = 0; >> >> If I comment out the line it solves all the failures we have seen. I double-checked that we then perform the exact MergeMem/Phi swap idealizations discussed above. >> >> I am wondering what the proper solution is here. I will, of course, investigate if it is possible to loosen the restriction and still ensure termination. On the other hand, it also seems strange that the anti-dependence search is so sensitive to missing idealizations? > > Thanks for having yet another look at this! > >> If I comment out the line it solves all the failures we have seen. I double-checked that we then perform the exact 
I double-checked that we then perform the exact > MergeMem/Phi swap idealizations discussed above. > > That sounds promising! Looks like this temporary restriction became quite permanent - it's from initial load. I'm wondering if that is still necessary and if so if we have tests to catch that (we would probably hit the "infinite loop in IGVN" in that case). > >> I am wondering what the proper solution is here. I will, of course, investigate if it is possible to loosen the restriction and still ensure termination. On the other hand, it also seems strange that the anti-dependence search is so sensitive to missing idealizations? > > That would be great if we can get around this termination issue somehow - if it's still a problem. I think that is very unfortunate that we might be relying on this Ideal transformation to be applied to ensure correctness later on. If it's really required, we should at least make sure to add some verification code to catch this in debug builds. You could, for example, just turn what you have now into verification code, i.e. check that we cannot find another anti dependency edge with another search root. And/Or re-apply this particular transformation for each Phi node again in the end to see if we missed some swaps. I have now investigated the `PhiNode::Ideal` restriction above. In summary, I have not found any simple change that resolves the present issue *and* does not introduce problems elsewhere. Here is what I have tried. Both changes solve the present issue. 1. Remove the restriction entirely. As the source code comment suggests, this results in non-termination (which is very easy to verify). The reason is that memory Phis are (naturally) often circular, and there are plenty of cases where we push MergeMems indefinitely across circular Phis. 2. Only apply the idealization if we can guarantee that we are not pushing MergeMems over Phis in a circular manner. 
I check this through a complete upwards walk of the memory graph from the current Phi to ensure we cannot reach it from itself. This is likely quite expensive and we can probably do something more clever. It kind of works, but there are still tests that fail. Even if we now do terminate, I suspect we still have a combinatorial explosion of new split Phi nodes in certain cases, because we hit the MemLimit in many of the failing tests. I can and probably will continue to investigate option 2, but it feels like that should be a separate RFE. I'm open to suggestions. > You could, for example, just turn what you have now into verification code, i.e. check that we cannot find another anti dependency edge with another search root. @chhagedorn Yes, I agree. If we want to enforce a memory graph invariant at the time of `insert_anti_dependences`, we should also assert that it holds as best we can. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22852#discussion_r1946628758 From dlunden at openjdk.org Fri Feb 7 14:43:15 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Fri, 7 Feb 2025 14:43:15 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v8] In-Reply-To: <6l8orDGDTI-ADWxEmDjMPX1uorIhxLd3T55s0eIzJ3I=.0cb9d2c8-4302-408f-b64e-dc9a8e3d4145@github.com> References: <6l8orDGDTI-ADWxEmDjMPX1uorIhxLd3T55s0eIzJ3I=.0cb9d2c8-4302-408f-b64e-dc9a8e3d4145@github.com> Message-ID: On Fri, 31 Jan 2025 15:33:29 GMT, Daniel Lundén wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. 
Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). 
I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Fix trailing whitespace Keep alive ------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-2643125387 From rcastanedalo at openjdk.org Fri Feb 7 14:48:51 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 7 Feb 2025 14:48:51 GMT Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v6] In-Reply-To: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> Message-ID: <2jrzusvVl-XI8K734YlChq4ObRX75yovTq7mWTf8ZlA=.0e75781a-5d52-4919-ad28-c5e91ec3a47f@github.com> > G1 barriers can be safely elided from writes to newly allocated objects as long as no safepoint is taken between the allocation and the write. This changeset complements early G1 barrier elision (performed by the platform-independent phases of C2, and limited to writes immediately following allocations) with a more general elision pass done at a late stage. > > The late elision pass exploits that it runs at a stage where the relative order of memory accesses and safepoints cannot change anymore to elide barriers from initialization writes that do not immediately follow the corresponding allocation, e.g. in conditional initialization writes: > > > o = new MyObject(); > if (...) 
{ > throw new Exception(""); > } > o.myField = ...; // barrier elided only after this changeset > // (assuming no safepoint in the above if condition) > > > These patterns are commonly found in Java code, e.g. in the core libraries: > > - [conditional initialization](https://github.com/openjdk/jdk/blob/25fecaaf87400af535c242fe50296f1f89ceeb16/src/java.base/share/classes/java/lang/String.java#L4850), or > > - [initialization after exception-throwing checks (in the superclass constructor)](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/nio/X-Buffer.java.template#L324). > > The optimization also enhances barrier elision for array initialization writes, for example eliding barriers from small array initialization loops (for which safepoints are not inserted): > > > Object[] a = new Object[...]; > for (int i = 0; i < a.length; i++) { > a[i] = ...; // barrier elided only after this changeset > } > > > or eliding barriers from array initialization writes with unknown array index: > > > Object[] a = new Object[...]; > a[index] = ...; // barrier elided only after this changeset > > > The logic used to perform this additional barrier elision is a subset of a pre-existing ZGC-specific optimization. This changeset simply reuses the relevant subset (barrier elision for writes to newly-allocated objects) by extracting the core of the optimization logic from `zBarrierSetC2.cpp` into the GC-shared file `barrierSetC2.cpp`. The functions `block_has_safepoint`, `block_index`, `look_through_node`, `is_{undefined|unknown|concrete}`, `get_base_and_offset`, `is_array... 
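The rule quoted above (a write to a newly allocated object needs no G1 barrier as long as no safepoint occurs between the allocation and the write) can be illustrated with a toy model. The sketch below is not the analysis in `barrierSetC2.cpp`, which works on C2's control-flow graph; it only replays a straight-line trace of operations. The names `Kind`, `Op` and `elidable_stores` are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical event kinds in a straight-line trace of operations.
enum class Kind { Alloc, Safepoint, Store };

struct Op {
  Kind kind;
  int object;  // object id for Alloc/Store; ignored for Safepoint
};

// For each operation, report whether it is a store whose barrier could be
// elided under the simplified rule: the target object was allocated after
// the most recent safepoint in the trace.
inline std::vector<bool> elidable_stores(const std::vector<Op>& trace) {
  std::set<int> fresh;  // objects allocated since the last safepoint
  std::vector<bool> result(trace.size(), false);
  for (std::size_t i = 0; i < trace.size(); ++i) {
    const Op& op = trace[i];
    switch (op.kind) {
      case Kind::Alloc:     fresh.insert(op.object); break;
      case Kind::Safepoint: fresh.clear(); break;
      case Kind::Store:     result[i] = fresh.count(op.object) != 0; break;
    }
  }
  return result;
}
```

In this model a store that directly follows its allocation is marked elidable, while a store to the same object after an intervening safepoint, or a store to an object never allocated in the trace, is not.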
Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: Disable test IR checks for cases where barrier elision analysis fails to elide on s390 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23235/files - new: https://git.openjdk.org/jdk/pull/23235/files/3671f474..956e0ac5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23235&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23235&range=04-05 Stats: 9 lines in 1 file changed: 9 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23235.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23235/head:pull/23235 PR: https://git.openjdk.org/jdk/pull/23235 From rcastanedalo at openjdk.org Fri Feb 7 14:55:13 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 7 Feb 2025 14:55:13 GMT Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v4] In-Reply-To: References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> <-aHCYC9iVc4eMZ3pMfiDpqaW-wGM_s3zRMiVBWoadCM=.910336cd-3be2-45b5-9874-63b71abf38f8@github.com> Message-ID: On Fri, 7 Feb 2025 09:21:39 GMT, Roberto Castañeda Lozano wrote: >> I see TestG1BarrierGeneration.java failure :( >> >> [TestG1BarrierGeneration_jtr.log](https://github.com/user-attachments/files/18676532/TestG1BarrierGeneration_jtr.log) > >> I see TestG1BarrierGeneration.java failure :( >> >> [TestG1BarrierGeneration_jtr.log](https://github.com/user-attachments/files/18676532/TestG1BarrierGeneration_jtr.log) > > @offamitkumar thanks for the report! Most likely the test failures are only due to missing optimizations (because of limitations in the barrier elision pattern matching analysis), but if you want me to confirm please send the entire jtreg log, without truncation. 
You can disable output truncation running the test like this: > `make run-test TEST="compiler/gcbarriers/TestG1BarrierGeneration.java" JTREG="MAX_OUTPUT=999999999"` > Please double-check that the output log file does not contain any `Output overflow` message. > @robcasloz Sure: > > I can spend time on it, maybe on weekend, for now I am overloaded with some other tasks. > > [TestG1BarrierGeneration_jtr_no_overflow.log](https://github.com/user-attachments/files/18706090/TestG1BarrierGeneration_jtr_no_overflow.log) Thanks Amit, I had a look and the failures are indeed due to missing barrier elisions for atomic operations on newly created objects, which is suboptimal but safe (and in practice unlikely to make a noticeable performance difference). I just disabled IR checks for the two affected tests on s390 by now (commit 956e0ac5). The issue is likely due to limitations in the pattern matching logic of barrier elision, but I do not have the proper means to debug it on s390. If you find a solution before this changeset is fully reviewed, feel free to propose a patch and I will merge it into the changeset. Otherwise, it can always be done as follow-up work. Hope this works for you! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23235#issuecomment-2643162531 From mli at openjdk.org Fri Feb 7 15:02:47 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 7 Feb 2025 15:02:47 GMT Subject: RFR: 8349666: RISC-V: enable superwords tests for vector reductions Message-ID: Hi, Can you help to review this patch? On riscv, some vector reduction intrinsics were already implemented, but they are not verified indeed. This patch is to enable these tests on riscv. 
Thanks ------------- Commit messages: - initial commit Changes: https://git.openjdk.org/jdk/pull/23518/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23518&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349666 Stats: 73 lines in 9 files changed: 64 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23518.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23518/head:pull/23518 PR: https://git.openjdk.org/jdk/pull/23518 From rcastanedalo at openjdk.org Fri Feb 7 15:41:21 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 7 Feb 2025 15:41:21 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v13] In-Reply-To: References: Message-ID: <0VH5QcK_dbjMV_cVX7KBm8VljD0YWtJA19ERC6yliLo=.8ef6aa05-d6c6-451f-93b2-814262b1af99@github.com> On Fri, 7 Feb 2025 14:36:01 GMT, Daniel Lundén wrote: > I have now investigated the PhiNode::Ideal restriction above. In summary, I have not found any simple change that resolves the present issue and does not introduce problems elsewhere. Great investigation, Daniel! Thanks for taking the effort to investigate this issue thoroughly. > I think that is very unfortunate that we might be relying on this Ideal transformation to be applied to ensure correctness later on. I agree, there are probably many more examples of hidden dependencies on idealization for correctness in C2, but it would be good to avoid introducing more of these if possible. @dlunde do we know at which point in the compilation chain the disjoint memory state invariant (that the above idealization restores) is broken? Would it be possible to do some analysis at that point to "simply" avoid producing the problematic memory subgraph in the first place? > @chhagedorn Yes, I agree. If we want to enforce a memory graph invariant at the time of insert_anti_dependences, we should also assert that it holds as best we can. 
I wonder if the core invariant we want to assert here is that two memory states with aliasing slices never overlap in time, after GCM and LCM are done. This could be checked by performing liveness analysis of the memory subgraph after GCM and LCM. This may sound expensive to compute but it could turn out to be acceptable in practice (for debug builds). The similarly expensive register-level liveness analysis in `PhaseOutput::perform_mach_node_analysis` takes no more than 1-2% of the entire C2 execution time, on average. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22852#discussion_r1946727644 From epeter at openjdk.org Fri Feb 7 16:39:27 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Feb 2025 16:39:27 GMT Subject: RFR: 8341293: Split field loads through Nested Phis [v7] In-Reply-To: <-Fs-Nim4P8TQMnjE9bs2HBY34vQtzhzH2dsU7MDlZrI=.34991658-bed4-46ac-b213-f4988c0f9c8b@github.com> References: <18TQt6vxN9KxSVwyeQtAWde-ezaVuUEioAl_5_3sAeE=.e5e76fb6-04a7-4f6f-9377-f1e64837ada6@github.com> <-Fs-Nim4P8TQMnjE9bs2HBY34vQtzhzH2dsU7MDlZrI=.34991658-bed4-46ac-b213-f4988c0f9c8b@github.com> Message-ID: On Wed, 5 Feb 2025 17:31:43 GMT, Dhamoder Nalla wrote: >> @dhanalla Would you like this to be reviewed? We generally don't re-review until we get pinged again. The idea is that you are maybe still working on it, and so there is no point in reviewing half-processed code. So once you are happy, you can let us know ;) > >> @dhanalla Would you like this to be reviewed? We generally don't re-review until we get pinged again. The idea is that you are maybe still working on it, and so there is no point in reviewing half-processed code. So once you are happy, you can let us know ;) > Thanks, @eme64 for checking with me. Yes, it's ready for review. 
@dhanalla Testing failed for this test: `compiler/c2/irTests/scalarReplacement/AllocationMergesNestedPhiTests.java` With flags: - `-server -Xcomp` - `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation` - `-XX:-TieredCompilation -XX:+AlwaysIncrementalInline` - `-XX:-TieredCompilation -XX:+StressReflectiveCode -XX:-ReduceInitialCardMarks -XX:-ReduceBulkZeroing -XX:-ReduceFieldZeroing` We also have an internal test that is failing with the same assert: `# assert(jobj != nullptr && jobj != phantom_obj) failed: escaped allocation` ------------- PR Comment: https://git.openjdk.org/jdk/pull/21270#issuecomment-2643428408 From epeter at openjdk.org Fri Feb 7 16:40:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Feb 2025 16:40:15 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: <38AZvEN6jWtzUKAm6eRqJwarn31L2bZYw4-MTClOMaQ=.828db7e7-115f-4856-afbf-6c00bbc34224@github.com> References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> <38AZvEN6jWtzUKAm6eRqJwarn31L2bZYw4-MTClOMaQ=.828db7e7-115f-4856-afbf-6c00bbc34224@github.com> Message-ID: On Thu, 6 Feb 2025 19:12:26 GMT, Quan Anh Mai wrote: >> Looks good, thanks for the explanations! >> >> I see we did not yet run internal tests for the last commit, though it is only formatting, so most most likely ok. >> >> But the state of the code is also 2 weeks old, so it would be good if you merged and launched testing again before integration, just in case ;) > > @eme64 I have merged the change with master, could you help me initiate the testing process, please? Thanks very much. @merykitty Testing is all passing! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23274#issuecomment-2643431552 From vlivanov at openjdk.org Fri Feb 7 17:25:22 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 7 Feb 2025 17:25:22 GMT Subject: RFR: 8346836: C2: Introduce a way to verify the correctness of ConstraintCastNodes at runtime [v5] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 19:17:03 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch adds a develop flag `VerifyConstraintCasts`, which will verify the correctness of `CastIINode`s and `CastLLNode`s at runtime and crash the VM if the dynamic value lies outside the type value range. >> >> Please take a look, thanks a lot. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: > > - Merge branch 'master' into verifycast > - better comments > - move test to a new file, add block_comment > - add tests > - make VerifyConstraintCast uint, better debug info > - Merge branch 'master' into verifycast > - Introduce VerifyConstraintCasts Very nice! src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 840: > 838: > 839: #ifdef ASSERT > 840: void C2_MacroAssembler::checked_cast_int(const TypeInt* type, Register dst) { Naming is a bit confusing here. It is a register which holds the value being range checked, not a register where new value is put. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 844: > 842: Label fail; > 843: Label succeed; > 844: cmpl(dst, type->_lo); Optimization idea: some range checks may be redundant (when `lo`/`hi` hold min/max values). src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 852: > 850: movl(rcx, type->_lo); > 851: movl(rdx, type->_hi); > 852: hlt(); // hlt so we have the stack trace That's interesting. Sounds like a problem in `NativeStackPrinter::print_stack()`. 
Speaking of debugging output, a call into a local helper function (encapsulating pretty printing logic) followed by a hlt call will do the job. But, considering the usages are in-line and quite common I suggest to make it conditional (guarded by a flag). It is possible to recover all 3 values from generated code if needed and turn on error reporting (specify the diagnostic flag) when reproducing failures. src/hotspot/cpu/x86/x86_64.ad line 7029: > 7027: %} > 7028: > 7029: instruct castLL_checked(rRegL dst, rRegL tmp, rFlagsReg cr) Optimization idea: considering the range is statically known, `tmp` is not needed when range boundaries fit into signed int. Worth adding an extra AD instruction to benefit from that. src/hotspot/share/opto/c2_globals.hpp line 666: > 664: "perform extra checks on the results of alias analysis") \ > 665: \ > 666: develop(uint, VerifyConstraintCasts, 0, \ Any downsides in making the flag diagnostic? It'll make it available in product builds. ------------- PR Review: https://git.openjdk.org/jdk/pull/22880#pullrequestreview-2602312430 PR Review Comment: https://git.openjdk.org/jdk/pull/22880#discussion_r1946868607 PR Review Comment: https://git.openjdk.org/jdk/pull/22880#discussion_r1946871722 PR Review Comment: https://git.openjdk.org/jdk/pull/22880#discussion_r1946882298 PR Review Comment: https://git.openjdk.org/jdk/pull/22880#discussion_r1946887184 PR Review Comment: https://git.openjdk.org/jdk/pull/22880#discussion_r1946881494 From bulasevich at openjdk.org Fri Feb 7 18:33:23 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 7 Feb 2025 18:33:23 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: On Mon, 3 Feb 2025 14:16:41 GMT, Andrew Dinn wrote: >> src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1422: >> >>> 1420: bool force_movk = true; // movk is important if 
the target can be more than 4GB away >>> 1421: adrp(dest, const_addr, offset, force_movk); >>> 1422: ldr(dest, Address(dest, offset)); >> >> I wonder if this really is the best way to do it. It's not clear to me that there is any advantage of using `adrp` in this case rather than a simple `mov(scratch, const_adr); ldr(dest, Address(scratch));`. The `mov` would produce `movz; movk; movk` which almost certainly execute in a single cycle, then a load without an offset, which is a single micro-op rather than two micro-ops for load+offset. All we've gained for this complication is a small reduction in code density rather than a performance improvement. I'd go with simplicity. > > Yes, I agree. > It's not clear to me that there is any advantage of using adrp I think ADRP+MOVK is better both in terms of performance and code density.
Simple asm experiment shows that ADRP+MOVK performs better on my machine

$ gcc test.cpp && ./a.out
Allocated at: 0x10ffff0000
Elapsed time (movz-movk-movk): 3145591
Elapsed time (adrp-movk): 2739354
============================================================
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main() {
  void *desired_addr = (void *)0x10ffff0000;
  size_t size = 4096;
  int32_t* data = (int32_t*)mmap(desired_addr, size, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
  if (data == MAP_FAILED) {
    perror("mmap failed");
    return 1;
  }
  data[0] = 13;
  data[1] = 14;
  data[2] = 15;
  data[3] = 16;
  printf("Allocated at: %p\n", data);

  int32_t* ptr = (int32_t*)&main;
  int aa = 1;
  int bb = 2;
  int cc = 3;
  int dd = 4;

  clock_t start;

  // Variant 1: materialize the full address with movz + movk + movk, then load.
  start = clock();
  for (int i=0; i<1000*1000*1000; i++) {
    asm (
      "movz %0, 0x10, lsl #32; movk %0, 0xffff, lsl #16; movk %0, 0; ldr %0, [%0]; "
      "movz %1, 0x10, lsl #32; movk %1, 0xffff, lsl #16; movk %1, 4; ldr %1, [%1]; "
      "movz %2, 0x10, lsl #32; movk %2, 0xffff, lsl #16; movk %2, 8; ldr %2, [%2]; "
      "movz %3, 0x10, lsl #32; movk %3, 0xffff, lsl #16; movk %3, 12;ldr %3, [%3]; "
      : "=r" (aa), "=r" (bb), "=r" (cc), "=r" (dd) /* Output operands */
      :
      : "cc");
  }
  printf("Elapsed time (movz-movk-movk): %li\n", (clock() - start));

  // Variant 2: adrp + movk, then load with an immediate offset.
  start = clock();
  for (int i=0; i<1000*1000*1000; i++) {
    asm (
      "adrp %0, main; movk %0, 0, lsl #32; ldr %0, [%0, 0x0];"
      "adrp %1, main; movk %1, 0, lsl #32; ldr %1, [%1, 0x4];"
      "adrp %2, main; movk %2, 0, lsl #32; ldr %2, [%2, 0x8];"
      "adrp %3, main; movk %3, 0, lsl #32; ldr %3, [%3, 0xc];"
      : "=r" (aa), "=r" (bb), "=r" (cc), "=r" (dd) /* Output operands */
      :
      : "cc");
  }
  printf("Elapsed time (adrp-movk): %li\n", (clock() - start));

  munmap(data, size);
  return 0;
}
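For readers who want to see why both sequences end up with the same address, here is a rough C++ model of the two materialization schemes on plain 64-bit integers. It is only a sketch under stated assumptions: addresses and the PC fit in 48 bits, the helpers `movz`, `movk`, `adrp`, `address_via_mov` and `address_via_adrp` are made-up names that mimic the instructions, and the way `address_via_adrp` keeps `adrp` in range (borrowing bits 32..47 from the PC and patching them afterwards with `movk`) is my reading of what `force_movk` is for, not a copy of the HotSpot code.

```cpp
#include <cassert>
#include <cstdint>

// movz: place a 16-bit immediate at `shift` and zero every other bit.
inline uint64_t movz(uint64_t imm16, unsigned shift) {
  return (imm16 & 0xffff) << shift;
}

// movk: overwrite the 16 bits at `shift`, keeping the rest of the register.
inline uint64_t movk(uint64_t reg, uint64_t imm16, unsigned shift) {
  const uint64_t field = uint64_t{0xffff} << shift;
  return (reg & ~field) | ((imm16 & 0xffff) << shift);
}

// adrp: yields the 4 KiB page address of a target whose page lies within
// +/-2^20 pages (4 GiB) of the page containing the PC.
inline uint64_t adrp(uint64_t pc, uint64_t near_target) {
  const int64_t delta = static_cast<int64_t>(near_target >> 12) -
                        static_cast<int64_t>(pc >> 12);
  assert(delta >= -(int64_t{1} << 20) && delta < (int64_t{1} << 20));
  return near_target & ~uint64_t{0xfff};
}

// Sequence 1: movz + movk + movk builds the whole 48-bit address,
// so the following load needs no offset.
inline uint64_t address_via_mov(uint64_t target) {
  uint64_t reg = movz(target >> 32, 32);
  reg = movk(reg, target >> 16, 16);
  reg = movk(reg, target, 0);
  return reg;
}

// Sequence 2: adrp + movk builds the page address; the low 12 bits are
// carried by the immediate offset of the load.
inline uint64_t address_via_adrp(uint64_t pc, uint64_t target) {
  // Assumed trick: borrow bits 32..47 from the PC so adrp stays in range,
  // then patch those bits with the real ones using movk.
  const uint64_t near_target = (target & 0xffffffffULL) | (pc & 0xffff00000000ULL);
  uint64_t reg = adrp(pc, near_target);
  reg = movk(reg, target >> 32, 32);
  const uint64_t offset = target & 0xfff;  // immediate of ldr [reg, #offset]
  return reg + offset;                     // effective address of the load
}
```

Both helpers return the effective address the final `ldr` would use, so for any 48-bit target they should agree; the second one needs one instruction less before the load, which is the trade-off being measured above.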
The results are consistent with llvm-mca analysis:
llvm-mca analysis suggests that ADRP+MOVK performs better on Neoverse-N1

Neoverse-N1. ADRP-MOVK wins over MOVZ-MOVK-MOVK
- Fewer instructions (300 vs 400)
- Fewer total cycles (75 vs 109) - Faster execution
- Lower instruction block throughput (0.7 vs 1.0) - More efficient execution
- Less resource pressure - Less risk of pipeline stalls

============================================================
$ ../clang+llvm-18.1.0-aarch64-linux-gnu/bin/llvm-mca -mcpu=neoverse-n1 asm1.S
Iterations:        100
Instructions:      300
Total Cycles:      75
Total uOps:        300

Dispatch Width:    8
uOps Per Cycle:    4.00
IPC:               4.00
Block RThroughput: 0.7

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        adrp x0, target
 1      1     0.33                        movk x0, #39612, lsl #32
 1      4     0.50    *                   ldr x1, [x0]

Resources:
[0]   - N1UnitB
[1.0] - N1UnitD
[1.1] - N1UnitD
[2.0] - N1UnitL
[2.1] - N1UnitL
[3]   - N1UnitM
[4.0] - N1UnitS
[4.1] - N1UnitS
[5]   - N1UnitV0
[6]   - N1UnitV1

Resource pressure per iteration:
[0]    [1.0]  [1.1]  [2.0]  [2.1]  [3]    [4.0]  [4.1]  [5]    [6]
 -      -      -     0.50   0.50   0.66   0.67   0.67    -      -

Resource pressure by instruction:
[0]    [1.0]  [1.1]  [2.0]  [2.1]  [3]    [4.0]  [4.1]  [5]    [6]    Instructions:
 -      -      -      -      -     0.33   0.33   0.34    -      -     adrp x0, target
 -      -      -      -      -     0.33   0.34   0.33    -      -     movk x0, #39612, lsl #32
 -      -      -     0.50   0.50    -      -      -      -      -     ldr x1, [x0]

============================================================
$ ../clang+llvm-18.1.0-aarch64-linux-gnu/bin/llvm-mca -mcpu=neoverse-n1 asm2.S
Iterations:        100
Instructions:      400
Total Cycles:      109
Total uOps:        400

Dispatch Width:    8
uOps Per Cycle:    3.67
IPC:               3.67
Block RThroughput: 1.0

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        mov x0, #20014547599360
 1      1     0.33                        movk x0, #22136, lsl #16
 1      1     0.33                        movk x0, #39612
 1      4     0.50    *                   ldr x1, [x0]

Resources:
[0]   - N1UnitB
[1.0] - N1UnitD
[1.1] - N1UnitD
[2.0] - N1UnitL
[2.1] - N1UnitL
[3]   - N1UnitM
[4.0] - N1UnitS
[4.1] - N1UnitS
[5]   - N1UnitV0
[6]   - N1UnitV1

Resource pressure per iteration:
[0]    [1.0]  [1.1]  [2.0]  [2.1]  [3]    [4.0]  [4.1]  [5]    [6]
 -      -      -     0.50   0.50   0.99   1.00   1.01    -      -

Resource pressure by instruction:
[0]    [1.0]  [1.1]  [2.0]  [2.1]  [3]    [4.0]  [4.1]  [5]    [6]    Instructions:
 -      -      -      -      -      -     0.66   0.34    -      -     mov x0, #20014547599360
 -      -      -      -      -     0.33   0.34   0.33    -      -     movk x0, #22136, lsl #16
 -      -      -      -      -     0.66    -     0.34    -      -     movk x0, #39612
 -      -      -     0.50   0.50    -      -      -      -      -     ldr x1, [x0]
llvm-mca analysis suggests that ADRP+MOVK performs better on Neoverse-V2
- Fewer instructions (300 vs 400)
- Fewer total cycles (42 vs 59) - Faster execution
- Lower instruction block throughput (0.3 vs 0.5) - More efficient execution
- Less resource pressure - Less risk of pipeline stalls

============================================================
$ ../clang+llvm-18.1.0-aarch64-linux-gnu/bin/llvm-mca -mcpu=neoverse-v2 asm1.S
Iterations:        100
Instructions:      300
Total Cycles:      42
Total uOps:        300

Dispatch Width:    16
uOps Per Cycle:    7.14
IPC:               7.14
Block RThroughput: 0.3

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        adrp x0, target
 1      1     0.17                        movk x0, #39612, lsl #32
 1      4     0.33    *                   ldr x1, [x0]

Resources:
[0.0] - V2UnitB
[0.1] - V2UnitB
[1.0] - V2UnitD
[1.1] - V2UnitD
[2]   - V2UnitL2
[3.0] - V2UnitL01
[3.1] - V2UnitL01
[4]   - V2UnitM0
[5]   - V2UnitM1
[6]   - V2UnitS0
[7]   - V2UnitS1
[8]   - V2UnitS2
[9]   - V2UnitS3
[10]  - V2UnitV0
[11]  - V2UnitV1
[12]  - V2UnitV2
[13]  - V2UnitV3

Resource pressure per iteration:
[0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
 -      -      -      -     0.33   0.33   0.34   0.33   0.33   0.34   0.34   0.33   0.33    -      -      -      -

Resource pressure by instruction:
[0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
 -      -      -      -      -      -      -     0.33   0.33   0.17   0.17    -      -      -      -      -      -     adrp x0, target
 -      -      -      -      -      -      -      -      -     0.17   0.17   0.33   0.33    -      -      -      -     movk x0, #39612, lsl #32
 -      -      -      -     0.33   0.33   0.34    -      -      -      -      -      -      -      -      -      -     ldr x1, [x0]

============================================================
$ ../clang+llvm-18.1.0-aarch64-linux-gnu/bin/llvm-mca -mcpu=neoverse-v2 asm2.S
Iterations:        100
Instructions:      400
Total Cycles:      59
Total uOps:        400

Dispatch Width:    16
uOps Per Cycle:    6.78
IPC:               6.78
Block RThroughput: 0.5

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.17                        mov x0, #20014547599360
 1      1     0.17                        movk x0, #22136, lsl #16
 1      1     0.17                        movk x0, #39612
 1      4     0.33    *                   ldr x1, [x0]

Resources:
[0.0] - V2UnitB
[0.1] - V2UnitB
[1.0] - V2UnitD
[1.1] - V2UnitD
[2]   - V2UnitL2
[3.0] - V2UnitL01
[3.1] - V2UnitL01
[4]   - V2UnitM0
[5]   - V2UnitM1
[6]   - V2UnitS0
[7]   - V2UnitS1
[8]   - V2UnitS2
[9]   - V2UnitS3
[10]  - V2UnitV0
[11]  - V2UnitV1
[12]  - V2UnitV2
[13]  - V2UnitV3

Resource pressure per iteration:
[0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
 -      -      -      -     0.33   0.33   0.34   0.50   0.50   0.50   0.50   0.50   0.50    -      -      -      -

Resource pressure by instruction:
[0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
 -      -      -      -      -      -      -      -      -     0.33   0.33   0.17   0.17    -      -      -      -     mov x0, #20014547599360
 -      -      -      -      -      -      -     0.17   0.17   0.16   0.16   0.17   0.17    -      -      -      -     movk x0, #22136, lsl #16
 -      -      -      -      -      -      -     0.33   0.33   0.01   0.01   0.16   0.16    -      -      -      -     movk x0, #39612
 -      -      -      -     0.33   0.33   0.34    -      -      -      -      -      -      -      -      -      -     ldr x1, [x0]
============================================================
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1946989009 From never at openjdk.org Fri Feb 7 20:01:24 2025 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 7 Feb 2025 20:01:24 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash [v3] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 20:56:53 GMT, Tom Rodriguez wrote: >> This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. > > Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision: > > improve comments Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23444#issuecomment-2643985603 From never at openjdk.org Fri Feb 7 20:01:25 2025 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 7 Feb 2025 20:01:25 GMT Subject: Integrated: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 16:31:50 GMT, Tom Rodriguez wrote: > This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. This pull request has now been integrated. 
Changeset: 7f6c6878 Author: Tom Rodriguez URL: https://git.openjdk.org/jdk/commit/7f6c687815031d99931265007ff8867bf964cb25 Stats: 14 lines in 1 file changed: 9 ins; 0 del; 5 mod 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash Reviewed-by: kvn, dnsimon ------------- PR: https://git.openjdk.org/jdk/pull/23444 From iklam at openjdk.org Fri Feb 7 20:13:11 2025 From: iklam at openjdk.org (Ioi Lam) Date: Fri, 7 Feb 2025 20:13:11 GMT Subject: RFR: 8349559: Compiler interface doesn't need to store protection domain In-Reply-To: <6rvXKu5uOti_qzs2SZ35wJy1Qq7XM7DY9lotXynoF9Y=.3ca45b44-d85b-4eff-a38f-83ff88a363b9@github.com> References: <6rvXKu5uOti_qzs2SZ35wJy1Qq7XM7DY9lotXynoF9Y=.3ca45b44-d85b-4eff-a38f-83ff88a363b9@github.com> Message-ID: On Thu, 6 Feb 2025 17:14:44 GMT, Coleen Phillimore wrote: > The compiler interface has a protection_domain field that it uses for matching in its version of not-yet loaded classes, but class loading only uses (class, class-loader) as an identifier for loaded classes so the compiler interface should do the same. From the code, I can't see any situation where the protection_domain wouldn't match if name and class loader match. Actually I think the code was for this case: if you match (name, classLoader) with a calling class with the same classLoader and a different protectionDomain, the code should not match the class without going through the SystemDictionary to call checkPackageAccess() for the second protectionDomain. Since checkPackageAccess is now removed with the security manager, this extra lookup is now unnecessary and we match just class name, classLoader pairs. > > Tested with tier1-7. Looks fine to me. I don't know why the original code was caching the PD. name+loader is already unique. You can't have two accessing_klasses with the same name+loader but different PDs. ------------- Marked as reviewed by iklam (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23496#pullrequestreview-2602736996 From coleenp at openjdk.org Fri Feb 7 21:32:13 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 7 Feb 2025 21:32:13 GMT Subject: RFR: 8349559: Compiler interface doesn't need to store protection domain In-Reply-To: <6rvXKu5uOti_qzs2SZ35wJy1Qq7XM7DY9lotXynoF9Y=.3ca45b44-d85b-4eff-a38f-83ff88a363b9@github.com> References: <6rvXKu5uOti_qzs2SZ35wJy1Qq7XM7DY9lotXynoF9Y=.3ca45b44-d85b-4eff-a38f-83ff88a363b9@github.com> Message-ID: On Thu, 6 Feb 2025 17:14:44 GMT, Coleen Phillimore wrote: > The compiler interface has a protection_domain field that it uses for matching in its version of not-yet loaded classes, but class loading only uses (class, class-loader) as an identifier for loaded classes so the compiler interface should do the same. From the code, I can't see any situation where the protection_domain wouldn't match if name and class loader match. Actually I think the code was for this case: if you match (name, classLoader) with a calling class with the same classLoader and a different protectionDomain, the code should not match the class without going through the SystemDictionary to call checkPackageAccess() for the second protectionDomain. Since checkPackageAccess is now removed with the security manager, this extra lookup is now unnecessary and we match just class name, classLoader pairs. > > Tested with tier1-7. Thanks Vladimir and Ioi for reviewing. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23496#issuecomment-2644159403 From coleenp at openjdk.org Fri Feb 7 21:32:14 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 7 Feb 2025 21:32:14 GMT Subject: Integrated: 8349559: Compiler interface doesn't need to store protection domain In-Reply-To: <6rvXKu5uOti_qzs2SZ35wJy1Qq7XM7DY9lotXynoF9Y=.3ca45b44-d85b-4eff-a38f-83ff88a363b9@github.com> References: <6rvXKu5uOti_qzs2SZ35wJy1Qq7XM7DY9lotXynoF9Y=.3ca45b44-d85b-4eff-a38f-83ff88a363b9@github.com> Message-ID: On Thu, 6 Feb 2025 17:14:44 GMT, Coleen Phillimore wrote: > The compiler interface has a protection_domain field that it uses for matching in its version of not-yet loaded classes, but class loading only uses (class, class-loader) as an identifier for loaded classes so the compiler interface should do the same. From the code, I can't see any situation where the protection_domain wouldn't match if name and class loader match. Actually I think the code was for this case: if you match (name, classLoader) with a calling class with the same classLoader and a different protectionDomain, the code should not match the class without going through the SystemDictionary to call checkPackageAccess() for the second protectionDomain. Since checkPackageAccess is now removed with the security manager, this extra lookup is now unnecessary and we match just class name, classLoader pairs. > > Tested with tier1-7. This pull request has now been integrated. 
Changeset: 1ed9ef1c Author: Coleen Phillimore URL: https://git.openjdk.org/jdk/commit/1ed9ef1c3f787b4075974d5dcfde1606d6bfbe86 Stats: 41 lines in 5 files changed: 0 ins; 33 del; 8 mod 8349559: Compiler interface doesn't need to store protection domain Reviewed-by: vlivanov, iklam ------------- PR: https://git.openjdk.org/jdk/pull/23496 From sparasa at openjdk.org Fri Feb 7 21:53:41 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Fri, 7 Feb 2025 21:53:41 GMT Subject: RFR: 8349582: APX NDD code generation for OpenJDK [v2] In-Reply-To: References: Message-ID: > The goal of this PR is to generate code using APX NDD instructions. > > **Please note:** I'm on vacation till March 3rd. Responses to the PR comments will be delayed until March 4th. Thank You for your understanding! Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: revert to nf version for {pop/tz/lz}cnt count instructions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23501/files - new: https://git.openjdk.org/jdk/pull/23501/files/e9794382..5306c39c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23501&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23501&range=00-01 Stats: 37 lines in 1 file changed: 0 ins; 13 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/23501.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23501/head:pull/23501 PR: https://git.openjdk.org/jdk/pull/23501 From sparasa at openjdk.org Fri Feb 7 21:53:41 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Fri, 7 Feb 2025 21:53:41 GMT Subject: RFR: 8349582: APX NDD code generation for OpenJDK [v2] In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 09:15:22 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> revert to nf version for {pop/tz/lz}cnt count instructions > > 
src/hotspot/cpu/x86/x86_64.ad line 5804: > >> 5802: %} >> 5803: >> 5804: instruct countTrailingZerosL_mem_nf(rRegI dst, memory src, rFlagsReg cr) %{ > > Why are we using rflagsReg operand for Non-Flags affecting patterns? Thanks for the catch! Please see the updated NF code for count instructions which reverted to the previous implementation which does not use the CR flags. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23501#discussion_r1947239459 From sparasa at openjdk.org Fri Feb 7 22:01:12 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Fri, 7 Feb 2025 22:01:12 GMT Subject: RFR: 8349582: APX NDD code generation for OpenJDK [v2] In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 09:12:43 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> revert to nf version for {pop/tz/lz}cnt count instructions > > src/hotspot/cpu/x86/x86_64.ad line 1555: > >> 1553: } else { >> 1554: return (offset < 0x80) ? 5 : 8; // REX >> 1555: } > > Please move this out to a separate patch. Could you please create the PR as you identified and fixed this issue? I will remove this code block once your updated BoxLock node fix is integrated into the mainline. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23501#discussion_r1947248354 From kvn at openjdk.org Fri Feb 7 22:08:28 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 7 Feb 2025 22:08:28 GMT Subject: RFR: 8346836: C2: Introduce a way to verify the correctness of ConstraintCastNodes at runtime In-Reply-To: References: Message-ID: On Fri, 17 Jan 2025 16:41:00 GMT, Vladimir Kozlov wrote: >> Hi, >> >> This patch adds a develop flag `VerifyConstraintCasts`, which will verify the correctness of `CastIINode`s and `CastLLNode`s at runtime and crash the VM if the dynamic value lies outside the type value range. >> >> Please take a look, thanks a lot. 
> > We can add this flag to our stress testing sets of flags to make sure we run with it during our regular testing. > @vnkozlov I don't have an AArch64 machine so I feel less confident writing one. We can add an AArch64 implementation later, though. What do you think? Okay, later is fine. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22880#issuecomment-2644215240 From qamai at openjdk.org Sat Feb 8 04:23:17 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Sat, 8 Feb 2025 04:23:17 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 19:11:58 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: >> >> // We are allowed to use the constant type only if cast succeeded >> >> But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. >> >> Please take a look and leave your reviews, thanks a lot. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into loadklassctrl > - format > - clearer intention, revert formatting, add assert > - remove always_see_exact_class > - remove control input of LoadKlassNode Thanks a lot for your reviews and testing! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23274#issuecomment-2644491331 From qamai at openjdk.org Sat Feb 8 04:23:18 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Sat, 8 Feb 2025 04:23:18 GMT Subject: Integrated: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode In-Reply-To: References: Message-ID: On Thu, 23 Jan 2025 17:22:02 GMT, Quan Anh Mai wrote: > Hi, > > This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: > > // We are allowed to use the constant type only if cast succeeded > > But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. > > Please take a look and leave your reviews, thanks a lot. This pull request has now been integrated. Changeset: e9278de3 Author: Quan Anh Mai URL: https://git.openjdk.org/jdk/commit/e9278de3f8676c288bfdce96f8348470e7c42900 Stats: 60 lines in 10 files changed: 5 ins; 18 del; 37 mod 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode Reviewed-by: vlivanov, epeter ------------- PR: https://git.openjdk.org/jdk/pull/23274 From aph-open at littlepinkcloud.com Sat Feb 8 10:38:23 2025 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Sat, 8 Feb 2025 10:38:23 +0000 Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: <96d307d2-136b-4cbd-9fd3-47e12e7afcc9@littlepinkcloud.com> On 2/7/25 18:33, Boris Ulasevich wrote: > I think ADP+MOVK is better both in terms of performance and code density. Good work, you may be right. Neoverse N1 has a fairly narrow (4 wide) decoder, so I guess it's more likely to be limited by instruction count. 
That benchmark isn't valid for GCC on my machine, because its outputs aren't used so GCC doesn't generate code for the asm. However, if we change the benchmark to actually *do something* with the data (simply add the results together) we get this for movz+movk on Apple M1: 2,332,615,983 cycles:u # 3.135 GHz (95.26%) 18,660,205,348 instructions:u # 8.00 insn per cycle (95.26%) and this for adrp+movk: 2,563,872,489 cycles:u # 3.057 GHz (96.03%) 14,357,197,644 instructions:u # 5.60 insn per cycle (96.03%) Here we can see that the M1 is totally front-end limited: 8 ipc is the speed of light on an M1. Nonetheless, the timings are similar, with the win going to movz+movk. On Neoverse V2, I also see an advantage for adrp+movk: 4162362189 cycles:u # 2.796 GHz 25002398111 instructions:u # 6.01 insn per cycle 3243420864 cycles:u # 2.796 GHz 21002398115 instructions:u # 6.48 insn per cycle So, looks like adrp+movk has an overall advantage. I'm still somewhat skeptical that this usage really deserves a reloc handler of its own, though, given the usage. If we do decide to do this, please give the forced movk version of adrp() a new name, and have adrp() call it. Having said all of that, I'm not sure why we're seeing such different instruction counts for Apple M1 and Neoverse V2, I guess it must be the compiler but I don't know why, so take all of this with a big pinch of salt. For this really to be valid I guess we'd have to use the exact same binaries. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From aph-open at littlepinkcloud.com Sat Feb 8 10:59:06 2025 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Sat, 8 Feb 2025 10:59:06 +0000 Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: <96d307d2-136b-4cbd-9fd3-47e12e7afcc9@littlepinkcloud.com> References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> <96d307d2-136b-4cbd-9fd3-47e12e7afcc9@littlepinkcloud.com> Message-ID: <7d43af45-8f33-49ea-a890-baa3d808d4d3@littlepinkcloud.com> On 2/8/25 10:38, Andrew Haley wrote: > On 2/7/25 18:33, Boris Ulasevich wrote: >> I think ADP+MOVK is better both in terms of performance and code density. > > Good work, you may be right. > > Neoverse N1 has a fairly narrow (4 wide) decoder, so I guess it's more likely > to be limited by instruction count. > > That benchmark isn't valid for GCC on my machine, because its outputs aren't > used so GCC doesn't generate code for the asm. > > However, if we change the benchmark to actually *do something* with the data > (simply add the results together) we get this for movz+movk on Apple M1: Sorry, I messed up: 2,916,015,951 cycles:u # 3.151 GHz (95.33%) 22,773,874,629 instructions:u # 7.81 insn per cycle (95.33%) adrp: 3,132,136,323 cycles:u # 3.114 GHz (96.60%) 18,420,464,246 instructions:u # 5.88 insn per cycle (96.60%) -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From vkempik at openjdk.org Sat Feb 8 12:03:15 2025 From: vkempik at openjdk.org (Vladimir Kempik) Date: Sat, 8 Feb 2025 12:03:15 GMT Subject: RFR: 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison [v2] In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 12:29:33 GMT, Hamlin Li wrote: >> Hi, >> >> Can you help to review the patch? 
>> >> It tries to improve the string compare when AvoidUnalignedAccesses == false && encoding is LU or UL (i.e. 2 strings encodings are different with each other). >> The jmh test shows when `-CompactObjectHeaders` (i.e. -COH) && `-AvoidUnalignedAccesses`, the patch bring much better performance, and in other cases, it does not bring obvious regression. And currently by default it's -COH. >> >> Thanks >> >> ### Performance >> >> it's run on bananapi. >> >> -COH-AvoidUnalignedAccesses >> >> ?-COH-Avoid? | (delta) | (size) | (utf16) | Mode | Cnt | Score - master | Score - patch | Error | Units | Improvement >> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 24 | N/A | avgt | 10 | 6438443.073 | 6383881.891 | 36912.539 | ns/op | 0.009 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 36 | N/A | avgt | 10 | 9421176.34 | 9390907.1 | 21034.266 | ns/op | 0.003 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 72 | N/A | avgt | 10 | 18592342.33 | 16871350.38 | 15550.827 | ns/op | 0.102 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 128 | N/A | avgt | 10 | 30916157.05 | 28646961.11 | 9263.556 | ns/op | 0.079 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 256 | N/A | avgt | 10 | 58945069.17 | 55505097.77 | 8803.847 | ns/op | 0.062 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 512 | N/A | avgt | 10 | 115520355.5 | 110233842.8 | 35056.972 | ns/op | 0.048 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 24 | N/A | avgt | 10 | 7541299.83 | 7481385.995 | 43240.713 | ns/op | 0.008 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 36 | N/A | avgt | 10 | 10295051.77 | 10264978.04 | 38938.956 | ns/op | 0.003 >> 
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 72 | N/A | avgt | 10 | 19652419.64 | 17953481.41 | 10987.17 | ns/op | 0.095 >> com.arm.benchmarks.... > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > refine comments hold on a bit, let me take a look at it after the weekend ------------- PR Comment: https://git.openjdk.org/jdk/pull/23495#issuecomment-2645236322 From duke at openjdk.org Sat Feb 8 14:55:51 2025 From: duke at openjdk.org (Matthias Ernst) Date: Sat, 8 Feb 2025 14:55:51 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v20] In-Reply-To: References: Message-ID: > Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. > > Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: > > > (base + (index + 1) << 8) & 255 > => MulNode > (base + (index << 8 + 256)) & 255 > => AddNode > ((base + index << 8) + 256) & 255 > > > Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: > > > ((base + index << 8) + 256) & 255 > => MulNode (this PR) > (base + index << 8) & 255 > => MulNode (PR #6697) > base & 255 (loop invariant) > > > Implementation notes: > * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. > * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ > * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). 
Let me know if would like to see separate cases for these.~ Matthias Ernst has updated the pull request incrementally with two additional commits since the last revision: - Reword correctness. - Reword correctness. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22856/files - new: https://git.openjdk.org/jdk/pull/22856/files/1d23c1a4..a7544441 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=19 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=18-19 Stats: 9 lines in 1 file changed: 1 ins; 2 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/22856.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22856/head:pull/22856 PR: https://git.openjdk.org/jdk/pull/22856 From duke at openjdk.org Sat Feb 8 14:59:07 2025 From: duke at openjdk.org (Matthias Ernst) Date: Sat, 8 Feb 2025 14:59:07 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v21] In-Reply-To: References: Message-ID: > Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. > > Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: > > > (base + (index + 1) << 8) & 255 > => MulNode > (base + (index << 8 + 256)) & 255 > => AddNode > ((base + index << 8) + 256) & 255 > > > Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: > > > ((base + index << 8) + 256) & 255 > => MulNode (this PR) > (base + index << 8) & 255 > => MulNode (PR #6697) > base & 255 (loop invariant) > > > Implementation notes: > * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. 
> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ > * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ Matthias Ernst has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 38 additional commits since the last revision: - Merge branch 'openjdk:master' into mernst/JDK-8346664 - Reword correctness. - Reword correctness. - Comments, "Proof", order of checks. - Apply suggestions from code review Co-authored-by: Emanuel Peter - jlong, not long - Merge branch 'openjdk:master' into mernst/JDK-8346664 - dropped bug ref. - indent - consistently label failing cases due to Align requirements. - ... and 28 more: https://git.openjdk.org/jdk/compare/0bddcce5...9783ce49 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22856/files - new: https://git.openjdk.org/jdk/pull/22856/files/a7544441..9783ce49 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=20 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=19-20 Stats: 8045 lines in 259 files changed: 3109 ins; 3001 del; 1935 mod Patch: https://git.openjdk.org/jdk/pull/22856.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22856/head:pull/22856 PR: https://git.openjdk.org/jdk/pull/22856 From duke at openjdk.org Sat Feb 8 18:30:56 2025 From: duke at openjdk.org (Matthias Ernst) Date: Sat, 8 Feb 2025 18:30:56 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v22] In-Reply-To: References: Message-ID: > Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. 
> > Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: > > > (base + (index + 1) << 8) & 255 > => MulNode > (base + (index << 8 + 256)) & 255 > => AddNode > ((base + index << 8) + 256) & 255 > > > Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: > > > ((base + index << 8) + 256) & 255 > => MulNode (this PR) > (base + index << 8) & 255 > => MulNode (PR #6697) > base & 255 (loop invariant) > > > Implementation notes: > * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. > * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ > * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: Reword correctness (fixes). 
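The reduction chain in the description above can be checked numerically. The following is only a minimal sketch in plain Java, not code from the PR: the class and method names are hypothetical, the expression is parenthesized as `(base + ((index + 1) << 8)) & 255` (which is how the description appears to be meant), and it tests the arithmetic identity on random inputs rather than anything about how C2 performs the transformation.

```java
import java.util.Random;

// Illustrative only: names are hypothetical and not taken from the PR.
public class MaskedSumSketch {
    // Unreduced form: (base + ((index + 1) << shift)) & mask, with mask = 2^shift - 1.
    static long original(long base, long index, int shift) {
        long mask = (1L << shift) - 1;
        return (base + ((index + 1) << shift)) & mask;
    }

    // Reduced form: a term shifted left by `shift` has zero low bits,
    // so only `base` can contribute to the masked result.
    static long reduced(long base, int shift) {
        long mask = (1L << shift) - 1;
        return base & mask;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            long base = rnd.nextLong();
            long index = rnd.nextLong();
            if (original(base, index, 8) != reduced(base, 8)) {
                throw new AssertionError("mismatch for base=" + base + ", index=" + index);
            }
        }
        System.out.println("all sampled cases agree");
    }
}
```

The identity holds under Java's wrapping `long` arithmetic because addition modulo 2^64 leaves the low 8 bits of the sum equal to the low 8 bits of `base` whenever the other addend is a multiple of 256.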
------------- Changes: - all: https://git.openjdk.org/jdk/pull/22856/files - new: https://git.openjdk.org/jdk/pull/22856/files/9783ce49..09f01e80 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=21 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=20-21 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/22856.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22856/head:pull/22856 PR: https://git.openjdk.org/jdk/pull/22856 From jkarthikeyan at openjdk.org Sat Feb 8 21:58:10 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Sat, 8 Feb 2025 21:58:10 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v2] In-Reply-To: <2I5BKquTjwv3_pAmR-YQs-N3KmMbJ0MgszHuYO6AUsk=.5df8d0a3-aa19-4433-a276-473411b7c5a2@github.com> References: <1kjZnYmjzNrXuXFsPlpCY_LAPHEQz30i_RpDmr3Xh80=.307d3d09-c4d2-4211-8e0b-5e8beb3b8f3c@github.com> <2I5BKquTjwv3_pAmR-YQs-N3KmMbJ0MgszHuYO6AUsk=.5df8d0a3-aa19-4433-a276-473411b7c5a2@github.com> Message-ID: On Fri, 7 Feb 2025 10:02:44 GMT, Roland Westrelin wrote: >> src/hotspot/share/opto/mulnode.cpp line 1399: >> >>> 1397: assert(lo <= hi, "must have valid bounds"); >>> 1398: #ifdef ASSERT >>> 1399: if (bt ==T_INT) { >> >> Suggestion: >> >> if (bt == T_INT) { >> >> Could this assert be generic to also handle T_LONG too? > > The assert checks that, for the int case: > > > long lo; > assert((int)(lo >> shift) == (((int)lo) >> shift, ""); > > For long, it would be: > > long lo; > assert((long)(lo >> shift) == (((long)lo) >> shift, ""); > > Given everything is already a long, that's: > > long lo; > assert(lo >> shift == lo >> shift, ""); Ah I see, thank you for the explanation! 
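The int-case equivalence that the quoted assert checks can be tried out in plain Java. This is an illustrative sketch only, not code from the PR: the class and method names are made up, the C++ casts from the review comment are transliterated to Java, the shift count is restricted to 0..31 (Java masks shift counts differently for `int` and `long`), and `lo` is assumed to be a bound that fits in an `int`.

```java
// Illustrative only: names are hypothetical and not taken from the PR.
public class ShiftBoundsSketch {
    // For a bound that fits in an int, shifting the widened long and then
    // narrowing should equal narrowing first and shifting the int.
    static boolean intCaseHolds(long lo, int shift) {
        return (int) (lo >> shift) == (((int) lo) >> shift);
    }

    public static void main(String[] args) {
        long[] samples = {Integer.MIN_VALUE, -12345, -1, 0, 1, 12345, Integer.MAX_VALUE};
        for (long lo : samples) {
            for (int shift = 0; shift < 32; shift++) {
                if (!intCaseHolds(lo, shift)) {
                    throw new AssertionError("fails for lo=" + lo + ", shift=" + shift);
                }
            }
        }
        System.out.println("int case holds for all samples");
    }
}
```

For the long case both sides are the same expression, which is why the check is only meaningful for `T_INT`. The equivalence does depend on `lo` fitting in an `int`: for example, `intCaseHolds(1L << 32, 1)` is false, since narrowing first discards the only set bit.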
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1947962232 From jkarthikeyan at openjdk.org Sun Feb 9 06:03:05 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Sun, 9 Feb 2025 06:03:05 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v3] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 19:50:55 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine: >> >> >> Baseline Patch >> Benchmark (SIZE) Mode Cnt Score Error Units Score Error Units Improvement >> VectorSubword.intToByte 1024 avgt 12 200.049 ? 19.787 ns/op 56.228 ? 3.535 ns/op (3.56x) >> VectorSubword.intToShort 1024 avgt 12 179.826 ? 1.539 ns/op 43.332 ? 1.166 ns/op (4.15x) >> VectorSubword.shortToByte 1024 avgt 12 245.580 ? 6.150 ns/op 29.757 ? 1.055 ns/op (8.25x) >> >> >> I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated! 
> > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Fix some tests that now vectorize

I also updated the benchmark, and got these results:

                                              Baseline                 Patch
Benchmark                  (SIZE)  Mode  Cnt  Score     Error   Units  Score    Error  Units  Improvement
VectorSubword.byteToInt    1024    avgt  12   185.700 ±  0.798  ns/op  37.427 ± 0.276  ns/op  (4.96x)
VectorSubword.byteToShort  1024    avgt  12   240.737 ±  1.087  ns/op  23.094 ± 0.502  ns/op  (10.42x)
VectorSubword.intToByte    1024    avgt  12   181.680 ±  0.553  ns/op  49.873 ± 1.613  ns/op  (3.64x)
VectorSubword.intToShort   1024    avgt  12   176.256 ±  1.414  ns/op  43.933 ± 4.310  ns/op  (4.01x)
VectorSubword.shortToByte  1024    avgt  12   245.600 ±  6.217  ns/op  28.426 ± 0.649  ns/op  (8.64x)
VectorSubword.shortToInt   1024    avgt  12   178.364 ±  2.921  ns/op  34.140 ± 0.229  ns/op  (5.22x)

------------- PR Comment: https://git.openjdk.org/jdk/pull/23413#issuecomment-2646084657 From jkarthikeyan at openjdk.org Sun Feb 9 06:03:03 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Sun, 9 Feb 2025 06:03:03 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v4] In-Reply-To: References: Message-ID: <0iE7uPGSBpBdlgayY_gqBpuPay-XSpjMdaOuqdo-nhs=.1c7fa2cb-f1ea-4810-8fe6-2e0e6af7b8ac@github.com> > Hi all, > This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine:
>
>                                               Baseline                 Patch
> Benchmark                  (SIZE)  Mode  Cnt  Score     Error   Units  Score    Error  Units  Improvement
> VectorSubword.intToByte    1024    avgt  12   200.049 ± 19.787  ns/op  56.228 ± 3.535  ns/op  (3.56x)
> VectorSubword.intToShort   1024    avgt  12   179.826 ±  1.539  ns/op  43.332 ± 1.166  ns/op  (4.15x)
> VectorSubword.shortToByte  1024    avgt  12   245.580 ±  6.150  ns/op  29.757 ± 1.055  ns/op  (8.25x)
>
> I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated!

Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Add new conversions to benchmark ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23413/files - new: https://git.openjdk.org/jdk/pull/23413/files/cf75b269..6daa8ace Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23413&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23413&range=02-03 Stats: 21 lines in 1 file changed: 21 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23413.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23413/head:pull/23413 PR: https://git.openjdk.org/jdk/pull/23413 From jkarthikeyan at openjdk.org Sun Feb 9 06:06:09 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Sun, 9 Feb 2025 06:06:09 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v7] In-Reply-To: <3cT_HJ9dj5J4NFrLzmvYUdUy4uee6Ltcm6d20YP3jm0=.aa20c25e-c097-4e59-9d82-12aa2c3b4422@github.com> References: <3cT_HJ9dj5J4NFrLzmvYUdUy4uee6Ltcm6d20YP3jm0=.aa20c25e-c097-4e59-9d82-12aa2c3b4422@github.com> Message-ID: On Fri, 7 Feb 2025 09:52:51 GMT, Roland Westrelin wrote: >> This change refactors `RShiftI`/`RShiftL` `Ideal`, `Identity` and >> `Value` because the `int` and `long` versions are very similar and so >> there's no logic duplication. 
In the process, support for some extra >> transformations is added to `RShiftL`. I also added some new test >> cases. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Thanks for the update, it looks good! ------------- Marked as reviewed by jkarthikeyan (Committer). PR Review: https://git.openjdk.org/jdk/pull/23438#pullrequestreview-2604126801 From kvn at openjdk.org Sun Feb 9 19:43:29 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sun, 9 Feb 2025 19:43:29 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. 
> > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Fix Zero and Minimal VM builds ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/11abd5e7..dda20f0b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=00-01 Stats: 6 lines in 1 file changed: 4 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Sun Feb 9 19:43:29 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sun, 9 Feb 2025 19:43:29 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 17:45:30 GMT, Vladimir Kozlov wrote: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp @dougxc and @tkrodriguez, please look if it affects Graal. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2646553512 From duke at openjdk.org Mon Feb 10 01:44:24 2025 From: duke at openjdk.org (Nicole Xu) Date: Mon, 10 Feb 2025 01:44:24 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: <2wciGZERU8bFZReIK9wzaQEFGMBwbbMWfneJGWjTPKg=.afc9ea1d-860b-46b6-a243-a107330041a5@github.com> On Tue, 4 Feb 2025 18:55:25 GMT, Emanuel Peter wrote: >> Suite MaskedLogicOpts.maskedLogicOperationsLong512() failed on both x86 and AArch64 with the following error: >> >> >> java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249 >> >> >> The variable `long256_arr_idx` is misused when indexing 'LongVector l2, l3, l4, l5' in function `maskedLogicOperationsLongKernel()`. 'long256_arr_idx' increases by 4 every time the benchmark runs and ensures the incremented value remains within the bounds of the array. However, for `LongVector.SPECIES_512`, it loads 8 numbers from the array each time the benchmark runs, resulting in an out-of-range indexing issue. >> >> Hence, we revised the index variables from `long256_arr_idx` to `long512_arr_idx`, which has a stride of 8, to ensure that the loaded vector is inside of the array boundary for all vector species. This is also consistent with other kernel functions. >> >> Additionally, some defined but unused variables have been removed. > > Oh, the OCA-verify is still stuck. I'm sorry about that. > I pinged my manager @TobiHartmann, he will reach out to see what's the issue. Thanks @eme64. The OCA finally passed! 
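The stride mismatch described in the quoted report can be reproduced without the Vector API. The sketch below is a plain-Java illustration under stated assumptions, not the benchmark's code: the class and method names are hypothetical, the array length of 256 is inferred from the error message rather than taken from the source, the wrap-around rule is a guess at how the benchmark keeps its index in range, and `Objects.checkFromIndexSize` stands in for the bounds check a vector load performs.

```java
import java.util.Objects;

// Illustrative only: names, array length and wrap-around rule are assumptions.
public class StrideSketch {
    // Advance the index by `stride`, wrapping to 0 once a `stride`-sized
    // access starting at the new index would no longer fit in the array.
    static int next(int idx, int stride, int length) {
        int n = idx + stride;
        return (n + stride > length) ? 0 : n;
    }

    // Stand-in for the bounds check done when loading `lanes` contiguous elements.
    static boolean loadFits(int offset, int lanes, int length) {
        try {
            Objects.checkFromIndexSize(offset, lanes, length);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    // Returns the first offset at which a `lanes`-wide load fails, or -1 if none does.
    static int firstFailure(int stride, int lanes, int length) {
        int idx = 0;
        for (int i = 0; i < 10 * length; i++) {
            if (!loadFits(idx, lanes, length)) {
                return idx;
            }
            idx = next(idx, stride, length);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstFailure(4, 8, 256)); // stride 4, 8 lanes: prints 252
        System.out.println(firstFailure(8, 8, 256)); // stride 8, 8 lanes: prints -1
    }
}
```

With a stride of 4 the index legitimately reaches 252, where a 4-lane load still fits but an 8-lane load does not, which matches the failing index in the report. Using an index whose stride equals the widest lane count avoids the problem for every species.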
------------- PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2646741800 From duke at openjdk.org Mon Feb 10 01:48:24 2025 From: duke at openjdk.org (Nicole Xu) Date: Mon, 10 Feb 2025 01:48:24 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: On Wed, 8 Jan 2025 09:04:47 GMT, Nicole Xu wrote: > Suite MaskedLogicOpts.maskedLogicOperationsLong512() failed on both x86 and AArch64 with the following error: > > > java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249 > > > The variable `long256_arr_idx` is misused when indexing 'LongVector l2, l3, l4, l5' in function `maskedLogicOperationsLongKernel()`. 'long256_arr_idx' increases by 4 every time the benchmark runs and ensures the incremented value remains within the bounds of the array. However, for `LongVector.SPECIES_512`, it loads 8 numbers from the array each time the benchmark runs, resulting in an out-of-range indexing issue. > > Hence, we revised the index variables from `long256_arr_idx` to `long512_arr_idx`, which has a stride of 8, to ensure that the loaded vector is inside of the array boundary for all vector species. This is also consistent with other kernel functions. > > Additionally, some defined but unused variables have been removed. Hi @jatin-bhateja Would you help to review the patch and verify the changes? Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2646745390 From fyang at openjdk.org Mon Feb 10 02:38:15 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 10 Feb 2025 02:38:15 GMT Subject: RFR: 8349666: RISC-V: enable superwords tests for vector reductions In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 14:58:13 GMT, Hamlin Li wrote: > Hi, > Can you help to review this patch? > > On riscv, some vector reduction intrinsics were already implemented, but they are not verified indeed. 
This patch is to enable these tests on riscv. > > Thanks LGTM. Did you check ProdRed_*.java under the same directory?Thanks. ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23518#pullrequestreview-2604584328 From cjplummer at openjdk.org Mon Feb 10 03:14:22 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Mon, 10 Feb 2025 03:14:22 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 19:43:29 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero and Minimal VM builds I almost wished I hadn't looked because there is a lot of SA CodeBlob support that could use some cleanup. Most notably I think most of the wrapper subclasses are not needed by SA, and could be served by one common class. See what I'm doing in #23456 for JavaThread subclasses. Wrapper classes don't need to be 1-to-1 with the class type they are wrapping. A single wrapper class type can handle any number of hotspot types. It usually just make more sense for them to be 1-to-1, but when they are trivial and the implementation is replicated across subtypes, just having one wrapper class implement them all can simplify things. 
The other thing I noticed is a lot of the subtypes seem to be doing some unnecessary things like registering Observers, and they all do something like the following: 44 Type type = db.lookupType("BufferBlob"); Even when it never references "type". I'm not suggesting you clean up any of this now, but just pointed it out. I might file an issue and try to clean it up myself at some point. I still need to take a closer look at the SA changes. src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 38: > 36: public class CodeCache { > 37: private static GrowableArray heapArray; > 38: private static VirtualConstructor virtualConstructor; What is the reason for switching from the virtualConstructor/hashMap approach to using getClassFor()? The hashmap is the model for JavaThread, MetaData, and CollectedHeap subtypes. ------------- PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2604594200 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1948335278 From cjplummer at openjdk.org Mon Feb 10 03:29:13 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Mon, 10 Feb 2025 03:29:13 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 10 Feb 2025 02:47:58 GMT, Chris Plummer wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero and Minimal VM builds > > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 38: > >> 36: public class CodeCache { >> 37: private static GrowableArray heapArray; >> 38: private static VirtualConstructor virtualConstructor; > > What is the reason for switching from the virtualConstructor/hashMap approach to using getClassFor()? The hashmap is the model for JavaThread, MetaData, and CollectedHeap subtypes. I think I found the answer. 
Since there is no longer a vtable, TypeDataBase.addressTypeIsEqualToType() will no longer work for Codeblobs. I was wondering if the lack of a vtable might have some negative impact. Glad to see you found a solution. I hope the lack of a vtable does not creep in elsewhere. I know it will have some negative impact on things like the "findpc" functionality, which will no longer be able to tell the user that an address points to a Codeblob instance. There's no test for this, but users might run across it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1948352958 From jbhateja at openjdk.org Mon Feb 10 05:33:25 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 10 Feb 2025 05:33:25 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 10:05:09 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. 
Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Fixing typos Hi @PaulSandoz , Kindly let us know if this is good for integration. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2646957788 From jbhateja at openjdk.org Mon Feb 10 05:36:15 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 10 Feb 2025 05:36:15 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v18] In-Reply-To: <90MwDac7Q83dK8KDagHOst15xV-quGZKVE8n2tP9dsk=.351ed042-9a69-4186-b134-8c3cb6fef6cd@github.com> References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> <90MwDac7Q83dK8KDagHOst15xV-quGZKVE8n2tP9dsk=.351ed042-9a69-4186-b134-8c3cb6fef6cd@github.com> Message-ID: On Tue, 4 Feb 2025 19:20:05 GMT, Emanuel Peter wrote: >> Hi @eme64 , Kindly share the results of your test runs. > > @jatin-bhateja Tests look all good on my side. I'll make another pass in the next few days, and hopefully approve. Hi @eme64 , All comments addressed, looking forward to your approval. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22863#issuecomment-2646961342 From mli at openjdk.org Mon Feb 10 09:12:14 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 10 Feb 2025 09:12:14 GMT Subject: RFR: 8349666: RISC-V: enable superwords tests for vector reductions In-Reply-To: References: Message-ID: On Mon, 10 Feb 2025 02:35:08 GMT, Fei Yang wrote: > LGTM. Did you check ProdRed_*.java under the same directory?Thanks. Thank you. I'm starting to work on https://github.com/openjdk/jdk/pull/19015 again, will modify the code accordingly later in that PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23518#issuecomment-2647366630 From luhenry at openjdk.org Mon Feb 10 11:01:29 2025 From: luhenry at openjdk.org (Ludovic Henry) Date: Mon, 10 Feb 2025 11:01:29 GMT Subject: RFR: 8349666: RISC-V: enable superwords tests for vector reductions In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 14:58:13 GMT, Hamlin Li wrote: > Hi, > Can you help to review this patch? > > On riscv, some vector reduction intrinsics were already implemented, but they are not verified indeed. This patch is to enable these tests on riscv. > > Thanks Marked as reviewed by luhenry (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23518#pullrequestreview-2605401096 From dnsimon at openjdk.org Mon Feb 10 11:03:13 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 10 Feb 2025 11:03:13 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 19:36:28 GMT, Vladimir Kozlov wrote: > @dougxc and @tkrodriguez, please look if it affects Graal. I'm pretty sure JVMCI does not care about the virtual-ness of these C++ classes. Running tier9 in mach5 is a good way to be sure. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2647642674 From vkempik at openjdk.org Mon Feb 10 11:03:11 2025 From: vkempik at openjdk.org (Vladimir Kempik) Date: Mon, 10 Feb 2025 11:03:11 GMT Subject: RFR: 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison [v2] In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 12:29:33 GMT, Hamlin Li wrote: >> Hi, >> >> Can you help to review the patch? >> >> It tries to improve the string compare when AvoidUnalignedAccesses == false && encoding is LU or UL (i.e. 2 strings encodings are different with each other). >> The jmh test shows when `-CompactObjectHeaders` (i.e. -COH) && `-AvoidUnalignedAccesses`, the patch bring much better performance, and in other cases, it does not bring obvious regression. And currently by default it's -COH. >> >> Thanks >> >> ### Performance >> >> it's run on bananapi. >> >> -COH-AvoidUnalignedAccesses >> >> ?-COH-Avoid? | (delta) | (size) | (utf16) | Mode | Cnt | Score - master | Score - patch | Error | Units | Improvement >> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 24 | N/A | avgt | 10 | 6438443.073 | 6383881.891 | 36912.539 | ns/op | 0.009 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 36 | N/A | avgt | 10 | 9421176.34 | 9390907.1 | 21034.266 | ns/op | 0.003 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 72 | N/A | avgt | 10 | 18592342.33 | 16871350.38 | 15550.827 | ns/op | 0.102 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 128 | N/A | avgt | 10 | 30916157.05 | 28646961.11 | 9263.556 | ns/op | 0.079 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 256 | N/A | avgt | 10 | 58945069.17 | 55505097.77 | 8803.847 | ns/op | 0.062 >> 
com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 512 | N/A | avgt | 10 | 115520355.5 | 110233842.8 | 35056.972 | ns/op | 0.048 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 24 | N/A | avgt | 10 | 7541299.83 | 7481385.995 | 43240.713 | ns/op | 0.008 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 36 | N/A | avgt | 10 | 10295051.77 | 10264978.04 | 38938.956 | ns/op | 0.003 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 72 | N/A | avgt | 10 | 19652419.64 | 17953481.41 | 10987.17 | ns/op | 0.095 >> com.arm.benchmarks.... > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > refine comments I can't see how this end up improving perf, You have moved loading job for first few bytes from one place to another. Technically it's correct and if jmh shows improvement - let it be ------------- PR Comment: https://git.openjdk.org/jdk/pull/23495#issuecomment-2647644475 From adinn at openjdk.org Mon Feb 10 11:07:14 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 10 Feb 2025 11:07:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <0LQ3b0zaCg8HEDx4C5xM8W4-qmQ9PkoAClhyVxKxxtE=.8cd94c7a-8496-436c-8387-6aa443942bb6@github.com> On Sun, 9 Feb 2025 19:43:29 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. 
>> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero and Minimal VM builds src/hotspot/share/code/codeBlob.cpp line 58: > 56: #include > 57: > 58: // Virtual methods are not allowed in code blobs to simplify caching compiled code.

Is it worth considering generating this code plus also some of the existing code in the header using an iterator template macro? e.g.

    #define CODEBLOBS_DO(do_codeblob_abstract, do_codeblob_nonleaf, \
                         do_codeblob_leaf) \
      do_codeblob_abstract(CodeBlob) \
      do_codeblob_leaf(nmethod, Nmethod, nmethod) \
      do_codeblob_abstract(RuntimeBlob) \
      do_codeblob_nonleaf(BufferBlob, Buffer, buffer) \
      do_codeblob_leaf(AdapterBlob, Adapter, adapter) \
      . . . \
      do_codeblob_leaf(RuntimeStub, Runtime_Stub, runtime_stub) \
      . . .

The macro arguments to the templates would themselves be macros:

    do_codeblob_abstract(classname) // abstract, non-instantiable class
    do_codeblob_nonleaf(classname, kindname, accessorname) // instantiable, subclassable
    do_codeblob_leaf(classname, kindname, accessorname) // instantiable, non-subclassable

Using a template macro like this to generate the code below -- *plus also* some of the code currently declared piecemeal in the header -- would guarantee all cases are covered now and will remain so later when the macro is updated. I think it would probably also allow case handling code in AOT cache code to be generated.

So, we would generate the code here as follows

    #define EMPTY1(classname)
    #define EMPTY3(classname, kindname, accessorname)

    #define assert_nonvirtual_leaf(classname, kindname, accessorname) \
      static_assert(!std::is_polymorphic<classname>::value, \
                    "no virtual methods are allowed in " #classname );

    CODEBLOBS_DO(EMPTY1, EMPTY3, assert_nonvirtual_leaf)

    #undef assert_nonvirtual_leaf

Likewise in codeBlob.hpp we could generate `enum CodeBlobKind` to cover all the non-abstract classes and likewise generate the accessor methods `is_nmethod()`, `is_buffer_blob()` in class `CodeBlob` which allow the kind to be tested.

    #define codekind_enum_tag(classname, kindname, accessorname) \
      kindname,

    enum CodeBlobKind : u1 {
      None,
      CODEBLOBS_DO(EMPTY1, codekind_enum_tag, codekind_enum_tag)
      Number_Of_Kinds
    };

    #define is_codeblob_define(classname, kindname, accessorname) \
      bool is_##accessorname() const { return _kind == kindname; }

    class CodeBlob {
      . . .
      CODEBLOBS_DO(EMPTY1, is_codeblob_define, is_codeblob_define);
      . . .

There may be other opportunities to use the iterator (e.g. in vmStructs.cpp?) but this looks like a good start.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1948849392 From mli at openjdk.org Mon Feb 10 11:28:15 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 10 Feb 2025 11:28:15 GMT Subject: Integrated: 8349666: RISC-V: enable superwords tests for vector reductions In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 14:58:13 GMT, Hamlin Li wrote: > Hi, > Can you help to review this patch? > > On riscv, some vector reduction intrinsics were already implemented, but they are not verified indeed. This patch is to enable these tests on riscv. > > Thanks This pull request has now been integrated.
Changeset: 4a83ca12 Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/4a83ca120293aecbf21d7d005ba256e95fe98299 Stats: 73 lines in 9 files changed: 64 ins; 0 del; 9 mod 8349666: RISC-V: enable superwords tests for vector reductions Reviewed-by: fyang, luhenry ------------- PR: https://git.openjdk.org/jdk/pull/23518 From mli at openjdk.org Mon Feb 10 11:28:14 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 10 Feb 2025 11:28:14 GMT Subject: RFR: 8349666: RISC-V: enable superwords tests for vector reductions In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 14:58:13 GMT, Hamlin Li wrote: > Hi, > Can you help to review this patch? > > On riscv, some vector reduction intrinsics were already implemented, but they are not verified indeed. This patch is to enable these tests on riscv. > > Thanks Thank you! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23518#issuecomment-2647700535 From mli at openjdk.org Mon Feb 10 11:29:09 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 10 Feb 2025 11:29:09 GMT Subject: RFR: 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison [v2] In-Reply-To: References: Message-ID: On Mon, 10 Feb 2025 11:00:51 GMT, Vladimir Kempik wrote: > I can't see how this end up improving perf, You have moved loading job for first few bytes from one place to another. Technically it's correct and if jmh shows improvement - let it be The reason is that with -COH && -Avoid, some alignment instructions that were previously required are omitted. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23495#issuecomment-2647704435 From vkempik at openjdk.org Mon Feb 10 11:38:10 2025 From: vkempik at openjdk.org (Vladimir Kempik) Date: Mon, 10 Feb 2025 11:38:10 GMT Subject: RFR: 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison [v2] In-Reply-To: References: Message-ID: <_2uRao0X6IowzTOXHKcA94U-M0p1eam1AZeSAc3KNx0=.60dace4d-731c-4f73-85af-e10145d84468@github.com> On Fri, 7 Feb 2025 12:29:33 GMT, Hamlin Li wrote: >> Hi, >> >> Can you help to review the patch? >> >> It tries to improve the string compare when AvoidUnalignedAccesses == false && encoding is LU or UL (i.e. 2 strings encodings are different with each other). >> The jmh test shows when `-CompactObjectHeaders` (i.e. -COH) && `-AvoidUnalignedAccesses`, the patch bring much better performance, and in other cases, it does not bring obvious regression. And currently by default it's -COH. >> >> Thanks >> >> ### Performance >> >> it's run on bananapi. >> >> -COH-AvoidUnalignedAccesses >> >> ?-COH-Avoid? 
| (delta) | (size) | (utf16) | Mode | Cnt | Score - master | Score - patch | Error | Units | Improvement >> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 24 | N/A | avgt | 10 | 6438443.073 | 6383881.891 | 36912.539 | ns/op | 0.009 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 36 | N/A | avgt | 10 | 9421176.34 | 9390907.1 | 21034.266 | ns/op | 0.003 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 72 | N/A | avgt | 10 | 18592342.33 | 16871350.38 | 15550.827 | ns/op | 0.102 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 128 | N/A | avgt | 10 | 30916157.05 | 28646961.11 | 9263.556 | ns/op | 0.079 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 256 | N/A | avgt | 10 | 58945069.17 | 55505097.77 | 8803.847 | ns/op | 0.062 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 512 | N/A | avgt | 10 | 115520355.5 | 110233842.8 | 35056.972 | ns/op | 0.048 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 24 | N/A | avgt | 10 | 7541299.83 | 7481385.995 | 43240.713 | ns/op | 0.008 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 36 | N/A | avgt | 10 | 10295051.77 | 10264978.04 | 38938.956 | ns/op | 0.003 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 72 | N/A | avgt | 10 | 19652419.64 | 17953481.41 | 10987.17 | ns/op | 0.095 >> com.arm.benchmarks.... > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > refine comments Marked as reviewed by vkempik (Committer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/23495#pullrequestreview-2605497756 From mli at openjdk.org Mon Feb 10 11:51:14 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 10 Feb 2025 11:51:14 GMT Subject: RFR: 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison [v2] In-Reply-To: References: Message-ID: <3QptwjxMpRpnoZJYQTvEZj1bPnc0LQTHCXPl-aLLiJQ=.5845f6e1-d1b7-4021-adbf-aad6ed8407e9@github.com> On Fri, 7 Feb 2025 12:29:33 GMT, Hamlin Li wrote: >> Hi, >> >> Can you help to review the patch? >> >> It tries to improve the string compare when AvoidUnalignedAccesses == false && encoding is LU or UL (i.e. 2 strings encodings are different with each other). >> The jmh test shows when `-CompactObjectHeaders` (i.e. -COH) && `-AvoidUnalignedAccesses`, the patch bring much better performance, and in other cases, it does not bring obvious regression. And currently by default it's -COH. >> >> Thanks >> >> ### Performance >> >> it's run on bananapi. >> >> -COH-AvoidUnalignedAccesses >> >> ?-COH-Avoid? 
| (delta) | (size) | (utf16) | Mode | Cnt | Score - master | Score - patch | Error | Units | Improvement >> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 24 | N/A | avgt | 10 | 6438443.073 | 6383881.891 | 36912.539 | ns/op | 0.009 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 36 | N/A | avgt | 10 | 9421176.34 | 9390907.1 | 21034.266 | ns/op | 0.003 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 72 | N/A | avgt | 10 | 18592342.33 | 16871350.38 | 15550.827 | ns/op | 0.102 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 128 | N/A | avgt | 10 | 30916157.05 | 28646961.11 | 9263.556 | ns/op | 0.079 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 256 | N/A | avgt | 10 | 58945069.17 | 55505097.77 | 8803.847 | ns/op | 0.062 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 512 | N/A | avgt | 10 | 115520355.5 | 110233842.8 | 35056.972 | ns/op | 0.048 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 24 | N/A | avgt | 10 | 7541299.83 | 7481385.995 | 43240.713 | ns/op | 0.008 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 36 | N/A | avgt | 10 | 10295051.77 | 10264978.04 | 38938.956 | ns/op | 0.003 >> com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 72 | N/A | avgt | 10 | 19652419.64 | 17953481.41 | 10987.17 | ns/op | 0.095 >> com.arm.benchmarks.... > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > refine comments Thank you! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23495#issuecomment-2647752715 From mli at openjdk.org Mon Feb 10 11:51:15 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 10 Feb 2025 11:51:15 GMT Subject: Integrated: 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 15:59:16 GMT, Hamlin Li wrote: > Hi, > > Can you help to review the patch? > > It tries to improve the string compare when AvoidUnalignedAccesses == false && encoding is LU or UL (i.e. 2 strings encodings are different with each other). > The jmh test shows when `-CompactObjectHeaders` (i.e. -COH) && `-AvoidUnalignedAccesses`, the patch bring much better performance, and in other cases, it does not bring obvious regression. And currently by default it's -COH. > > Thanks > > ### Performance > > it's run on bananapi. > > -COH-AvoidUnalignedAccesses > > ?-COH-Avoid? | (delta) | (size) | (utf16) | Mode | Cnt | Score - master | Score - patch | Error | Units | Improvement > -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 24 | N/A | avgt | 10 | 6438443.073 | 6383881.891 | 36912.539 | ns/op | 0.009 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 36 | N/A | avgt | 10 | 9421176.34 | 9390907.1 | 21034.266 | ns/op | 0.003 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 72 | N/A | avgt | 10 | 18592342.33 | 16871350.38 | 15550.827 | ns/op | 0.102 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 128 | N/A | avgt | 10 | 30916157.05 | 28646961.11 | 9263.556 | ns/op | 0.079 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 256 | N/A | avgt | 10 | 58945069.17 | 55505097.77 | 8803.847 | ns/op | 0.062 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToLU | 2 | 
512 | N/A | avgt | 10 | 115520355.5 | 110233842.8 | 35056.972 | ns/op | 0.048 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 24 | N/A | avgt | 10 | 7541299.83 | 7481385.995 | 43240.713 | ns/op | 0.008 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 36 | N/A | avgt | 10 | 10295051.77 | 10264978.04 | 38938.956 | ns/op | 0.003 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL | 2 | 72 | N/A | avgt | 10 | 19652419.64 | 17953481.41 | 10987.17 | ns/op | 0.095 > com.arm.benchmarks.intrinsics.StringCompareToDifferentLength.compareToUL ... This pull request has now been integrated. Changeset: d104debe Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/d104debe51d8feb35b7c672a9d05404208bc5526 Stats: 21 lines in 2 files changed: 4 ins; 15 del; 2 mod 8349556: RISC-V: improve the performance when -COH and -AvoidUnalignedAccesses for UL and LU string comparison Reviewed-by: fyang, vkempik ------------- PR: https://git.openjdk.org/jdk/pull/23495 From dlunden at openjdk.org Mon Feb 10 15:56:14 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Mon, 10 Feb 2025 15:56:14 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v13] In-Reply-To: References: Message-ID: <-KC83AQUZqpnDH0lG1yvKaVUV3H5kSN8cQLU1x4e06o=.dbe58228-89fe-4128-9cdb-3953d9215d59@github.com> On Fri, 7 Feb 2025 14:36:01 GMT, Daniel Lundén wrote: >> Thanks for having yet another look at this! >> >>> If I comment out the line it solves all the failures we have seen. I double-checked that we then perform the exact >> MergeMem/Phi swap idealizations discussed above. >> >> That sounds promising! Looks like this temporary restriction became quite permanent - it's from initial load. I'm wondering if that is still necessary and if so if we have tests to catch that (we would probably hit the "infinite loop in IGVN" in that case).
>> >>> I am wondering what the proper solution is here. I will, of course, investigate if it is possible to loosen the restriction and still ensure termination. On the other hand, it also seems strange that the anti-dependence search is so sensitive to missing idealizations? >> >> That would be great if we can get around this termination issue somehow - if it's still a problem. I think that is very unfortunate that we might be relying on this Ideal transformation to be applied to ensure correctness later on. If it's really required, we should at least make sure to add some verification code to catch this in debug builds. You could, for example, just turn what you have now into verification code, i.e. check that we cannot find another anti dependency edge with another search root. And/Or re-apply this particular transformation for each Phi node again in the end to see if we missed some swaps. > > I have now investigated the `PhiNode::Ideal` restriction above. In summary, I have not found any simple change that resolves the present issue *and* does not introduce problems elsewhere. > > Here is what I have tried. Both changes solve the present issue. > 1. Remove the restriction entirely. As the source code comment suggests, this results in non-termination (which is very easy to verify). The reason is that memory Phis are (naturally) often circular, and there are plenty of cases where we push MergeMems indefinitely across circular Phis. > 2. Only apply the idealization if we can guarantee that we are not pushing MergeMems over Phis in a circular manner. I check this through a complete upwards walk of the memory graph from the current Phi to ensure we cannot reach it from itself. This is likely quite expensive and we can probably do something more clever. It kind of works, but there are still tests that fail. 
Even if we now do terminate, I suspect we still have a combinatorial explosion of new split Phi nodes in certain cases, because we hit the MemLimit in many of the failing tests. > > I can and probably will continue to investigate option 2, but it feels like that should be a separate RFE. I'm open to suggestions. > >> You could, for example, just turn what you have now into verification code, i.e. check that we cannot find another anti dependency edge with another search root. > > @chhagedorn Yes, I agree. If we want to enforce a memory graph invariant at the time of `insert_anti_dependences`, we should also assert that it holds as best we can. > @dlunde do we know at which point in the compilation chain the disjoint memory state invariant (that the above idealization restores) is broken? Would it be possible to do some analysis at that point to "simply" avoid producing the problematic memory subgraph in the first place? I've looked at this before, and have now revisited it. It *does* look like the problematic memory subgraph results from loop peeling, even though I discarded this hypothesis before together with @chhagedorn. @chhagedorn In our investigation where you helped me manually turn off loop peeling (`TestNoPeeling.java` attached to the issue), we only turned off full loop peeling. We actually still run partial peeling in that example. If I add `-XX:-PartialPeelLoop` to the set of flags (in addition to your patch turning off standard loop peeling), the issue disappears. Below is a small example demonstrating what happens during loop peeling. Left is before partial peeling and right after. Before peeling, there is a path from `7 Parm` (`initial_mem`) to `183 Phi`. This ensures we properly raise the LCA and add anti-dependence edges. After peeling, we have cloned the loop body (in `PhaseIdealLoop::clone_loop`, resulting in the clone `325 MergeMem` of `231 MergeMem`) and then merged the clones with a new `339 Phi`.
The path from `7 Parm` to `183 Phi` is now blocked and we fail to raise the LCA and add anti-dependence edges. ![Screenshot from 2025-02-10 16-29-15](https://github.com/user-attachments/assets/9551c43c-c72a-4bc5-989a-918d61e2e335) > I wonder if the core invariant we want to assert here is that two memory states with aliasing slices never overlap in time, after GCM and LCM are done. This could be checked by performing liveness analysis of the memory subgraph after GCM and LCM. This may sound expensive to compute but it could turn out to be acceptable in practice (for debug builds). The similarly expensive register-level liveness analysis in PhaseOutput::perform_mach_node_analysis takes no more than 1-2% of the entire C2 execution time, on average. Sounds like a great idea, but I think we need to discuss the details further first. It is not quite clear to me yet what it is we want to assert. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22852#discussion_r1949383171 From stefank at openjdk.org Mon Feb 10 16:26:12 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Mon, 10 Feb 2025 16:26:12 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 19:43:29 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. 
>> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero and Minimal VM builds We have a similar situation with oopDesc, which is not allowed to have a vtable. The solution there is to use the Klass as the proxy vtable and then have a bunch of Klass::oop_ functions that act like virtual dispatch functions for associated oopDesc functions. I wonder if a similar approach can be used here? Such an approach would (to me at least) have the benefit that we don't have to spread switch statements in various functions in the top-most class. If you are interested in seeing a prototype of this, take a look at this branch: https://github.com/openjdk/jdk/compare/master...stefank:jdk:code_blob_vptr Just a suggestion if you want to consider alternatives to these switch statements. ------------- PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2606457754 From kvn at openjdk.org Mon Feb 10 16:39:19 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Feb 2025 16:39:19 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: <0LQ3b0zaCg8HEDx4C5xM8W4-qmQ9PkoAClhyVxKxxtE=.8cd94c7a-8496-436c-8387-6aa443942bb6@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> <0LQ3b0zaCg8HEDx4C5xM8W4-qmQ9PkoAClhyVxKxxtE=.8cd94c7a-8496-436c-8387-6aa443942bb6@github.com> Message-ID: <1P7Q-yHC0Ho8DPfgzZfxR27NmNQPJ4LcgEbilqdaVNw=.0c023c74-b3d9-4139-8363-5ebdf1a1805d@github.com> On Mon, 10 Feb 2025 11:04:38 GMT, Andrew Dinn wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero and Minimal VM builds > > src/hotspot/share/code/codeBlob.cpp line 58: > >> 56: #include >> 57: >> 58: // Virtual methods are not allowed in code blobs to simplify caching compiled code.
>
> Is it worth considering generating this code plus also some of the existing code in the header using an iterator template macro? e.g.
>
>     #define CODEBLOBS_DO(do_codeblob_abstract, do_codeblob_nonleaf, \
>                          do_codeblob_leaf) \
>       do_codeblob_abstract(CodeBlob) \
>       do_codeblob_leaf(nmethod, Nmethod, nmethod) \
>       do_codeblob_abstract(RuntimeBlob) \
>       do_codeblob_nonleaf(BufferBlob, Buffer, buffer) \
>       do_codeblob_leaf(AdapterBlob, Adapter, adapter) \
>       . . . \
>       do_codeblob_leaf(RuntimeStub, Runtime_Stub, runtime_stub) \
>       . . .
>
> The macro arguments to the templates would themselves be macros:
>
>     do_codeblob_abstract(classname) // abstract, non-instantiable class
>     do_codeblob_nonleaf(classname, kindname, accessorname) // instantiable, subclassable
>     do_codeblob_leaf(classname, kindname, accessorname) // instantiable, non-subclassable
>
> Using a template macro like this to generate the code below -- *plus also* some of the code currently declared piecemeal in the header -- would guarantee all cases are covered now and will remain so later when the macro is updated. I think it would probably also allow case handling code in AOT cache code to be generated.
>
> So, we would generate the code here as follows
>
>     #define EMPTY1(classname)
>     #define EMPTY3(classname, kindname, accessorname)
>
>     #define assert_nonvirtual_leaf(classname, kindname, accessorname) \
>       static_assert(!std::is_polymorphic<classname>::value, \
>                     "no virtual methods are allowed in " #classname );
>
>     CODEBLOBS_DO(EMPTY1, EMPTY3, assert_nonvirtual_leaf)
>
>     #undef assert_nonvirtual_leaf
>
> Likewise in codeBlob.hpp we could generate `enum CodeBlobKind` to cover all the non-abstract classes and likewise generate the accessor methods `is_nmethod()`, `is_buffer_blob()` in class `CodeBlob` which allow the kind to be tested.
>
>     #define codekind_enum_tag(classname, kindname, accessorname) \
>       kindname,
>
>     enum CodeBlobKind : u1 {
>       None,
>       CODEBLOBS_DO(EMPTY1, codekind_enum_tag, codekind_enum_tag)
>       Number_Of_Kinds
>     };
>
> ...

Thank you, @adinn, for the suggestion, but no: I don't like macros. They are hard to debug and they add more complexity in this case.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1949483501 From kvn at openjdk.org Mon Feb 10 16:50:12 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Feb 2025 16:50:12 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 10 Feb 2025 03:25:30 GMT, Chris Plummer wrote: >> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 38: >> >>> 36: public class CodeCache { >>> 37: private static GrowableArray heapArray; >>> 38: private static VirtualConstructor virtualConstructor; >> >> What is the reason for switching from the virtualConstructor/hashMap approach to using getClassFor()? The hashmap is the model for JavaThread, MetaData, and CollectedHeap subtypes. > > I think I found the answer. Since there is no longer a vtable, TypeDataBase.addressTypeIsEqualToType() will no longer work for Codeblobs. I was wondering if the lack of a vtable might have some negative impact. Glad to see you found a solution. I hope the lack of a vtable does not creep in elsewhere. I know it will have some negative impact on things like the "findpc" functionality, which will no longer be able to tell the user that an address points to a Codeblob instance. There's no test for this, but users might run across it. > What is the reason for switching from the virtualConstructor/hashMap approach to using getClassFor()? The hashmap is the model for JavaThread, MetaData, and CollectedHeap subtypes.
I no longer need a mapping from each CodeBlob class to its corresponding virtual table name, because the virtual table does not exist anymore. The value of the `CodeBlob::_kind` field is used to determine which class should be used. I think `hashMap` is overkill here. I could construct an array `Class cbClasses[]` and use `cbClasses[CodeBlob::_kind]` instead of the `if/else` in `getClassFor`. But I would still need to check for an unknown value of `CodeBlob::_kind` somehow. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1949505126 From kvn at openjdk.org Mon Feb 10 17:06:13 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Feb 2025 17:06:13 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 10 Feb 2025 16:23:53 GMT, Stefan Karlsson wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero and Minimal VM builds > > We have a similar situation with oopDesc that are not allowed to have a vtable. The solution there is to use the Klass as the proxy vtable and then have a bunch of Klass::oop_ functions that act like virtual dispatch functions for associated oopDesc functions. > > I wonder if a similar approach can be used here? Such an approach would (to me at least) have the benefit that we don't have to spread switch statements in various functions in the top-most class. > > If you are interested in seeing a prototype of this, take a look at this branch: > https://github.com/openjdk/jdk/compare/master...stefank:jdk:code_blob_vptr > > Just a suggestion if you want to consider alternatives to these switch statements. Thank you, @stefank. This is a very interesting suggestion, which I may take. I will check it.
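For reference, the `cbClasses[CodeBlob::_kind]` idea mentioned above for `getClassFor` might look roughly like the sketch below. This is only an illustration under stated assumptions, not the code in the PR: the nested wrapper classes are hypothetical stand-ins for the serviceability agent classes, the numeric kind values (with slot 0 standing for `None`) are assumed rather than taken from `codeBlob.hpp`, and `classForKind` is a made-up helper name. The range check is one possible answer to the "unknown value" concern.

```java
import java.util.Optional;

final class CodeBlobKindLookup {
    // Hypothetical stand-ins for the agent-side wrapper classes.
    static class CodeBlob {}
    static final class NMethod extends CodeBlob {}
    static final class BufferBlob extends CodeBlob {}
    static final class AdapterBlob extends CodeBlob {}

    // Index = assumed numeric value of CodeBlob::_kind.
    // Slot 0 plays the role of "None" and therefore maps to no class.
    private static final Class<?>[] CB_CLASSES = {
        null,
        NMethod.class,
        BufferBlob.class,
        AdapterBlob.class
    };

    // Returns the wrapper class for a kind value, or an empty Optional
    // when the value is out of range or has no class assigned.
    static Optional<Class<?>> classForKind(int kind) {
        if (kind < 0 || kind >= CB_CLASSES.length) {
            return Optional.empty();
        }
        return Optional.ofNullable(CB_CLASSES[kind]);
    }
}
```

A caller can then decide what an unknown kind means (for example, report "not a known CodeBlob") instead of failing with an `ArrayIndexOutOfBoundsException`.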
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2648688942 From duke at openjdk.org Mon Feb 10 21:05:28 2025 From: duke at openjdk.org (Mohamed Issa) Date: Mon, 10 Feb 2025 21:05:28 GMT Subject: RFR: 8349579: jsvml.dll incorrect RDATA SEGMENT specification Message-ID: A fix for incorrectly defined program segments in Windows SVML assembly. - Changes _READ_ to _READONLY_ in all math functions - Now compliant with MASM x86 and x86_64 program segment [specification](https://learn.microsoft.com/en-us/cpp/assembler/masm/segment?view=msvc-170) The tier1 tests show the changes didn't introduce new failures. ------------- Commit messages: - Change READ to READONLY to comply with MASM program segment specification Changes: https://git.openjdk.org/jdk/pull/23503/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23503&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349579 Stats: 39 lines in 36 files changed: 0 ins; 0 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/23503.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23503/head:pull/23503 PR: https://git.openjdk.org/jdk/pull/23503 From sviswanathan at openjdk.org Mon Feb 10 21:05:28 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 10 Feb 2025 21:05:28 GMT Subject: RFR: 8349579: jsvml.dll incorrect RDATA SEGMENT specification In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 21:34:39 GMT, Mohamed Issa wrote: > A fix for incorrectly defined program segments in Windows SVML assembly. > > - Changes _READ_ to _READONLY_ in all math functions > - Now compliant with MASM x86 and x86_64 program segment [specification](https://learn.microsoft.com/en-us/cpp/assembler/masm/segment?view=msvc-170) > > The tier1 tests show the changes didn't introduce new failures. Mohamed Issa (github id @missa-prime) is part of Intel Java team and is covered by Intel OCA. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23503#issuecomment-2641310778 From jwaters at openjdk.org Mon Feb 10 21:05:28 2025 From: jwaters at openjdk.org (Julian Waters) Date: Mon, 10 Feb 2025 21:05:28 GMT Subject: RFR: 8349579: jsvml.dll incorrect RDATA SEGMENT specification In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 21:34:39 GMT, Mohamed Issa wrote: > A fix for incorrectly defined program segments in Windows SVML assembly. > > - Changes _READ_ to _READONLY_ in all math functions > - Now compliant with MASM x86 and x86_64 program segment [specification](https://learn.microsoft.com/en-us/cpp/assembler/masm/segment?view=msvc-170) > > The tier1 tests show the changes didn't introduce new failures. Just my 2 cents, this would be more accurately described as jsvml.dll rather than libsvml.so, since this is a Windows only issue ------------- PR Comment: https://git.openjdk.org/jdk/pull/23503#issuecomment-2641962424 From duke at openjdk.org Mon Feb 10 21:05:28 2025 From: duke at openjdk.org (Mohamed Issa) Date: Mon, 10 Feb 2025 21:05:28 GMT Subject: RFR: 8349579: jsvml.dll incorrect RDATA SEGMENT specification In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 05:11:34 GMT, Julian Waters wrote: > Just my 2 cents, this would be more accurately described as jsvml.dll rather than libsvml.so, since this is a Windows only issue Updated - thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23503#issuecomment-2643455398 From duke at openjdk.org Mon Feb 10 21:25:00 2025 From: duke at openjdk.org (erifan) Date: Mon, 10 Feb 2025 21:25:00 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() Message-ID: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> Constant multiplication x*C can be optimized as LEFT SHIFT, ADD or SUB instructions since generally these instructions have smaller latency and larger throughput on most architectures. 
For example:
1. x*8 can be optimized as x<<3.
2. x*9 can be optimized as x + (x<<3), which can be lowered to a single SHIFT-ADD instruction (ADD combined with LEFT SHIFT) on some architectures, such as aarch64 and x86_64.

Currently OpenJDK implements a few such patterns in the mid-end, including:
1. |C| = 1<<n (n>0)
2. |C| = (1<<n) - 1 (n>0)
3. |C| = (1<<m) + (1<<n) (m>n, n>=0)

The first two are fine, because on most architectures they are lowered to a single ADD/SUB/SHIFT instruction.

But the third pattern doesn't always perform well on some architectures, such as aarch64. It can be split into the following sub-patterns:
3.1. C = (1<<n) + 1 (n>0)
3.2. C = -((1<<n) + 1) (n>0)
3.3. C = (1<<m) + (1<<n) (m>n, n>0)
3.4. C = -((1<<m) + (1<<n)) (m>n, n>0)

According to the Arm optimization guide, if the shift amount is greater than 4, the latency and throughput of the ADD instruction are the same as those of the MUL instruction, so in this case converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example.

Before (MUL is not converted):

    mov x1, #C
    mul x2, x1, x0

Now (MUL is converted):

For 3.1:

    add x2, x0, x0, lsl #n

For 3.2:

    add x2, x0, x0, lsl #n // same cost as mul if n > 4
    neg x2, x2

For 3.3:

    lsl x1, x0, #m
    add x2, x1, x0, lsl #n // same cost as mul if n > 4

For 3.4:

    lsl x1, x0, #m
    add x2, x1, x0, lsl #n // same cost as mul if n > 4
    neg x2, x2

Test results (ns/op) on Arm Neoverse V2:

| Benchmark | Before | Now | Uplift | Pattern | Notes |
|---|---|---|---|---|---|
| testInt9 | 103.379 | 60.702 | 1.70305756 | 3.1 | |
| testIntN33 | 103.231 | 106.825 | 0.96635619 | 3.2 | n > 4 |
| testIntN9 | 103.448 | 103.005 | 1.004300762 | 3.2 | n <= 4 |
| testInt18 | 103.354 | 99.271 | 1.041129837 | 3.3 | m <= 4, n <= 4 |
| testInt36 | 103.396 | 99.186 | 1.042445506 | 3.3 | m > 4, n <= 4 |
| testInt96 | 103.337 | 105.416 | 0.980278136 | 3.3 | m > 4, n > 4 |
| testIntN18 | 103.333 | 139.258 | 0.742025593 | 3.4 | m <= 4, n <= 4 |
| testIntN36 | 103.208 | 139.132 | 0.741799155 | 3.4 | m > 4, n <= 4 |
| testIntN96 | 103.367 | 139.471 | 0.74113615 | 3.4 | m > 4, n > 4 |

**(S1) From this point on, we should treat pattern 3 as follows:** 3.1 C = (1<0) 3.2 C = -((1<n, 00) 3.2 C = -((1<n, n>0) 3.4 C = -((1<0, 1.7) (disable, 0.75) (n>0, 1.3) 3.2
(0n, 0n, n>0, 1.03) 3.4 (disable, 0.74) (disable, 0.30) (disable, 0.74)

Pattern 3.1 is similar to pattern 2: it is usually lowered to a single instruction, so we tend to keep it in the mid-end.
For 3.2, we tend to disable it in the mid-end and do S1 in the back-end if it's profitable.
For 3.3, although S3 shows a 3% performance gain, S2 shows a 31% performance regression, so we tend to disable it in the mid-end and redo S1 in the back-end.
For 3.4, we shouldn't do this optimization anywhere.

In theory, auto-vectorization should be able to generate the best vectorized code, and cases that cannot be vectorized will be converted into other, more optimal scalar instructions in the architecture backend (this is what gcc and llvm do). However, we currently do not have a cost model or vplan, and the results of auto-vectorization are significantly affected by its input. Therefore, this patch turns off patterns 3.2, 3.3 and 3.4 in the mid-end; where profitable, these patterns can then be implemented in the backend. If we implement a cost model and vplan in the future, it would be best to move all patterns to the backend; this patch does not conflict with that direction.

I also tested this patch on Arm N1, Intel SPR and AMD Genoa machines. No noticeable performance degradation was seen on any of the machines.
Here are the test results on an Arm V2 and an AMD Genoa machine:

| Benchmark | V2-now | V2-after | V2 uplift | Genoa-now | Genoa-after | Genoa uplift | Pattern | Notes |
|---|---|---|---|---|---|---|---|---|
| testInt8 | 60.36989 | 60.276736 | 1 | 116.768294 | 116.772547 | 0.99 | 1 | |
| testInt8AddSum | 63.658064 | 63.797732 | 0.99 | 16.04973 | 16.051491 | 0.99 | 1 | |
| testInt8Store | 38.829618 | 39.054129 | 0.99 | 19.857453 | 20.006321 | 0.99 | 1 | |
| testIntN8 | 59.99655 | 60.150053 | 0.99 | 132.269926 | 132.252473 | 1 | 1 | |
| testIntN8AddSum | 145.678098 | 146.181549 | 0.99 | 158.546226 | 158.806476 | 0.99 | 1 | |
| testIntN8Store | 32.802445 | 32.897907 | 0.99 | 19.047873 | 19.065941 | 0.99 | 1 | |
| testInt7 | 98.978213 | 99.176574 | 0.99 | 114.07026 | 113.08989 | 1 | 2 | |
| testInt7AddSum | 62.675636 | 62.310799 | 1 | 23.370851 | 20.971655 | 1.11 | 2 | |
| testInt7Store | 32.850828 | 32.923315 | 0.99 | 23.884952 | 23.628681 | 1.01 | 2 | |
| testIntN7 | 60.27949 | 60.668158 | 0.99 | 174.224893 | 174.102295 | 1 | 2 | |
| testIntN7AddSum | 62.746696 | 62.288476 | 1 | 20.93192 | 20.964557 | 0.99 | 2 | |
| testIntN7Store | 32.812906 | 32.851355 | 0.99 | 23.810024 | 23.526074 | 1.01 | 2 | |
| testInt9 | 60.820402 | 60.331938 | 1 | 108.850777 | 108.846161 | 1 | 3.1 | |
| testInt9AddSum | 62.24679 | 62.374637 | 0.99 | 20.698749 | 20.741137 | 0.99 | 3.1 | |
| testInt9Store | 32.871723 | 32.912065 | 0.99 | 19.055537 | 19.080735 | 0.99 | 3.1 | |
| testIntN33 | 106.517618 | 103.450746 | 1.02 | 153.894345 | 140.641135 | 1.09 | 3.2 | n > 4 |
| testIntN33AddSum | 147.589815 | 47.911612 | 3.08 | 153.851885 | 17.008453 | 9.04 | 3.2 | |
| testIntN33Store | 75.434513 | 43.473053 | 1.73 | 26.612181 | 20.436323 | 1.3 | 3.2 | |
| testIntN9 | 102.173268 | 103.70682 | 0.98 | 155.858169 | 140.718967 | 1.1 | 3.2 | n <= 4 |
| testIntN9AddSum | 148.724952 | 47.963305 | 3.1 | 186.902111 | 20.249414 | 9.23 | 3.2 | |
| testIntN9Store | 74.783788 | 43.339188 | 1.72 | 20.150159 | 20.888448 | 0.96 | 3.2 | |
| testInt18 | 98.905625 | 102.942092 | 0.96 | 142.480636 | 140.748778 | 1.01 | 3.3 | m <= 4, n <= 4 |
| testInt18AddSum | 68.695585 | 48.103536 | 1.42 | 26.88524 | 16.77886 | 1.6 | 3.3 | |
| testInt18Store | 41.307909 | 43.385183 | 0.95 | 21.233238 | 20.875026 | 1.01 | 3.3 | |
| testInt36 | 99.039742 | 103.714745 | 0.95 | 142.265806 | 142.334039 | 0.99 | 3.3 | m > 4, n <= 4 |
| testInt36AddSum | 68.736756 | 47.952189 | 1.43 | 26.868362 | 17.030035 | 1.57 | 3.3 | |
| testInt36Store | 41.403698 | 43.414093 | 0.95 | 21.225454 | 20.52266 | 1.03 | 3.3 | |
| testInt96 | 105.00287 | 103.528144 | 1.01 | 237.649526 | 140.643255 | 1.68 | 3.3 | m > 4, n > 4 |
| testInt96AddSum | 68.481133 | 48.04549 | 1.42 | 26.877407 | 16.918209 | 1.58 | 3.3 | |
| testInt96Store | 41.276292 | 43.512994 | 0.94 | 23.456117 | 20.540181 | 1.14 | 3.3 | |
| testIntN18 | 138.629044 | 103.269657 | 1.34 | 210.315628 | 140.716818 | 1.49 | 3.4 | m <= 4, n <= 4 |
| testIntN18AddSum | 156.635652 | 48.003989 | 3.26 | 215.807135 | 16.917665 | 12.75 | 3.4 | |
| testIntN18Store | 57.584487 | 43.410415 | 1.32 | 26.819827 | 20.707778 | 1.29 | 3.4 | |
| testIntN36 | 139.068861 | 103.766774 | 1.34 | 209.522432 | 140.720322 | 1.48 | 3.4 | m > 4, n <= 4 |
| testIntN36AddSum | 156.36928 | 48.027779 | 3.25 | 215.705842 | 16.893192 | 12.76 | 3.4 | |
| testIntN36Store | 57.715418 | 43.493958 | 1.32 | 21.651252 | 20.676877 | 1.04 | 3.4 | |
| testIntN96 | 139.151761 | 103.453665 | 1.34 | 269.254161 | 140.753499 | 1.91 | 3.4 | m > 4, n > 4 |
| testIntN96AddSum | 153.123557 | 48.110524 | 3.18 | 263.262635 | 17.011144 | 15.47 | 3.4 | |
| testIntN96Store | 57.793179 | 43.47574 | 1.32 | 24.444592 | 20.530219 | 1.19 | 3.4 | |

Limitations:
1. This patch only analyzes two vector cases, there may be other vector cases that may get performance regression with this patch.
2. This patch does not implement the disabled patterns in the backend, I will propose a follow-up patch to implement these patterns in the aarch64 backend.
3. This patch does not handle the long type, because different architectures have different auto-vectorization support for long type, resulting in very different performance, and it is difficult to find a solution that does not introduce significant performance degradation.
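As an illustration only (this is not code from the patch), the rewrites discussed in this description rest on plain two's-complement identities, which can be sketched in Java. The class and method names are made up for this example; the constants 8, 7, 9, -9, 18 and -18 are chosen to match the benchmark names and their Pattern column above (testInt8, testInt7, testInt9, testIntN9, testInt18, testIntN18). Each method returns the same value as the multiplication for every `int`, including on overflow, because shifts, adds and negation all wrap modulo 2^32.

```java
final class MulStrengthReduction {
    // Pattern 1 (testInt8): x * 8 as a single shift.
    static int mulBy8(int x) {
        return x << 3;
    }

    // Pattern 2 (testInt7): x * 7 as shift and subtract.
    static int mulBy7(int x) {
        return (x << 3) - x;
    }

    // Pattern 3.1 (testInt9): x * 9 as shift and add.
    // The parentheses matter: x + x << 3 would parse as (x + x) << 3.
    static int mulBy9(int x) {
        return x + (x << 3);
    }

    // Pattern 3.2 (testIntN9): x * -9 needs an extra negation.
    static int mulByNeg9(int x) {
        return -(x + (x << 3));
    }

    // Pattern 3.3 (testInt18): x * 18 = x * 16 + x * 2, two shifts and an add.
    static int mulBy18(int x) {
        return (x << 4) + (x << 1);
    }

    // Pattern 3.4 (testIntN18): x * -18, two shifts, an add and a negation.
    static int mulByNeg18(int x) {
        return -((x << 4) + (x << 1));
    }
}
```

The instruction counts visible here (one operation for 8, two for 7 and 9, three for -9 and 18, four for -18) line up with the discussion of which patterns are worth keeping in the mid-end; whether a given form is actually faster than a multiply remains a per-architecture question, as the measurements show.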
------------- Commit messages: - 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() Changes: https://git.openjdk.org/jdk/pull/22922/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22922&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8346964 Stats: 472 lines in 5 files changed: 430 ins; 15 del; 27 mod Patch: https://git.openjdk.org/jdk/pull/22922.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22922/head:pull/22922 PR: https://git.openjdk.org/jdk/pull/22922 From epeter at openjdk.org Mon Feb 10 21:25:01 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 10 Feb 2025 21:25:01 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() In-Reply-To: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> References: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> Message-ID: On Mon, 6 Jan 2025 07:55:39 GMT, erifan wrote: > Constant multiplication x*C can be optimized as LEFT SHIFT, ADD or SUB instructions since generally these instructions have smaller latency and larger throughput on most architectures. For example: > 1. x*8 can be optimized as x<<3. > 2. x*9 can be optimized as x+x<<3, and x+x<<3 can be lowered as one SHIFT-ADD (ADD instruction combined with LEFT SHIFT) instruction on some architectures, like aarch64 and x86_64. > > Currently OpenJDK implemented a few such patterns in mid-end, including: > 1. |C| = 1<0) > 2. |C| = (1<0) > 3. |C| = (1<n, n>=0) > > The first two are ok. Because on most architectures they are lowered as only one ADD/SUB/SHIFT instruction. > > But the third pattern doesn't always perform well on some architectures, such as aarch64. The third pattern can be split as the following sub patterns: > 3.1. C = (1<0) > 3.2. C = -((1<0) > 3.3. C = (1<n, n>0) > 3.4. 
C = -((1<n, n>0) > > According to Arm optimization guide, if the shift amount > 4, the latency and throughput of ADD instruction is the same with MUL instruction. So in this case, converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example. > > Before (MUL is not converted): > > mov x1, #C > mul x2, x1, x0 > > > Now (MUL is converted): > For 3.1: > > add x2, x0, x0, lsl #n > > > For 3.2: > > add x2, x0, x0, lsl #n // same cost with mul if n > 4 > neg x2, x2 > > > For 3.3: > > lsl x1, x0, #m > add x2, x1, x0, lsl #n // same cost with mul if n > 4 > > > For 3.4: > > lsl x1, x0, #m > add x2, x1, x0, lsl #n // same cost with mul if n > 4 > neg x2, x2 > > > Test results (ns/op) on Arm Neoverse V2: > > Before Now Uplift Pattern Notes > testInt9 103.379 60.702 1.70305756 3.1 > testIntN33 103.231 106.825 0.96635619 3.2 n > 4 > testIntN9 103.448 103.005 1.004300762 3.2 n <= 4 > testInt18 103.354 99.271 1.041129837 3.3 m <= 4, n <= 4 > testInt36 103.396 99.186 1.042445506 3.3 m > 4, n <= 4 > testInt96 103.337 105.416 0.980278136 3.3 m > 4, n > 4 > testIntN18 103.333 139.258 0.742025593 3.4 m <= 4, n <= 4 > testIntN36 103.208 139.132 0.741799155 3.4 m > 4, n <= 4 > testIntN96 103.367 139.471 0.74113615 3.4 m > 4, n > 4 > > > **(S1) From this point on, we should treat pattern 3 as follows:** > 3.1 C = (1<0) > 3.2 C = -((1< 3.3 C = (1<n, 0 3.4 C = -((1< > Since this conversion is implemented in mid-end, it impacts... Hi @erifan I have some first questions / comments. I only scanned through quickly. My biggest question: You only mention aarch64. But we would need to know that your changes also work well on x64. Also: Can you summarize if your changes are only for performane of vectorization, or also for scalar code? 
src/hotspot/share/opto/mulnode.cpp line 253: > 251: } > 252: > 253: // TODO: abs_con = (1< 261: // > 262: // But if it's not vectorizable, maybe it's profitable to do the conversion on > 263: // some architectures, support it in backends if it's worthwhile. If it is all about vectorization only: we could consider delaying the `Ideal` optimization until after loop-opts. Then we can keep the multiplication for vectorization, and only use shift/add once we know that we cannot vectorize. src/hotspot/share/opto/mulnode.cpp line 273: > 271: new LShiftINode(in(1), phase->intcon(log2i_exact(abs_con - 1)))); > 272: res = new AddINode(in(1), n1); > 273: } else if (is_power_of_2(abs_con + 1)) { So now you only check for `power_of_2 +- 1`, right? But before we also looked at patterns with 2 bits, such as `64 + 8`. You would really need to prove that this is not a loss on any of the platforms we care about, incl. x64. test/hotspot/jtreg/compiler/c2/TestSerialAdditions.java line 148: > 146: @IR(counts = { IRNode.MUL_I, "1" }) > 147: private static int addTo6(int a) { > 148: return a + a + a + a + a + a; Is this an improvement? test/hotspot/jtreg/compiler/loopopts/superword/TestAlignVector.java line 1059: > 1057: b[i * 6 + 1] = (byte) (a[i * 6 + 1] & mask); > 1058: b[i * 6 + 2] = (byte) (a[i * 6 + 2] & mask); > 1059: b[i * 6 + 3] = (byte) (a[i * 6 + 3] & mask); Why did this change? Was `i * 6` not supposed to be changed to `(i << 2) + (i << 1)`? This is the reason why the current impl of VPointer is not parsing this right, but after my patch https://github.com/openjdk/jdk/pull/21926 this will be fixed because we will parse the multiple occurrences of `i` properly. So it looks like you are now sometimes keeping it at `i * 6` instead of splitting it. Why?
------------- PR Review: https://git.openjdk.org/jdk/pull/22922#pullrequestreview-2534960901 PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1905807007 PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1905813470 PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1905816741 PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1905808320 PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1905802482 From duke at openjdk.org Mon Feb 10 21:25:01 2025 From: duke at openjdk.org (erifan) Date: Mon, 10 Feb 2025 21:25:01 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() In-Reply-To: References: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> Message-ID: On Tue, 7 Jan 2025 17:30:04 GMT, Emanuel Peter wrote: >> Constant multiplication x*C can be optimized as LEFT SHIFT, ADD or SUB instructions since generally these instructions have smaller latency and larger throughput on most architectures. For example: >> 1. x*8 can be optimized as x<<3. >> 2. x*9 can be optimized as x+x<<3, and x+x<<3 can be lowered as one SHIFT-ADD (ADD instruction combined with LEFT SHIFT) instruction on some architectures, like aarch64 and x86_64. >> >> Currently OpenJDK implemented a few such patterns in mid-end, including: >> 1. |C| = 1<0) >> 2. |C| = (1<0) >> 3. |C| = (1<n, n>=0) >> >> The first two are ok. Because on most architectures they are lowered as only one ADD/SUB/SHIFT instruction. >> >> But the third pattern doesn't always perform well on some architectures, such as aarch64. The third pattern can be split as the following sub patterns: >> 3.1. C = (1<0) >> 3.2. C = -((1<0) >> 3.3. C = (1<n, n>0) >> 3.4. C = -((1<n, n>0) >> >> According to Arm optimization guide, if the shift amount > 4, the latency and throughput of ADD instruction is the same with MUL instruction. 
So in this case, converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example. >> >> Before (MUL is not converted): >> >> mov x1, #C >> mul x2, x1, x0 >> >> >> Now (MUL is converted): >> For 3.1: >> >> add x2, x0, x0, lsl #n >> >> >> For 3.2: >> >> add x2, x0, x0, lsl #n // same cost with mul if n > 4 >> neg x2, x2 >> >> >> For 3.3: >> >> lsl x1, x0, #m >> add x2, x1, x0, lsl #n // same cost with mul if n > 4 >> >> >> For 3.4: >> >> lsl x1, x0, #m >> add x2, x1, x0, lsl #n // same cost with mul if n > 4 >> neg x2, x2 >> >> >> Test results (ns/op) on Arm Neoverse V2: >> >> Before Now Uplift Pattern Notes >> testInt9 103.379 60.702 1.70305756 3.1 >> testIntN33 103.231 106.825 0.96635619 3.2 n > 4 >> testIntN9 103.448 103.005 1.004300762 3.2 n <= 4 >> testInt18 103.354 99.271 1.041129837 3.3 m <= 4, n <= 4 >> testInt36 103.396 99.186 1.042445506 3.3 m > 4, n <= 4 >> testInt96 103.337 105.416 0.980278136 3.3 m > 4, n > 4 >> testIntN18 103.333 139.258 0.742025593 3.4 m <= 4, n <= 4 >> testIntN36 103.208 139.132 0.741799155 3.4 m > 4, n <= 4 >> testIntN96 103.367 139.471 0.74113615 3.4 m > 4, n > 4 >> >> >> **(S1) From this point on, we should treat pattern 3 as follows:** >> 3.1 C = (1<0) >> 3.2 C = -((1<> 3.3 C... > > Hi @erifan > > I have some first questions / comments. I only scanned through quickly. > > My biggest question: > You only mention aarch64. But we would need to know that your changes also work well on x64. > > Also: > Can you summarize if your changes are only for performane of vectorization, or also for scalar code? Hi @eme64 , thanks for your review! > My biggest question: You only mention aarch64. But we would need to know that your changes also work well on x64. Yes, this patch also benefits x86_64 platform. I tested the patch on aarch64 V2 and N1 processors, AMD64 Genoa and Intel SPR processors (I can provide the test results if necessary). 
The test results show that for some cases the performance uplift is very large and there is no obvious performance degradation. I'm not familiar with x86_64 instruction set, so I didn't do any theoretical analysis on x86_64 platform, I did very detailed theoretical analysis on aarch64 platform. I don't have machines with architectures other than aarch64 and x64, so this patch is not tested on platforms except for aarch64 and x64. > Also: Can you summarize if your changes are only for performane of vectorization, or also for scalar code? For both vectorization and scalar cases. For example the pattern x * C => -((x< src/hotspot/share/opto/mulnode.cpp line 253: > >> 251: } >> 252: >> 253: // TODO: abs_con = (1< > Is this `TODO` here on purpose? Yes, it's a reminder to implement these two patterns for scalar cases in backends if they are worthwhile on the specific architecture. > src/hotspot/share/opto/mulnode.cpp line 263: > >> 261: // >> 262: // But if it's not vectorizable, maybe it's profitable to do the conversion on >> 263: // some architectures, support it in backends if it's worthwhile. > > If it is all about vectorization only: we could consider delaying the `Ideal` optimization until after loop-opts. Then we can keep the multiplication for vectorization, and only use shift/add once we know that we cannot vectorize. But for some cases, it's better to do the conversion before vectorization, for example: `x * 8 => x << 3 ` Through test results and theoretical analysis (only on aarch64, see the commit message), I found that we'd better to do the conversion before vectorization if the multiplication can be transformed to less than one or two shift/add instructions. > src/hotspot/share/opto/mulnode.cpp line 273: > >> 271: new LShiftINode(in(1), phase->intcon(log2i_exact(abs_con - 1)))); >> 272: res = new AddINode(in(1), n1); >> 273: } else if (is_power_of_2(abs_con + 1)) { > > So now you only check for `power_of_2 +- 1`, right? 
But before we also looked at patterns with 2 bits, such as `64 + 8`. > > You would really need to prove that this is not a loss on any of the platforms we care about, incl. x64. Yes, I have tested the patch on several of x64 machines, including amd Genoa and Intel SPR and some older x86 machines, there's no noticeable performance loss. Test results on aarch64 and amd Genoa were included in the commit message, please take a look, thanks~ ------------- PR Comment: https://git.openjdk.org/jdk/pull/22922#issuecomment-2576610164 PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906216214 PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906220122 PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906222636 From epeter at openjdk.org Mon Feb 10 21:25:01 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 10 Feb 2025 21:25:01 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() In-Reply-To: References: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> Message-ID: On Wed, 8 Jan 2025 03:00:06 GMT, erifan wrote: >> Hi @erifan >> >> I have some first questions / comments. I only scanned through quickly. >> >> My biggest question: >> You only mention aarch64. But we would need to know that your changes also work well on x64. >> >> Also: >> Can you summarize if your changes are only for performane of vectorization, or also for scalar code? > > Hi @eme64 , thanks for your review! > >> My biggest question: You only mention aarch64. But we would need to know that your changes also work well on x64. > > Yes, this patch also benefits x86_64 platform. I tested the patch on aarch64 V2 and N1 processors, AMD64 Genoa and Intel SPR processors (I can provide the test results if necessary). The test results show that for some cases the performance uplift is very large and there is no obvious performance degradation. 
I'm not familiar with x86_64 instruction set, so I didn't do any theoretical analysis on x86_64 platform, I did very detailed theoretical analysis on aarch64 platform. > > I don't have machines with architectures other than aarch64 and x64, so this patch is not tested on platforms except for aarch64 and x64. > >> Also: Can you summarize if your changes are only for performane of vectorization, or also for scalar code? > > For both vectorization and scalar cases. For example the pattern x * C => -((x< > As mentioned in my commit message, this patch is good for most cases, but there are small performance loss for some cases. Ideally, if we implement vplan, I think it would be better to keep the multiplication operation before vectorization and let the vectorizer generate the optimal vector code. Then for the scalar case that cannot be vectorized, convert to other more optimal instructions in the backend ( or in [this pass](https://github.com/openjdk/jdk/pull/21599) if it's merged) if necessary. Hi @erifan Thanks for your responses. I now looked at the benchmark results. I see regressions in the range of 5% on both of your tested platforms. I'm hesitant to accept that now without the follow-up patches standing first. Maybe you can change the order of your RFE's so that we have no regressions in between? Maybe you could even create one big patch with everything in it, just so that we can see that there are no regressions. Then split it up into parts (multiple RFEs) for easier review. Also: It would be nice to see benchmarks on as many different architectures as you can show. And please make sure that the table is nicely aligned - currently it is a bit difficult to read. > f we implement vplan What do you mean by `VPlan`? Are you talking about LLVM? I am working on something a little similar with `VTransform`. But I'm not sure if it is relevant. 
I mean in theory you can also use the vector API, and then it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable. Actually, maybe you should first address that: which vector mul could be vector shift. That would be a nice stand-alone change that you could implement without regressions, right? What do you think? >> src/hotspot/share/opto/mulnode.cpp line 253: >> >>> 251: } >>> 252: >>> 253: // TODO: abs_con = (1<> >> Is this `TODO` here on purpose? > > Yes, it's a reminder to implement these two patterns for scalar cases in backends if they are worthwhile on the specific architecture. Aha. I see. We don't leave TODO's in the code. Because nobody will ever look at it again. If we agree to go ahead with this, then you should rather file an RFE to keep track of this. But before you file it, let's first discuss the over-all strategy. >> src/hotspot/share/opto/mulnode.cpp line 263: >> >>> 261: // >>> 262: // But if it's not vectorizable, maybe it's profitable to do the conversion on >>> 263: // some architectures, support it in backends if it's worthwhile. >> >> If it is all about vectorization only: we could consider delaying the `Ideal` optimization until after loop-opts. Then we can keep the multiplication for vectorization, and only use shift/add once we know that we cannot vectorize. > > But for some cases, it's better to do the conversion before vectorization, for example: > `x * 8 => x << 3 ` > > Through test results and theoretical analysis (only on aarch64, see the commit message), I found that we'd better to do the conversion before vectorization if the multiplication can be transformed to less than one or two shift/add instructions. In theory we could also do a transform with vectors, and convert a vector mul to a vector shift, right? It is a bit scary to have these cases where some are better before and some better after vectorization. 
Makes performance quite unpredictable - your change may introduce improvements in some cases but regressions in others.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22922#issuecomment-2576897834
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906563179
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906567037

From duke at openjdk.org Mon Feb 10 21:25:01 2025 From: duke at openjdk.org (erifan) Date: Mon, 10 Feb 2025 21:25:01 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() In-Reply-To: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> References: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> Message-ID:

On Mon, 6 Jan 2025 07:55:39 GMT, erifan wrote:

> Constant multiplication x*C can be optimized as LEFT SHIFT, ADD or SUB instructions since generally these instructions have smaller latency and larger throughput on most architectures. For example:
> 1. x*8 can be optimized as x<<3.
> 2. x*9 can be optimized as x+(x<<3), and x+(x<<3) can be lowered as one SHIFT-ADD (ADD instruction combined with LEFT SHIFT) instruction on some architectures, like aarch64 and x86_64.
>
> Currently OpenJDK implements a few such patterns in the mid-end, including:
> 1. |C| = 1<<n (n>0)
> 2. |C| = (1<<n) - 1 (n>0)
> 3. |C| = (1<<m) + (1<<n) (m>n, n>=0)
>
> The first two are ok, because on most architectures they are lowered as only one ADD/SUB/SHIFT instruction.
>
> But the third pattern doesn't always perform well on some architectures, such as aarch64. The third pattern can be split into the following sub-patterns:
> 3.1. C = (1<<n) + 1 (n>0)
> 3.2. C = -((1<<n) + 1) (n>0)
> 3.3. C = (1<<m) + (1<<n) (m>n, n>0)
> 3.4. C = -((1<<m) + (1<<n)) (m>n, n>0)
>
> According to the Arm optimization guide, if the shift amount > 4, the latency and throughput of the ADD instruction are the same as those of the MUL instruction. So in this case, converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example.
>
> Before (MUL is not converted):
>
>     mov x1, #C
>     mul x2, x1, x0
>
> Now (MUL is converted):
> For 3.1:
>
>     add x2, x0, x0, lsl #n
>
> For 3.2:
>
>     add x2, x0, x0, lsl #n // same cost as mul if n > 4
>     neg x2, x2
>
> For 3.3:
>
>     lsl x1, x0, #m
>     add x2, x1, x0, lsl #n // same cost as mul if n > 4
>
> For 3.4:
>
>     lsl x1, x0, #m
>     add x2, x1, x0, lsl #n // same cost as mul if n > 4
>     neg x2, x2
>
> Test results (ns/op) on Arm Neoverse V2:
>
>              Before    Now       Uplift       Pattern  Notes
> testInt9     103.379   60.702    1.70305756   3.1
> testIntN33   103.231   106.825   0.96635619   3.2      n > 4
> testIntN9    103.448   103.005   1.004300762  3.2      n <= 4
> testInt18    103.354   99.271    1.041129837  3.3      m <= 4, n <= 4
> testInt36    103.396   99.186    1.042445506  3.3      m > 4, n <= 4
> testInt96    103.337   105.416   0.980278136  3.3      m > 4, n > 4
> testIntN18   103.333   139.258   0.742025593  3.4      m <= 4, n <= 4
> testIntN36   103.208   139.132   0.741799155  3.4      m > 4, n <= 4
> testIntN96   103.367   139.471   0.74113615   3.4      m > 4, n > 4
>
> **(S1) From this point on, we should treat pattern 3 as follows:**
> 3.1 C = (1<<n) + 1 (n>0)
> 3.2 C = -((1<<n) + 1) (0<n<=4)
> 3.3 C = (1<<m) + (1<<n) (m>n, 0<n<=4)
> 3.4 C = -((1<<m) + (1<<n)) (disable)
>
> Since this conversion is implemented in mid-end, it impacts...

About the OCA, I am an employee from NVIDIA's Java compiler team, and NVIDIA has signed the OCA.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22922#issuecomment-2576971084

From duke at openjdk.org Mon Feb 10 21:25:01 2025 From: duke at openjdk.org (erifan) Date: Mon, 10 Feb 2025 21:25:01 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() In-Reply-To: References: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> Message-ID:

On Wed, 8 Jan 2025 07:08:43 GMT, Emanuel Peter wrote:

> Maybe you could even create one big patch with everything in it, just so that we can see that there are no regressions. Then split it up into parts (multiple RFEs) for easier review.
Ok, I'll combine the patches and file one big patch, and update the test results. > Also: It would be nice to see benchmarks on as many different architectures as you can show. And please make sure that the table is nicely aligned - currently it is a bit difficult to read. OK, I can provide test results on aarch64 V2, N1, AMD64 Genoa and Intel SPR processors. > What do you mean by VPlan? Are you talking about LLVM? I am working on something a little similar with VTransform. But I'm not sure if it is relevant. I mean in theory you can also use the vector API, and then it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable. Actually, maybe you should first address that: which vector mul could be vector shift. That would be a nice stand-alone change that you could implement without regressions, right? Yes, I mean LLVM VPlan, I noticed your related work, it's very interesting. > it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable. Yes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22922#issuecomment-2576991041 From epeter at openjdk.org Mon Feb 10 21:25:01 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 10 Feb 2025 21:25:01 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() In-Reply-To: References: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> Message-ID: <6PtcpyIAXa2wbi0CI5-DVvI1r2RRDvKtIWko7nvBDFo=.49b4d6f7-0dda-42e7-9f51-bfa3c06ef6f5@github.com> On Wed, 8 Jan 2025 08:03:54 GMT, erifan wrote: >> Hi @erifan >> >> Thanks for your responses. I now looked at the benchmark results. >> I see regressions in the range of 5% on both of your tested platforms. I'm hesitant to accept that now without the follow-up patches standing first. Maybe you can change the order of your RFE's so that we have no regressions in between? 
>> >> Maybe you could even create one big patch with everything in it, just so that we can see that there are no regressions. Then split it up into parts (multiple RFEs) for easier review. >> >> Also: It would be nice to see benchmarks on as many different architectures as you can show. And please make sure that the table is nicely aligned - currently it is a bit difficult to read. >> >>> f we implement vplan >> What do you mean by `VPlan`? Are you talking about LLVM? I am working on something a little similar with `VTransform`. But I'm not sure if it is relevant. I mean in theory you can also use the vector API, and then it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable. Actually, maybe you should first address that: which vector mul could be vector shift. That would be a nice stand-alone change that you could implement without regressions, right? >> >> What do you think? > >> Maybe you could even create one big patch with everything in it, just so that we can see that there are no regressions. Then split it up into parts (multiple RFEs) for easier review. > > Ok, I'll combine the patches and file one big patch, and update the test results. > >> Also: It would be nice to see benchmarks on as many different architectures as you can show. And please make sure that the table is nicely aligned - currently it is a bit difficult to read. > > OK, I can provide test results on aarch64 V2, N1, AMD64 Genoa and Intel SPR processors. > >> What do you mean by VPlan? Are you talking about LLVM? I am working on something a little similar with VTransform. But I'm not sure if it is relevant. I mean in theory you can also use the vector API, and then it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable. Actually, maybe you should first address that: which vector mul could be vector shift. 
That would be a nice stand-alone change that you could implement without regressions, right? > > Yes, I mean LLVM VPlan, I noticed your related work, it's very interesting. > >> it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable. > > Yes. @erifan I did some more thinking when falling asleep / waking up. This is a really interesting problem here. For `MulINode::Ideal` with patterns `var * con`, we really have these options in assembly: - `mul` general case. - `shift` and `add` when profitable. - `lea` could this be an improvement over `shift` and `add`? The issue is that different platforms have different characteristics here for these instructions - we would have to see how they differ. As far as I remember `mul` is not always available on all `ALU`s, but `add` and `shift` should be available. This impacts their throughput (more ports / ALU means more throughput generally). But the instructions also have different latency. Further, I could imagine that at some point more instructions may not just affect the throughput, but also the code-size: that in turn would increase IR and may at some point affect the instruction cache. Additionally: if your workload has other `mul`, `shift` and `add` mixed in, then some ports may already be saturated, and that could tilt the balance as to which option you are supposed to take. And then the characteristics of scalar ops may not be identical to vector ops. It would be interesting to have a really solid benchmark, where you explore the impact of these different effects. And it would be interesting to extract a table of latency + throughput characteristics for all relevant scalar + vector ops, for a number of different CPUs. Just so we get an overview of how easy this is to tune. Maybe perfect tuning is not possible. Maybe we are willing to take a `5%` regression in some cases to boost other cases by `30%`. 
But that is a **big maybe**: we really do not like getting regressions in existing code, it tends to upset people more if they get regressions compared to how much they enjoy speedups - so work like this can be delicate. Anyway, I don't right now have much time to investigate and work on this myself. So you'd have to do the work, benchmark, explanation etc. **But I think the `30%` speedup indicates that this work could really have potential!** As to what to do in sequence, here a suggestion: 1. First work on Vector API cases of vector multiplication - this should have no impact on other things. 2. Delay the `MulINode::Ideal` optimizations until after loop-opts: scalar code would still be handled in the old way, but auto-vectorized code would then be turned into `MulV`. And then go into the mul -> shift optimization for vectors under point 1. 3. Tackle `MulINode::Ideal` for scalar cases after loop-opts, and see what you can do there. This way you can separate scalar and vector changes. All of this really depends on very good benchmarks, and benchmarks on various platforms. Good presentation would be key here. I find tables with numbers important, but a visual representation on top would be good too - it can give an easier overview over the patterns. And please investigate when `lea` is applicable / profitable. We may also want to get people from ARM and Intel into this discussion at some point. That's enough for now. 
Let me know what you think :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22922#issuecomment-2579182815 From duke at openjdk.org Mon Feb 10 21:25:01 2025 From: duke at openjdk.org (erifan) Date: Mon, 10 Feb 2025 21:25:01 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() In-Reply-To: <6PtcpyIAXa2wbi0CI5-DVvI1r2RRDvKtIWko7nvBDFo=.49b4d6f7-0dda-42e7-9f51-bfa3c06ef6f5@github.com> References: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> <6PtcpyIAXa2wbi0CI5-DVvI1r2RRDvKtIWko7nvBDFo=.49b4d6f7-0dda-42e7-9f51-bfa3c06ef6f5@github.com> Message-ID: On Thu, 9 Jan 2025 04:59:07 GMT, Emanuel Peter wrote: >>> Maybe you could even create one big patch with everything in it, just so that we can see that there are no regressions. Then split it up into parts (multiple RFEs) for easier review. >> >> Ok, I'll combine the patches and file one big patch, and update the test results. >> >>> Also: It would be nice to see benchmarks on as many different architectures as you can show. And please make sure that the table is nicely aligned - currently it is a bit difficult to read. >> >> OK, I can provide test results on aarch64 V2, N1, AMD64 Genoa and Intel SPR processors. >> >>> What do you mean by VPlan? Are you talking about LLVM? I am working on something a little similar with VTransform. But I'm not sure if it is relevant. I mean in theory you can also use the vector API, and then it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable. Actually, maybe you should first address that: which vector mul could be vector shift. That would be a nice stand-alone change that you could implement without regressions, right? >> >> Yes, I mean LLVM VPlan, I noticed your related work, it's very interesting. >> >>> it would be nice if you had a vector mul, that this would also be changed into shifts if that is more profitable. >> >> Yes. 
> > @erifan I did some more thinking when falling asleep / waking up. This is a really interesting problem here. > > For `MulINode::Ideal` with patterns `var * con`, we really have these options in assembly: > - `mul` general case. > - `shift` and `add` when profitable. > - `lea` could this be an improvement over `shift` and `add`? > > The issue is that different platforms have different characteristics here for these instructions - we would have to see how they differ. As far as I remember `mul` is not always available on all `ALU`s, but `add` and `shift` should be available. This impacts their throughput (more ports / ALU means more throughput generally). But the instructions also have different latency. Further, I could imagine that at some point more instructions may not just affect the throughput, but also the code-size: that in turn would increase IR and may at some point affect the instruction cache. > > Additionally: if your workload has other `mul`, `shift` and `add` mixed in, then some ports may already be saturated, and that could tilt the balance as to which option you are supposed to take. > > And then the characteristics of scalar ops may not be identical to vector ops. > > It would be interesting to have a really solid benchmark, where you explore the impact of these different effects. > And it would be interesting to extract a table of latency + throughput characteristics for all relevant scalar + vector ops, for a number of different CPUs. Just so we get an overview of how easy this is to tune. > > Maybe perfect tuning is not possible. Maybe we are willing to take a `5%` regression in some cases to boost other cases by `30%`. But that is a **big maybe**: we really do not like getting regressions in existing code, it tends to upset people more if they get regressions compared to how much they enjoy speedups - so work like this can be delicate. > > Anyway, I don't right now have much time to investigate and work on this myself. 
So you'd have to do the work, benchmark, explanation etc. **But I think the `30%` speedup indicates that this work could really have potential!** > > As to what to do in sequence, here a suggestion: > 1. First work on Vector API cases of vector multiplication - this should have no impact on other things. > 2. Delay the `MulINode::Ideal` optimizations until after loop-opts: scalar code would still be handled in the old way, but auto-vectorized code would then be turned into `MulV`. And then go into the mul -> shift optimization for vectors under point 1. > 3.... Hi @eme64 thanks for your review. 1. First work on Vector API cases of vector multiplication - this should have no impact on other things. 2. Delay the MulINode::Ideal optimizations until after loop-opts: scalar code would still be handled in the old way, but auto-vectorized code would then be turned into MulV. And then go into the mul -> shift optimization for vectors under point 1. 3. Tackle MulINode::Ideal for scalar cases after loop-opts, and see what you can do there. I agree with you. I am actually working on `1`. The slightly troublesome thing is that `1` and `3` are both related to the architecture, so it might take a little more time. > lea could this be an improvement over shift and add? AARCH64 doesn't actually have a `lea` instruction. On x64 there are already some rules that turn `shift add` into `lea`. The issue is that different platforms have different characteristics here for these instructions - we would have to see how they differ. As far as I remember mul is not always available on all ALUs, but add and shift should be available. This impacts their throughput (more ports / ALU means more throughput generally). But the instructions also have different latency. Further, I could imagine that at some point more instructions may not just affect the throughput, but also the code-size: that in turn would increase IR and may at some point affect the instruction cache. 
Additionally: if your workload has other mul, shift and add mixed in, then some ports may already be saturated, and that could tilt the balance as to which option you are supposed to take.

And then the characteristics of scalar ops may not be identical to vector ops.

Yes, this is very tricky. The actual performance depends on many aspects, such as pipeline, latency, throughput, ROB, and even memory performance. We can only optimize based on certain references and generalities, such as the latency and throughput of different instructions. But when it comes to generalities, it is actually difficult to say which scenario is more general.

> It would be interesting to have a really solid benchmark, where you explore the impact of these different effects. And it would be interesting to extract a table of latency + throughput characteristics for all relevant scalar + vector ops, for a number of different CPUs. Just so we get an overview of how easy this is to tune.

I don't know of such a benchmark suite yet. For AARCH64, I usually refer to [the Arm Optimization Guide](https://developer.arm.com/documentation/109898/latest/), but some instructions seem to be missing there. I guess AMD and Intel should have similar documents? I'm not sure.

> Maybe perfect tuning is not possible. Maybe we are willing to take a 5% regression in some cases to boost other cases by 30%. But that is a big maybe: we really do not like getting regressions in existing code, it tends to upset people more if they get regressions compared to how much they enjoy speedups - so work like this can be delicate.

Yes, I agree. I will deal with the performance loss.

> And please investigate when lea is applicable / profitable.

Do you mean shift add => lea? I think this is already done on x64.

> We may also want to get people from ARM and Intel into this discussion at some point.

Yes.
------------- PR Comment: https://git.openjdk.org/jdk/pull/22922#issuecomment-2579263004 From epeter at openjdk.org Mon Feb 10 21:25:01 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 10 Feb 2025 21:25:01 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() In-Reply-To: References: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> <6PtcpyIAXa2wbi0CI5-DVvI1r2RRDvKtIWko7nvBDFo=.49b4d6f7-0dda-42e7-9f51-bfa3c06ef6f5@github.com> Message-ID: On Thu, 9 Jan 2025 06:21:14 GMT, erifan wrote: >> @erifan I did some more thinking when falling asleep / waking up. This is a really interesting problem here. >> >> For `MulINode::Ideal` with patterns `var * con`, we really have these options in assembly: >> - `mul` general case. >> - `shift` and `add` when profitable. >> - `lea` could this be an improvement over `shift` and `add`? >> >> The issue is that different platforms have different characteristics here for these instructions - we would have to see how they differ. As far as I remember `mul` is not always available on all `ALU`s, but `add` and `shift` should be available. This impacts their throughput (more ports / ALU means more throughput generally). But the instructions also have different latency. Further, I could imagine that at some point more instructions may not just affect the throughput, but also the code-size: that in turn would increase IR and may at some point affect the instruction cache. >> >> Additionally: if your workload has other `mul`, `shift` and `add` mixed in, then some ports may already be saturated, and that could tilt the balance as to which option you are supposed to take. >> >> And then the characteristics of scalar ops may not be identical to vector ops. >> >> It would be interesting to have a really solid benchmark, where you explore the impact of these different effects. 
>> And it would be interesting to extract a table of latency + throughput characteristics for all relevant scalar + vector ops, for a number of different CPUs. Just so we get an overview of how easy this is to tune. >> >> Maybe perfect tuning is not possible. Maybe we are willing to take a `5%` regression in some cases to boost other cases by `30%`. But that is a **big maybe**: we really do not like getting regressions in existing code, it tends to upset people more if they get regressions compared to how much they enjoy speedups - so work like this can be delicate. >> >> Anyway, I don't right now have much time to investigate and work on this myself. So you'd have to do the work, benchmark, explanation etc. **But I think the `30%` speedup indicates that this work could really have potential!** >> >> As to what to do in sequence, here a suggestion: >> 1. First work on Vector API cases of vector multiplication - this should have no impact on other things. >> 2. Delay the `MulINode::Ideal` optimizations until after loop-opts: scalar code would still be handled in the old way, but auto-vectorized code would then be turned into `MulV`. And then go into the mul -> sh... > > Hi @eme64 thanks for your review. > > 1. First work on Vector API cases of vector multiplication - this should have no impact on other things. > 2. Delay the MulINode::Ideal optimizations until after loop-opts: scalar code would still be handled in the old way, but auto-vectorized code would then be turned into MulV. And then go into the mul -> shift optimization for vectors under point 1. > 3. Tackle MulINode::Ideal for scalar cases after loop-opts, and see what you can do there. > > I agree with you. I am actually working on `1`. The slightly troublesome thing is that `1` and `3` are both related to the architecture, so it might take a little more time. > >> lea could this be an improvement over shift and add? > > AARCH64 doesn't actually have a `lea` instruction. 
On x64 there are already some rules that turn `shift add` into `lea`. > > The issue is that different platforms have different characteristics here for these instructions - we would have to see how they differ. As far as I remember mul is not always available on all ALUs, but add and shift should be available. This impacts their throughput (more ports / ALU means more throughput generally). But the instructions also have different latency. Further, I could imagine that at some point more instructions may not just affect the throughput, but also the code-size: that in turn would increase IR and may at some point affect the instruction cache. > > Additionally: if your workload has other mul, shift and add mixed in, then some ports may already be saturated, and that could tilt the balance as to which option you are supposed to take. > > And then the characteristics of scalar ops may not be identical to vector ops. > > > Yes this is very trick, the actual performance is related to many aspects, such as pipeline, latency, throughput, ROB, and even memory performance. We can only do optimization based on certain references and generalities, such as latency and throughput of different instructions. But when it comes to generalities, it is actually difficult to say which scenario is more general. > >> It would be interesting to have a really solid benchmark, where you explore the impact of these different effects. > And it would be interesting to extract a table of latency + throughput characteristics for all relevant scalar + vector ops, for a number of different CPUs. Just so we get an overview of how easy this is to tune. > > I don't know such a benchmark suite yet. For AARCH64, I usually refer to [the Arm Optimization Guide](https:... @erifan Amazing, thanks for the enthusiasm :) Looking forward to what you come up with! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22922#issuecomment-2579268951 From jkarthikeyan at openjdk.org Mon Feb 10 21:25:01 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 10 Feb 2025 21:25:01 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() In-Reply-To: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> References: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> Message-ID: On Mon, 6 Jan 2025 07:55:39 GMT, erifan wrote: > Constant multiplication x*C can be optimized as LEFT SHIFT, ADD or SUB instructions since generally these instructions have smaller latency and larger throughput on most architectures. For example: > 1. x*8 can be optimized as x<<3. > 2. x*9 can be optimized as x+x<<3, and x+x<<3 can be lowered as one SHIFT-ADD (ADD instruction combined with LEFT SHIFT) instruction on some architectures, like aarch64 and x86_64. > > Currently OpenJDK implemented a few such patterns in mid-end, including: > 1. |C| = 1<0) > 2. |C| = (1<0) > 3. |C| = (1<n, n>=0) > > The first two are ok. Because on most architectures they are lowered as only one ADD/SUB/SHIFT instruction. > > But the third pattern doesn't always perform well on some architectures, such as aarch64. The third pattern can be split as the following sub patterns: > 3.1. C = (1<0) > 3.2. C = -((1<0) > 3.3. C = (1<n, n>0) > 3.4. C = -((1<n, n>0) > > According to Arm optimization guide, if the shift amount > 4, the latency and throughput of ADD instruction is the same with MUL instruction. So in this case, converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example. 
> > Before (MUL is not converted): > > mov x1, #C > mul x2, x1, x0 > > > Now (MUL is converted): > For 3.1: > > add x2, x0, x0, lsl #n > > > For 3.2: > > add x2, x0, x0, lsl #n // same cost with mul if n > 4 > neg x2, x2 > > > For 3.3: > > lsl x1, x0, #m > add x2, x1, x0, lsl #n // same cost with mul if n > 4 > > > For 3.4: > > lsl x1, x0, #m > add x2, x1, x0, lsl #n // same cost with mul if n > 4 > neg x2, x2 > > > Test results (ns/op) on Arm Neoverse V2: > > Before Now Uplift Pattern Notes > testInt9 103.379 60.702 1.70305756 3.1 > testIntN33 103.231 106.825 0.96635619 3.2 n > 4 > testIntN9 103.448 103.005 1.004300762 3.2 n <= 4 > testInt18 103.354 99.271 1.041129837 3.3 m <= 4, n <= 4 > testInt36 103.396 99.186 1.042445506 3.3 m > 4, n <= 4 > testInt96 103.337 105.416 0.980278136 3.3 m > 4, n > 4 > testIntN18 103.333 139.258 0.742025593 3.4 m <= 4, n <= 4 > testIntN36 103.208 139.132 0.741799155 3.4 m > 4, n <= 4 > testIntN96 103.367 139.471 0.74113615 3.4 m > 4, n > 4 > > > **(S1) From this point on, we should treat pattern 3 as follows:** > 3.1 C = (1<0) > 3.2 C = -((1< 3.3 C = (1<n, 0 3.4 C = -((1< > Since this conversion is implemented in mid-end, it impacts... I also think that moving mul-related idealization to happen after loop opts would be a good idea. In the past I've noticed that patterns such as `x * 4 * 4` do not properly fold into a single multiply, because the mul is strength reduced into left shifts before constant folding takes place (shifts aren't currently constant folded either, but I think that is a separate issue). By doing such strength reductions after loop opts we can fix that issue as well. 
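
As a source-level illustration of the `x * 4 * 4` case above, here is a minimal sketch. It is plain Java, not C2 IR, and the class and method names are made up for this example. It only shows that the three forms compute the same `int` result (including wrap-around on overflow), so folding them into a single multiply or shift would be legal; it says nothing about what C2 actually emits today.

```java
public class MulFoldSketch {
    // What the source expresses: two multiplications by a constant.
    static int viaMul(int x) {
        return x * 4 * 4;
    }

    // What remains if each multiply is strength-reduced on its own
    // before the constants are combined: two dependent shifts.
    static int viaTwoShifts(int x) {
        return (x << 2) << 2;
    }

    // What combining the constants first would allow: x * 16 as one shift.
    static int viaOneShift(int x) {
        return x << 4;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 7, 123456789, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            // Java int arithmetic wraps modulo 2^32, so all three forms must agree.
            if (viaMul(x) != viaTwoShifts(x) || viaMul(x) != viaOneShift(x)) {
                throw new AssertionError("mismatch for x = " + x);
            }
        }
        System.out.println("all forms agree");
    }
}
```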
------------- PR Comment: https://git.openjdk.org/jdk/pull/22922#issuecomment-2629209771 From duke at openjdk.org Mon Feb 10 21:25:01 2025 From: duke at openjdk.org (erifan) Date: Mon, 10 Feb 2025 21:25:01 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() In-Reply-To: References: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> Message-ID: On Wed, 8 Jan 2025 01:35:08 GMT, erifan wrote: >> src/hotspot/share/opto/mulnode.cpp line 273: >> >>> 271: new LShiftINode(in(1), phase->intcon(log2i_exact(abs_con - 1)))); >>> 272: res = new AddINode(in(1), n1); >>> 273: } else if (is_power_of_2(abs_con + 1)) { >> >> So now you only check for `power_of_2 +- 1`, right? But before we also looked at patterns with 2 bits, such as `64 + 8`. >> >> You would really need to prove that this is not a loss on any of the platforms we care about, incl. x64. > > Yes, I have tested the patch on several of x64 machines, including amd Genoa and Intel SPR and some older x86 machines, there's no noticeable performance loss. Test results on aarch64 and amd Genoa were included in the commit message, please take a look, thanks~ > So now you only check for power_of_2 +- 1, right? 
Yes now I only check these patterns: 1, Const = 1< 0) 2, Const = -(1< 0) 3, Const = (1< 0) 4, Const = (1< 0) 5, Const = -((1< 0) Removed these patterns: 1, Const = (1< 0, n > 0) 2, Const = -((1< 0, n > 0) 3, Const = -((1< 0) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906305352 From epeter at openjdk.org Mon Feb 10 21:25:01 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 10 Feb 2025 21:25:01 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() In-Reply-To: References: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> Message-ID: On Tue, 7 Jan 2025 17:17:22 GMT, Emanuel Peter wrote: >> Constant multiplication x*C can be optimized as LEFT SHIFT, ADD or SUB instructions since generally these instructions have smaller latency and larger throughput on most architectures. For example: >> 1. x*8 can be optimized as x<<3. >> 2. x*9 can be optimized as x+x<<3, and x+x<<3 can be lowered as one SHIFT-ADD (ADD instruction combined with LEFT SHIFT) instruction on some architectures, like aarch64 and x86_64. >> >> Currently OpenJDK implemented a few such patterns in mid-end, including: >> 1. |C| = 1<0) >> 2. |C| = (1<0) >> 3. |C| = (1<n, n>=0) >> >> The first two are ok. Because on most architectures they are lowered as only one ADD/SUB/SHIFT instruction. >> >> But the third pattern doesn't always perform well on some architectures, such as aarch64. The third pattern can be split as the following sub patterns: >> 3.1. C = (1<0) >> 3.2. C = -((1<0) >> 3.3. C = (1<n, n>0) >> 3.4. C = -((1<n, n>0) >> >> According to Arm optimization guide, if the shift amount > 4, the latency and throughput of ADD instruction is the same with MUL instruction. So in this case, converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example. 
>> >> Before (MUL is not converted): >> >> mov x1, #C >> mul x2, x1, x0 >> >> >> Now (MUL is converted): >> For 3.1: >> >> add x2, x0, x0, lsl #n >> >> >> For 3.2: >> >> add x2, x0, x0, lsl #n // same cost with mul if n > 4 >> neg x2, x2 >> >> >> For 3.3: >> >> lsl x1, x0, #m >> add x2, x1, x0, lsl #n // same cost with mul if n > 4 >> >> >> For 3.4: >> >> lsl x1, x0, #m >> add x2, x1, x0, lsl #n // same cost with mul if n > 4 >> neg x2, x2 >> >> >> Test results (ns/op) on Arm Neoverse V2: >> >> Before Now Uplift Pattern Notes >> testInt9 103.379 60.702 1.70305756 3.1 >> testIntN33 103.231 106.825 0.96635619 3.2 n > 4 >> testIntN9 103.448 103.005 1.004300762 3.2 n <= 4 >> testInt18 103.354 99.271 1.041129837 3.3 m <= 4, n <= 4 >> testInt36 103.396 99.186 1.042445506 3.3 m > 4, n <= 4 >> testInt96 103.337 105.416 0.980278136 3.3 m > 4, n > 4 >> testIntN18 103.333 139.258 0.742025593 3.4 m <= 4, n <= 4 >> testIntN36 103.208 139.132 0.741799155 3.4 m > 4, n <= 4 >> testIntN96 103.367 139.471 0.74113615 3.4 m > 4, n > 4 >> >> >> **(S1) From this point on, we should treat pattern 3 as follows:** >> 3.1 C = (1<0) >> 3.2 C = -((1<> 3.3 C... > > test/hotspot/jtreg/compiler/c2/TestSerialAdditions.java line 148: > >> 146: @IR(counts = { IRNode.MUL_I, "1" }) >> 147: private static int addTo6(int a) { >> 148: return a + a + a + a + a + a; > > Is this an improvement? Is this an improvement on aarch64 for all implementations? What about x64? > test/hotspot/jtreg/compiler/loopopts/superword/TestAlignVector.java line 1059: > >> 1057: b[i * 6 + 1] = (byte) (a[i * 6 + 1] & mask); >> 1058: b[i * 6 + 2] = (byte) (a[i * 6 + 2] & mask); >> 1059: b[i * 6 + 3] = (byte) (a[i * 6 + 3] & mask); > > Why did this change? > > Was `i * 6` not supposed to be changed to `(i << 2) + (i << 1)`? 
This is the reason why the current impl of VPointer is not parsing this right, but after my patch https://github.com/openjdk/jdk/pull/21926 this will be fixed because we will parse the multiple occurrences of `i` properly.
>
> So it looks like you are now sometimes keeping it at `i * 6` instead of splitting it. Why?

Also: this is very confusing: why does the result differ depending on `AlignVector`?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1905810588
PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1905811309

From duke at openjdk.org Mon Feb 10 21:25:01 2025 From: duke at openjdk.org (erifan) Date: Mon, 10 Feb 2025 21:25:01 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() In-Reply-To: References: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> Message-ID:

On Tue, 7 Jan 2025 17:19:22 GMT, Emanuel Peter wrote:

>> test/hotspot/jtreg/compiler/c2/TestSerialAdditions.java line 148:
>>
>>> 146: @IR(counts = { IRNode.MUL_I, "1" })
>>> 147: private static int addTo6(int a) {
>>> 148: return a + a + a + a + a + a;
>>
>> Is this an improvement?
>
> Is this an improvement on aarch64 for all implementations? What about x64?

If `a*6` is in a loop and can be vectorized, there may be a big performance improvement. If it's not vectorized, there may be a small performance loss. See the test results of `a*18 = (a<<4) + (a<<1)` (the same applies to `a*6 = (a<<2) + (a<<1)`) in three different cases:

Benchmark         V2-now   V2-after   Uplift   Genoa-now   Genoa-after   Uplift   Notes
testInt18         98.90    102.94     0.96     142.48      140.75        1.01     scalar
testInt18AddSum   68.70    48.10      1.42     26.88       16.78         1.6      vectorized
testInt18Store    41.31    43.39      0.95     21.23       20.88         1.01     vectorized

We can see that for the scalar case the conversion `a*6 => (a<<2) + (a<<1)` is profitable on aarch64. I have a follow-up patch to reimplement this pattern in the aarch64 backend; I'll file it later.
But for x64, there is no obvious performance change whether or not this conversion is done. This is also why I left a TODO in [mulnode.cpp](https://github.com/openjdk/jdk/pull/22922/files/193dc4e5760007784cffd64ef14e0050b0be92b3#diff-b1bd52f0743843e15452764f48ff43c15dd3192a28bfb684b34149f0e964996e) >> test/hotspot/jtreg/compiler/loopopts/superword/TestAlignVector.java line 1059: >> >>> 1057: b[i * 6 + 1] = (byte) (a[i * 6 + 1] & mask); >>> 1058: b[i * 6 + 2] = (byte) (a[i * 6 + 2] & mask); >>> 1059: b[i * 6 + 3] = (byte) (a[i * 6 + 3] & mask); >> >> Why did this change? >> >> Was `i * 6` not supposed to be changed to `(i << 2) + (i << 1)`? This is the reason why the current impl of VPointer is not parsing this right, but after my patch https://github.com/openjdk/jdk/pull/21926 this will be fixed because we will parse the multiple occurrences of `i` properly. >> >> So it looks like you are now sometimes keeping it at `i * 6` instead of splitting it. Why? > > Also: this is very confusing: why does the result differ depending on `AlignVector`? Without this patch, because of the VPointer issue you mentioned, this loop does not vectorize at all. With this patch, `i*6` is not changed to `(i<<2) + (i<<1)`, so the VPointer issue is bypassed. So if `AlignVector` is false, this loop will be vectorized. > why does the result differ depending on `AlignVector` ? Because this loop operates on a discontinuous and unaligned array address space. If we require aligned vectors (that is, `AlignVector` is true), this loop will not be vectorized; otherwise it will be vectorized.
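For readers following the thread, the two rewrites discussed above can be checked with a minimal Java sketch. It only confirms that the shift-add forms are arithmetically identical to the multiplications, including `int` overflow wrap-around. It says nothing about performance or about what C2 actually emits. The class and method names (`ShiftAddSketch`, `mul6`, `mul18`, `agrees`) are hypothetical and do not come from the patch.

```java
public class ShiftAddSketch {
    // a*6 written as a*4 + a*2, the form mentioned in the thread.
    static int mul6(int a) {
        return (a << 2) + (a << 1);
    }

    // a*18 written as a*16 + a*2, the form used by the testInt18 benchmarks.
    static int mul18(int a) {
        return (a << 4) + (a << 1);
    }

    // True when both shift-add forms match plain multiplication for this input.
    static boolean agrees(int a) {
        return mul6(a) == a * 6 && mul18(a) == a * 18;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 7, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int a : samples) {
            if (!agrees(a)) {
                throw new AssertionError("mismatch for a = " + a);
            }
        }
        System.out.println("shift-add forms agree with a*6 and a*18 on all samples");
    }
}
```

Because Java `int` arithmetic wraps modulo 2^32 for both shifts and multiplication, the identity holds for every input. Whether the compiler should prefer one form is purely a cost question, which is what the benchmark numbers address.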
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906240924 PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906259743 From epeter at openjdk.org Mon Feb 10 21:25:02 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 10 Feb 2025 21:25:02 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() In-Reply-To: References: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> Message-ID: On Wed, 8 Jan 2025 02:06:11 GMT, erifan wrote: >> Is this an improvement on aarch64 for all implementations? What about x64? > > If `a*6` is in a loop and can be vectorized, there may be big performance improvement. If it's not vectorized, there may be some small performance loss. See the test results of `a*18 = (a<<4) + (a<<1)`, (same with `a*6 = (a<<2) + (a<<1)`) in three different cases: > > > Benchmark V2-now V2-after Uplift Genoa-now Genoa-after Uplift Notes > testInt18 98.90 102.94 0.96 142.48 140.75 1.01 scalar > testInt18AddSum 68.70 48.10 1.42 26.88 16.78 1.6 vectorized > testInt18Store 41.31 43.39 0.95 21.23 20.88 1.01 vectorized > > > We can see that for scalar case the conversion from `a*6 => (a<<2) + (a<<1)` is profitable on aarch64, I have a follow up patch to reimplement this pattern in aarch64 backend, I'll file it later. But for x64, there is no obvious performance change whether or not to do this conversion. So this is also why I leave a TODO in [mulnode.cpp](https://github.com/openjdk/jdk/pull/22922/files/193dc4e5760007784cffd64ef14e0050b0be92b3#diff-b1bd52f0743843e15452764f48ff43c15dd3192a28bfb684b34149f0e964996e) Benchmark V2-now V2-after Uplift Genoa-now Genoa-after Uplift Notes testInt18 98.90 102.94 0.96 142.48 140.75 1.01 scalar Ok, that would be a 4% regression on V2. It is not much, but still possibly relevant. I think I would need to see a clear strategy that we can actually pull off. 
Otherwise it may be that you introduce a regression here, that then nobody gets around to fixing later. >> Also: this is very confusing: why does the result differ depending on `AlignVector`? > > Without this patch, since the VPointer issue you mentioned this loop does not vectorize at all. > With this patch, `i*6` is not changed to `(i<<2) + (i<<1)`, so the VPointer issue is bypassed. So if `AlignVector` is false, this loop will be vectorized. > >> why does the result differ depending on `AlignVector` ? > > Because this loop operates on a discontinuous and unaligned array address space. If we require aligned vector (that is `AlignVector` is true), this loop will not be vectorized, otherwise it will be vectorized. Right, that makes sense. I actually thought about that too after work yesterday evening. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906565583 PR Review Comment: https://git.openjdk.org/jdk/pull/22922#discussion_r1906561384 From psandoz at openjdk.org Mon Feb 10 21:26:25 2025 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 10 Feb 2025 21:26:25 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 10:05:09 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class.
>> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Fixing typos An impressive and substantial change. I focused on the Java code, there are some small tweaks, presented in comments, we can make to the intrinsics to improve the expression of code, and it has no impact on the intrinsic implementation. src/java.base/share/classes/jdk/internal/vm/vector/Float16Math.java line 32: > 30: * The class {@code Float16Math} constains intrinsic entry points corresponding > 31: * to scalar numeric operations defined in Float16 class. > 32: * @since 25 You can remove this line, since this is an internal class. src/java.base/share/classes/jdk/internal/vm/vector/Float16Math.java line 38: > 36: } > 37: > 38: public interface Float16UnaryMathOp { You can just use `UnaryOperator`, no need for a new type, here are the updated methods you can apply to this class. 
    @FunctionalInterface
    public interface TernaryOperator<T> {
        T apply(T a, T b, T c);
    }

    @IntrinsicCandidate
    public static <T> T sqrt(Class box_class, T oa, UnaryOperator<T> defaultImpl) {
        assert isNonCapturingLambda(defaultImpl) : defaultImpl;
        return defaultImpl.apply(oa);
    }

    @IntrinsicCandidate
    public static <T> T fma(Class box_class, T oa, T ob, T oc, TernaryOperator<T> defaultImpl) {
        assert isNonCapturingLambda(defaultImpl) : defaultImpl;
        return defaultImpl.apply(oa, ob, oc);
    }

    static boolean isNonCapturingLambda(Object o) {
        return o.getClass().getDeclaredFields().length == 0;
    }

And in `src/hotspot/share/classfile/vmIntrinsics.hpp`:

    /* Float16Math API intrinsification support */ \
    /* Float16 signatures */ \
    do_signature(float16_unary_math_op_sig, "(Ljava/lang/Class;" \
                                            "Ljava/lang/Object;" \
                                            "Ljava/util/function/UnaryOperator;)" \
                                            "Ljava/lang/Object;") \
    do_signature(float16_ternary_math_op_sig, "(Ljava/lang/Class;" \
                                              "Ljava/lang/Object;" \
                                              "Ljava/lang/Object;" \
                                              "Ljava/lang/Object;" \
                                              "Ljdk/internal/vm/vector/Float16Math$TernaryOperator;)" \
                                              "Ljava/lang/Object;") \
    do_intrinsic(_sqrt_float16, jdk_internal_vm_vector_Float16Math, sqrt_name, float16_unary_math_op_sig, F_S) \
    do_intrinsic(_fma_float16, jdk_internal_vm_vector_Float16Math, fma_name, float16_ternary_math_op_sig, F_S) \

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Float16.java line 1202:

> 1200: */
> 1201: public static Float16 sqrt(Float16 radicand) {
> 1202: return (Float16) Float16Math.sqrt(Float16.class, radicand,

With changes to the intrinsics (as presented in another comment) you no longer need explicit casts and the code is precisely the same as before except embedded in a lambda body:

    public static Float16 sqrt(Float16 radicand) {
        return Float16Math.sqrt(Float16.class, radicand,
            (_radicand) -> {
                // Rounding path of sqrt(Float16 -> double) -> Float16 is fine
                // for preserving the correct final value. The conversion
                // Float16 -> double preserves the exact numerical value. The
                // conversion of double -> Float16 also benefits from the
                // 2p+2 property of IEEE 754 arithmetic.
                return valueOf(Math.sqrt(_radicand.doubleValue()));
            }
        );
    }

Similarly for `fma`:

    return Float16Math.fma(Float16.class, a, b, c,
        (_a, _b, _c) -> {
            // product is numerically exact in float before the cast to
            // double; not necessary to widen to double before the
            // multiply.
            double product = (double)(_a.floatValue() * _b.floatValue());
            return valueOf(product + _c.doubleValue());
        });

test/jdk/jdk/incubator/vector/ScalarFloat16OperationsTest.java line 44:

> 42: import static jdk.incubator.vector.Float16.*;
> 43:
> 44: public class ScalarFloat16OperationsTest {

Now that we have IR tests do you still think this test is necessary or should we have more IR test instead? @eme64 thoughts? We could follow up in another PR if need be.

------------- PR Review: https://git.openjdk.org/jdk/pull/22754#pullrequestreview-2607094727 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1949842011 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1949871647 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1949847574 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1949858554 From sviswanathan at openjdk.org Tue Feb 11 02:46:10 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 11 Feb 2025 02:46:10 GMT Subject: RFR: 8349579: jsvml.dll incorrect RDATA SEGMENT specification In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 21:34:39 GMT, Mohamed Issa wrote: > A fix for incorrectly defined program segments in Windows SVML assembly. > > - Changes _READ_ to _READONLY_ in all math functions > - Now compliant with MASM x86 and x86_64 program segment [specification](https://learn.microsoft.com/en-us/cpp/assembler/masm/segment?view=msvc-170) > > The tier1 tests show the changes didn't introduce new failures. Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23503#pullrequestreview-2607579720 From xgong at openjdk.org Tue Feb 11 03:02:14 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 11 Feb 2025 03:02:14 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: On Wed, 8 Jan 2025 09:04:47 GMT, Nicole Xu wrote: > Suite MaskedLogicOpts.maskedLogicOperationsLong512() failed on both x86 and AArch64 with the following error: > > > java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249 > > > The variable `long256_arr_idx` is misused when indexing 'LongVector l2, l3, l4, l5' in function `maskedLogicOperationsLongKernel()`. 'long256_arr_idx' increases by 4 every time the benchmark runs and ensures the incremented value remains within the bounds of the array. However, for `LongVector.SPECIES_512`, it loads 8 numbers from the array each time the benchmark runs, resulting in an out-of-range indexing issue. > > Hence, we revised the index variables from `long256_arr_idx` to `long512_arr_idx`, which has a stride of 8, to ensure that the loaded vector is inside of the array boundary for all vector species. This is also consistent with other kernel functions. > > Additionally, some defined but unused variables have been removed. Looks good to me. Thanks for the fixing! ------------- Marked as reviewed by xgong (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/22963#pullrequestreview-2607593816 From duke at openjdk.org Tue Feb 11 03:03:29 2025 From: duke at openjdk.org (erifan) Date: Tue, 11 Feb 2025 03:03:29 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() [v2] In-Reply-To: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> References: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> Message-ID: > Constant multiplication x*C can be optimized as LEFT SHIFT, ADD or SUB instructions since generally these instructions have smaller latency and larger throughput on most architectures. For example: > 1. x*8 can be optimized as x<<3. > 2. x*9 can be optimized as x+x<<3, and x+x<<3 can be lowered as one SHIFT-ADD (ADD instruction combined with LEFT SHIFT) instruction on some architectures, like aarch64 and x86_64. > > Currently OpenJDK implemented a few such patterns in mid-end, including: > 1. |C| = 1<0) > 2. |C| = (1<0) > 3. |C| = (1<n, n>=0) > > The first two are ok. Because on most architectures they are lowered as only one ADD/SUB/SHIFT instruction. > > But the third pattern doesn't always perform well on some architectures, such as aarch64. The third pattern can be split as the following sub patterns: > 3.1. C = (1<0) > 3.2. C = -((1<0) > 3.3. C = (1<n, n>0) > 3.4. C = -((1<n, n>0) > > According to Arm optimization guide, if the shift amount > 4, the latency and throughput of ADD instruction is the same with MUL instruction. So in this case, converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example. 
> > Before (MUL is not converted): > > mov x1, #C > mul x2, x1, x0 > > > Now (MUL is converted): > For 3.1: > > add x2, x0, x0, lsl #n > > > For 3.2: > > add x2, x0, x0, lsl #n // same cost with mul if n > 4 > neg x2, x2 > > > For 3.3: > > lsl x1, x0, #m > add x2, x1, x0, lsl #n // same cost with mul if n > 4 > > > For 3.4: > > lsl x1, x0, #m > add x2, x1, x0, lsl #n // same cost with mul if n > 4 > neg x2, x2 > > > Test results (ns/op) on Arm Neoverse V2: > > Before Now Uplift Pattern Notes > testInt9 103.379 60.702 1.70305756 3.1 > testIntN33 103.231 106.825 0.96635619 3.2 n > 4 > testIntN9 103.448 103.005 1.004300762 3.2 n <= 4 > testInt18 103.354 99.271 1.041129837 3.3 m <= 4, n <= 4 > testInt36 103.396 99.186 1.042445506 3.3 m > 4, n <= 4 > testInt96 103.337 105.416 0.980278136 3.3 m > 4, n > 4 > testIntN18 103.333 139.258 0.742025593 3.4 m <= 4, n <= 4 > testIntN36 103.208 139.132 0.741799155 3.4 m > 4, n <= 4 > testIntN96 103.367 139.471 0.74113615 3.4 m > 4, n > 4 > > > **(S1) From this point on, we should treat pattern 3 as follows:** > 3.1 C = (1<0) > 3.2 C = -((1< 3.3 C = (1<n, 0 3.4 C = -((1< > Since this conversion is implemented in mid-end, it impacts... erifan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: - Merge branch 'master' into JDK-8346964 Change-Id: Ib47ed4f9c6d69326a0b7cb8ba7c29f604b8fc1ec - 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() Constant multiplication x*C can be optimized as LEFT SHIFT, ADD or SUB instructions since generally these instructions have smaller latency and larger throughput on most architectures. For example: 1. x*8 can be optimized as x<<3. 2. x*9 can be optimized as x+x<<3, and x+x<<3 can be lowered as one SHIFT-ADD (ADD instruction combined with LEFT SHIFT) instruction on some architectures, like aarch64 and x86_64. Currently OpenJDK implemented a few such patterns in mid-end, including: 1. |C| = 1<0) 2. 
|C| = (1<0) 3. |C| = (1<n, n>=0) The first two are ok. Because on most architectures they are lowered as only one ADD/SUB/SHIFT instruction. But the third pattern doesn't always perform well on some architectures, such as aarch64. The third pattern can be split as the following sub patterns: 3.1. C = (1<0) 3.2. C = -((1<0) 3.3. C = (1<n, n>0) 3.4. C = -((1<n, n>0) According to Arm optimization guide, if the shift amount > 4, the latency and throughput of ADD instruction is the same with MUL instruction. So in this case, converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example. Before (MUL is not converted): ``` mov x1, #C mul x2, x1, x0 ``` Now (MUL is converted): For 3.1: ``` add x2, x0, x0, lsl #n ``` For 3.2: ``` add x2, x0, x0, lsl #n // same cost with mul if n > 4 neg x2, x2 ``` For 3.3: ``` lsl x1, x0, #m add x2, x1, x0, lsl #n // same cost with mul if n > 4 ``` For 3.4: ``` lsl x1, x0, #m add x2, x1, x0, lsl #n // same cost with mul if n > 4 neg x2, x2 ``` Test results (ns/op) on Arm Neoverse V2: ``` Before Now Uplift Pattern Notes testInt9 103.379 60.702 1.70305756 3.1 testIntN33 103.231 106.825 0.96635619 3.2 n > 4 testIntN9 103.448 103.005 1.004300762 3.2 n <= 4 testInt18 103.354 99.271 1.041129837 3.3 m <= 4, n <= 4 testInt36 103.396 99.186 1.042445506 3.3 m > 4, n <= 4 testInt96 103.337 105.416 0.980278136 3.3 m > 4, n > 4 testIntN18 103.333 139.258 0.742025593 3.4 m <= 4, n <= 4 testIntN36 103.208 139.132 0.741799155 3.4 m > 4, n <= 4 testIntN96 103.367 139.471 0.74113615 3.4 m > 4, n > 4 ``` **(S1) From this point on, we should treat pattern 3 as follows:** 3.1 C = (1<0) 3.2 C = -((1<n, 00) 3.2 C = -((1<n, n>0) 3.4 C = -((1<0, 1.7) (disable, 0.75) (n>0, 1.3) 3.2 (0n, 0n, n>0, 1.03) 3.4 (disable, 0.74) (disable, 0.30) (disable, 0.74) For 3.1, it's similar with pattern 2, usually be lowered as only one instruction, so we tend to keep it in mid-end. 
For 3.2, we tend to disable it in mid-end, and do S1 in back-end if it's profitable. For 3.3, although S3 has 3% performance gain, but S2 has 31% performance regression. So we tend to disable it in mid-end and redo S1 in back-end. For 3.4, we shouldn't do this optimization anywhere. In theory, auto-vectorization should be able to generate the best vectorized code, and cases that cannot be vectorized will be converted into other more optimal scalar instructions in the architecture backend (this is what gcc and llvm do). However, we currently do not have a cost model and vplan, and the results of auto-vectorization are significantly affected by its input. Therefore, this patch turns off pattern 3.2, 3.3 and 3.4 in mid-end. Then if it's profitable, implement these patterns in the backend. If we implement a cost model and vplan in the future, it is best to move all patterns to the backend, this patch does not conflict with this direction. I also tested this patch on Arm N1, Intel SPR and AMD Genoa machines, No noticeable performance degradation was seen on any of the machines. 
Here are the test results on an Arm V2 and an AMD Genoa machine: ``` Benchmark V2-now V2-after Uplift Genoa-now Genoa-after Uplift Pattern Notes testInt8 60.36989 60.276736 1 116.768294 116.772547 0.99 1 testInt8AddSum 63.658064 63.797732 0.99 16.04973 16.051491 0.99 1 testInt8Store 38.829618 39.054129 0.99 19.857453 20.006321 0.99 1 testIntN8 59.99655 60.150053 0.99 132.269926 132.252473 1 1 testIntN8AddSum 145.678098 146.181549 0.99 158.546226 158.806476 0.99 1 testIntN8Store 32.802445 32.897907 0.99 19.047873 19.065941 0.99 1 testInt7 98.978213 99.176574 0.99 114.07026 113.08989 1 2 testInt7AddSum 62.675636 62.310799 1 23.370851 20.971655 1.11 2 testInt7Store 32.850828 32.923315 0.99 23.884952 23.628681 1.01 2 testIntN7 60.27949 60.668158 0.99 174.224893 174.102295 1 2 testIntN7AddSum 62.746696 62.288476 1 20.93192 20.964557 0.99 2 testIntN7Store 32.812906 32.851355 0.99 23.810024 23.526074 1.01 2 testInt9 60.820402 60.331938 1 108.850777 108.846161 1 3.1 testInt9AddSum 62.24679 62.374637 0.99 20.698749 20.741137 0.99 3.1 testInt9Store 32.871723 32.912065 0.99 19.055537 19.080735 0.99 3.1 testIntN33 106.517618 103.450746 1.02 153.894345 140.641135 1.09 3.2 n > 4 testIntN33AddSum 147.589815 47.911612 3.08 153.851885 17.008453 9.04 3.2 testIntN33Store 75.434513 43.473053 1.73 26.612181 20.436323 1.3 3.2 testIntN9 102.173268 103.70682 0.98 155.858169 140.718967 1.1 3.2 n <= 4 testIntN9AddSum 148.724952 47.963305 3.1 186.902111 20.249414 9.23 3.2 testIntN9Store 74.783788 43.339188 1.72 20.150159 20.888448 0.96 3.2 testInt18 98.905625 102.942092 0.96 142.480636 140.748778 1.01 3.3 m <= 4, n <= 4 testInt18AddSum 68.695585 48.103536 1.42 26.88524 16.77886 1.6 3.3 testInt18Store 41.307909 43.385183 0.95 21.233238 20.875026 1.01 3.3 testInt36 99.039742 103.714745 0.95 142.265806 142.334039 0.99 3.3 m > 4, n <= 4 testInt36AddSum 68.736756 47.952189 1.43 26.868362 17.030035 1.57 3.3 testInt36Store 41.403698 43.414093 0.95 21.225454 20.52266 1.03 3.3 testInt96 105.00287 
103.528144 1.01 237.649526 140.643255 1.68 3.3 m > 4, n > 4 testInt96AddSum 68.481133 48.04549 1.42 26.877407 16.918209 1.58 3.3 testInt96Store 41.276292 43.512994 0.94 23.456117 20.540181 1.14 3.3 testIntN18 138.629044 103.269657 1.34 210.315628 140.716818 1.49 3.4 m <= 4, n <= 4 testIntN18AddSum 156.635652 48.003989 3.26 215.807135 16.917665 12.75 3.4 testIntN18Store 57.584487 43.410415 1.32 26.819827 20.707778 1.29 3.4 testIntN36 139.068861 103.766774 1.34 209.522432 140.720322 1.48 3.4 m > 4, n <= 4 testIntN36AddSum 156.36928 48.027779 3.25 215.705842 16.893192 12.76 3.4 testIntN36Store 57.715418 43.493958 1.32 21.651252 20.676877 1.04 3.4 testIntN96 139.151761 103.453665 1.34 269.254161 140.753499 1.91 3.4 m > 4, n > 4 testIntN96AddSum 153.123557 48.110524 3.18 263.262635 17.011144 15.47 3.4 testIntN96Store 57.793179 43.47574 1.32 24.444592 20.530219 1.19 3.4 ``` limitations: 1, This patch only analyzes two vector cases, there may be other vector cases that may get performance regression with this patch. 2, This patch does not implement the disabled patterns in the backend, I will propose a follow-up patch to implement these patterns in the aarch64 backend. 3, This patch does not handle the long type, because different architectures have different auto-vectorization support for long type, resulting in very different performance, and it is difficult to find a solution that does not introduce significant performance degradation. 
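The Pattern column in the table above can be sketched as a small classifier over the constant `C`. This is an illustration only, not the code in `MulINode::Ideal()`. The pattern definitions are garbled in this archive, so the sketch assumes the shapes implied by the benchmark names and their Pattern labels: `testInt8`/`testIntN8` are pattern 1, `testInt7`/`testIntN7` pattern 2, `testInt9` 3.1, `testIntN33` 3.2, `testInt18` 3.3, and `testIntN18` 3.4. The names `MulPatternSketch` and `classify` are hypothetical.

```java
public class MulPatternSketch {
    // Hypothetical helper: returns the assumed pattern label for a constant multiplier c.
    static String classify(int c) {
        long abs = Math.abs((long) c);            // widen first so Integer.MIN_VALUE is handled
        if (abs < 2) {
            return "none";                        // 0 and +/-1 need no shift/add rewrite
        }
        if (Long.bitCount(abs) == 1) {
            return "1";                           // assumed: |C| is a single power of two
        }
        if (Long.bitCount(abs + 1) == 1) {
            return "2";                           // assumed: |C| is a power of two minus one
        }
        if (Long.bitCount(abs) == 2) {            // assumed: |C| is a sum of two powers of two
            boolean lowBitSet = (abs & 1) == 1;   // the smaller power is 2^0
            if (lowBitSet) {
                return c > 0 ? "3.1" : "3.2";
            }
            return c > 0 ? "3.3" : "3.4";
        }
        return "none";
    }

    public static void main(String[] args) {
        int[] samples = {8, -8, 7, -7, 9, -33, -9, 18, 36, 96, -18, -36, -96};
        for (int c : samples) {
            System.out.println(c + " -> " + classify(c));
        }
    }
}
```

Under these assumptions the labels reproduce the Pattern column for every constant benchmarked above. The order of the checks matters: 3 is both a power of two minus one and a sum of two powers of two, and the sketch reports it as pattern 2.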
------------- Changes: https://git.openjdk.org/jdk/pull/22922/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22922&range=01 Stats: 706 lines in 5 files changed: 680 ins; 14 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/22922.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22922/head:pull/22922 PR: https://git.openjdk.org/jdk/pull/22922 From duke at openjdk.org Tue Feb 11 03:26:56 2025 From: duke at openjdk.org (erifan) Date: Tue, 11 Feb 2025 03:26:56 GMT Subject: RFR: 8346964: C2: Improve integer multiplication with constant in MulINode::Ideal() [v3] In-Reply-To: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> References: <4UC1x1GPJCcIwPXKJZfiUGxQnuRaDQjOcN53wYmUzF4=.fafd71c1-2f48-4ae4-8e7e-8844c578429a@github.com> Message-ID: > Constant multiplication x*C can be optimized as LEFT SHIFT, ADD or SUB instructions since generally these instructions have smaller latency and larger throughput on most architectures. For example: > 1. x*8 can be optimized as x<<3. > 2. x*9 can be optimized as x+x<<3, and x+x<<3 can be lowered as one SHIFT-ADD (ADD instruction combined with LEFT SHIFT) instruction on some architectures, like aarch64 and x86_64. > > Currently OpenJDK implemented a few such patterns in mid-end, including: > 1. |C| = 1<0) > 2. |C| = (1<0) > 3. |C| = (1<n, n>=0) > > The first two are ok. Because on most architectures they are lowered as only one ADD/SUB/SHIFT instruction. > > But the third pattern doesn't always perform well on some architectures, such as aarch64. The third pattern can be split as the following sub patterns: > 3.1. C = (1<0) > 3.2. C = -((1<0) > 3.3. C = (1<n, n>0) > 3.4. C = -((1<n, n>0) > > According to Arm optimization guide, if the shift amount > 4, the latency and throughput of ADD instruction is the same with MUL instruction. So in this case, converting MUL to ADD is not profitable. Take a[i] * C on aarch64 as an example. 
> > Before (MUL is not converted): > > mov x1, #C > mul x2, x1, x0 > > > Now (MUL is converted): > For 3.1: > > add x2, x0, x0, lsl #n > > > For 3.2: > > add x2, x0, x0, lsl #n // same cost with mul if n > 4 > neg x2, x2 > > > For 3.3: > > lsl x1, x0, #m > add x2, x1, x0, lsl #n // same cost with mul if n > 4 > > > For 3.4: > > lsl x1, x0, #m > add x2, x1, x0, lsl #n // same cost with mul if n > 4 > neg x2, x2 > > > Test results (ns/op) on Arm Neoverse V2: > > Before Now Uplift Pattern Notes > testInt9 103.379 60.702 1.70305756 3.1 > testIntN33 103.231 106.825 0.96635619 3.2 n > 4 > testIntN9 103.448 103.005 1.004300762 3.2 n <= 4 > testInt18 103.354 99.271 1.041129837 3.3 m <= 4, n <= 4 > testInt36 103.396 99.186 1.042445506 3.3 m > 4, n <= 4 > testInt96 103.337 105.416 0.980278136 3.3 m > 4, n > 4 > testIntN18 103.333 139.258 0.742025593 3.4 m <= 4, n <= 4 > testIntN36 103.208 139.132 0.741799155 3.4 m > 4, n <= 4 > testIntN96 103.367 139.471 0.74113615 3.4 m > 4, n > 4 > > > **(S1) From this point on, we should treat pattern 3 as follows:** > 3.1 C = (1<0) > 3.2 C = -((1< 3.3 C = (1<n, 0 3.4 C = -((1< > Since this conversion is implemented in mid-end, it impacts... 
erifan has updated the pull request incrementally with one additional commit since the last revision: Merge master ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22922/files - new: https://git.openjdk.org/jdk/pull/22922/files/011c6101..37d0d5ef Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22922&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22922&range=01-02 Stats: 258 lines in 1 file changed: 0 ins; 258 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/22922.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22922/head:pull/22922 PR: https://git.openjdk.org/jdk/pull/22922 From duke at openjdk.org Tue Feb 11 03:44:27 2025 From: duke at openjdk.org (Mohamed Issa) Date: Tue, 11 Feb 2025 03:44:27 GMT Subject: RFR: 8349579: jsvml.dll incorrect RDATA SEGMENT specification [v2] In-Reply-To: References: Message-ID: > A fix for incorrectly defined program segments in Windows SVML assembly. > > - Changes _READ_ to _READONLY_ in all math functions > - Now compliant with MASM x86 and x86_64 program segment [specification](https://learn.microsoft.com/en-us/cpp/assembler/masm/segment?view=msvc-170) > > The tier1 tests show the changes didn't introduce new failures. 
Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: Update full name ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23503/files - new: https://git.openjdk.org/jdk/pull/23503/files/6d8e8a5d..f29de5c1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23503&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23503&range=00-01 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23503.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23503/head:pull/23503 PR: https://git.openjdk.org/jdk/pull/23503 From jbhateja at openjdk.org Tue Feb 11 06:32:56 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 11 Feb 2025 06:32:56 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID: > Hi All, > > This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) > > Following is the summary of changes included with this patch:- > > 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. > 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. > 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. > - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. > 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. > 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. > 7. 
Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. > 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF > 9. X86 backend implementation for all supported intrinsics. > 10. Functional and Performance validation tests. > > Kindly review the patch and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolutions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22754/files - new: https://git.openjdk.org/jdk/pull/22754/files/82a42213..111c8084 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22754&range=17 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22754&range=16-17 Stats: 38 lines in 3 files changed: 2 ins; 11 del; 25 mod Patch: https://git.openjdk.org/jdk/pull/22754.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22754/head:pull/22754 PR: https://git.openjdk.org/jdk/pull/22754 From jbhateja at openjdk.org Tue Feb 11 06:32:56 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 11 Feb 2025 06:32:56 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: On Mon, 10 Feb 2025 20:43:19 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Fixing typos > > test/jdk/jdk/incubator/vector/ScalarFloat16OperationsTest.java line 44: > >> 42: import static jdk.incubator.vector.Float16.*; >> 43: >> 44: public class ScalarFloat16OperationsTest { > > Now that we have IR tests do you still think this test is necessary or should we 
have more IR test instead? @eme64 thoughts? We could follow up in another PR if need be. Hi Paul, The DataProviders used in this functional validation test exercise each newly added Float16 operation over the entire value range, while our IR tests are more directed towards validating the newly added IR transforms and constant folding scenarios. We have a follow-up PR for auto-vectorizing Float16 operations, which can be used to close any validation gap. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1950290083 From gcao at openjdk.org Tue Feb 11 07:34:41 2025 From: gcao at openjdk.org (Gui Cao) Date: Tue, 11 Feb 2025 07:34:41 GMT Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic Message-ID: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> Following the patch https://github.com/openjdk/jdk/pull/22491, this is the RISC-V implementation of the Class.isInstance intrinsic.

### JMH numbers (tested on milkv megrez with hotspot client build):

#### before this patch:

Benchmark                             Mode  Cnt    Score   Error  Units
SecondarySupersLookup.testNegative00  avgt   15   48.589 ± 0.981  ns/op
SecondarySupersLookup.testNegative01  avgt   15   48.577 ± 0.297  ns/op
SecondarySupersLookup.testNegative02  avgt   15   48.760 ± 0.740  ns/op
SecondarySupersLookup.testNegative03  avgt   15   48.442 ± 0.029  ns/op
SecondarySupersLookup.testNegative04  avgt   15   48.453 ± 0.095  ns/op
SecondarySupersLookup.testNegative05  avgt   15   48.435 ± 0.025  ns/op
SecondarySupersLookup.testNegative06  avgt   15   48.540 ± 0.476  ns/op
SecondarySupersLookup.testNegative07  avgt   15   48.452 ± 0.032  ns/op
SecondarySupersLookup.testNegative08  avgt   15   48.466 ± 0.034  ns/op
SecondarySupersLookup.testNegative09  avgt   15   48.478 ± 0.132  ns/op
SecondarySupersLookup.testNegative10  avgt   15   48.435 ± 0.032  ns/op
SecondarySupersLookup.testNegative16  avgt   15   48.440 ± 0.027  ns/op
SecondarySupersLookup.testNegative20  avgt   15   47.977 ± 0.989  ns/op
SecondarySupersLookup.testNegative30  avgt   15   48.655 ± 0.487  ns/op
SecondarySupersLookup.testNegative32  avgt   15   48.566 ± 0.251  ns/op
SecondarySupersLookup.testNegative40  avgt   15   48.513 ± 0.196  ns/op
SecondarySupersLookup.testNegative50  avgt   15   48.454 ± 0.075  ns/op
SecondarySupersLookup.testNegative55  avgt   15   71.670 ± 1.632  ns/op
SecondarySupersLookup.testNegative56  avgt   15   70.923 ± 1.679  ns/op
SecondarySupersLookup.testNegative57  avgt   15   70.140 ± 0.048  ns/op
SecondarySupersLookup.testNegative58  avgt   15   70.473 ± 0.726  ns/op
SecondarySupersLookup.testNegative59  avgt   15   70.127 ± 0.022  ns/op
SecondarySupersLookup.testNegative60  avgt   15   82.525 ± 1.178  ns/op
SecondarySupersLookup.testNegative61  avgt   15   81.647 ± 0.758  ns/op
SecondarySupersLookup.testNegative62  avgt   15   82.347 ± 1.943  ns/op
SecondarySupersLookup.testNegative63  avgt   15  129.188 ± 1.550  ns/op
SecondarySupersLookup.testNegative64  avgt   15  130.274 ± 1.668  ns/op
SecondarySupersLookup.testPositive01  avgt   15   63.390 ± 0.222  ns/op
SecondarySupersLookup.testPositive02  avgt   15   63.435 ± 0.027  ns/op
SecondarySupersLookup.testPositive03  avgt   15   63.469 ± 0.080  ns/op
SecondarySupersLookup.testPositive04  avgt   15   63.896 ± 1.008  ns/op
SecondarySupersLookup.testPositive05  avgt   15   63.457 ± 0.035  ns/op
SecondarySupersLookup.testPositive06  avgt   15   63.235 ± 0.261  ns/op
SecondarySupersLookup.testPositive07  avgt   15   63.455 ± 0.022  ns/op
SecondarySupersLookup.testPositive08  avgt   15   63.672 ± 0.480  ns/op
SecondarySupersLookup.testPositive09  avgt   15   63.458 ± 0.028  ns/op
SecondarySupersLookup.testPositive10  avgt   15   63.365 ± 0.220  ns/op
SecondarySupersLookup.testPositive16  avgt   15   63.279 ± 0.278  ns/op
SecondarySupersLookup.testPositive20  avgt   15   63.239 ± 0.309  ns/op
SecondarySupersLookup.testPositive30  avgt   15   63.357 ± 0.238  ns/op
SecondarySupersLookup.testPositive32  avgt   15   63.902 ± 0.981  ns/op
SecondarySupersLookup.testPositive40  avgt   15   61.624 ± 1.551  ns/op
SecondarySupersLookup.testPositive50  avgt   15   63.347 ± 0.217  ns/op
SecondarySupersLookup.testPositive60  avgt   15   61.963 ± 0.848  ns/op
SecondarySupersLookup.testPositive63  avgt   15  114.361 ± 1.086  ns/op
SecondarySupersLookup.testPositive64  avgt   15  129.291 ± 0.228  ns/op
Finished running test 'micro:vm.lang.SecondarySupersLookup'

#### apply this patch:

Benchmark                             Mode  Cnt    Score   Error  Units
SecondarySupersLookup.testNegative00  avgt   15   29.739 ± 1.085  ns/op
SecondarySupersLookup.testNegative01  avgt   15   29.377 ± 0.704  ns/op
SecondarySupersLookup.testNegative02  avgt   15   30.091 ± 0.994  ns/op
SecondarySupersLookup.testNegative03  avgt   15   30.293 ± 0.743  ns/op
SecondarySupersLookup.testNegative04  avgt   15   30.305 ± 0.750  ns/op
SecondarySupersLookup.testNegative05  avgt   15   31.173 ± 1.826  ns/op
SecondarySupersLookup.testNegative06  avgt   15   30.355 ± 1.294  ns/op
SecondarySupersLookup.testNegative07  avgt   15   29.967 ± 1.481  ns/op
SecondarySupersLookup.testNegative08  avgt   15   30.003 ± 0.914  ns/op
SecondarySupersLookup.testNegative09  avgt   15   29.869 ± 0.947  ns/op
SecondarySupersLookup.testNegative10  avgt   15   29.764 ± 0.807  ns/op
SecondarySupersLookup.testNegative16  avgt   15   29.849 ± 0.922  ns/op
SecondarySupersLookup.testNegative20  avgt   15   29.331 ± 0.775  ns/op
SecondarySupersLookup.testNegative30  avgt   15   29.624 ± 0.894  ns/op
SecondarySupersLookup.testNegative32  avgt   15   29.981 ± 0.897  ns/op
SecondarySupersLookup.testNegative40  avgt   15   30.317 ± 1.399  ns/op
SecondarySupersLookup.testNegative50  avgt   15   30.643 ± 0.046  ns/op
SecondarySupersLookup.testNegative55  avgt   15   39.532 ± 1.233  ns/op
SecondarySupersLookup.testNegative56  avgt   15   39.051 ± 0.195  ns/op
SecondarySupersLookup.testNegative57  avgt   15   40.058 ± 1.319  ns/op
SecondarySupersLookup.testNegative58  avgt   15   39.070 ± 0.112  ns/op
SecondarySupersLookup.testNegative59  avgt   15   39.045 ± 0.101  ns/op
SecondarySupersLookup.testNegative60  avgt   15   47.358 ± 0.047  ns/op
SecondarySupersLookup.testNegative61  avgt   15   48.375 ± 1.612  ns/op
SecondarySupersLookup.testNegative62  avgt   15   47.392 ± 0.107  ns/op
SecondarySupersLookup.testNegative63  avgt   15  102.903 ±
0.471 ns/op SecondarySupersLookup.testNegative64 avgt 15 103.617 ? 0.071 ns/op SecondarySupersLookup.testPositive01 avgt 15 39.182 ? 0.305 ns/op SecondarySupersLookup.testPositive02 avgt 15 39.161 ? 0.313 ns/op SecondarySupersLookup.testPositive03 avgt 15 39.292 ? 0.304 ns/op SecondarySupersLookup.testPositive04 avgt 15 38.955 ? 0.394 ns/op SecondarySupersLookup.testPositive05 avgt 15 39.074 ? 0.242 ns/op SecondarySupersLookup.testPositive06 avgt 15 39.411 ? 0.979 ns/op SecondarySupersLookup.testPositive07 avgt 15 38.907 ? 0.945 ns/op SecondarySupersLookup.testPositive08 avgt 15 38.956 ? 0.013 ns/op SecondarySupersLookup.testPositive09 avgt 15 38.843 ? 0.242 ns/op SecondarySupersLookup.testPositive10 avgt 15 38.763 ? 0.276 ns/op SecondarySupersLookup.testPositive16 avgt 15 39.396 ? 0.476 ns/op SecondarySupersLookup.testPositive20 avgt 15 39.187 ? 0.291 ns/op SecondarySupersLookup.testPositive30 avgt 15 39.056 ? 0.242 ns/op SecondarySupersLookup.testPositive32 avgt 15 39.180 ? 0.305 ns/op SecondarySupersLookup.testPositive40 avgt 15 38.521 ? 0.255 ns/op SecondarySupersLookup.testPositive50 avgt 15 38.957 ? 0.007 ns/op SecondarySupersLookup.testPositive60 avgt 15 38.415 ? 0.020 ns/op SecondarySupersLookup.testPositive63 avgt 15 88.788 ? 1.223 ns/op SecondarySupersLookup.testPositive64 avgt 15 104.534 ? 
1.687 ns/op ### Testing - [x] Run tier1-3 tests on SOPHON SG2042 (release) - [x] Run tier1-3 tests on MILK-V MEGREZ (release) ------------- Commit messages: - Polishing - RISC-V: C1: Improve Class.isInstance intrinsic Changes: https://git.openjdk.org/jdk/pull/23551/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23551&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349764 Stats: 46 lines in 2 files changed: 45 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23551.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23551/head:pull/23551 PR: https://git.openjdk.org/jdk/pull/23551 From rrich at openjdk.org Tue Feb 11 07:48:15 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 11 Feb 2025 07:48:15 GMT Subject: Integrated: 8348678: [PPC64] C2: unaligned vector load/store is ok In-Reply-To: References: Message-ID: On Mon, 27 Jan 2025 16:04:33 GMT, Richard Reingruber wrote: > This pr changes `Matcher::misaligned_vectors_ok` to return `true` on PPC64 for better vectorization during `SuperWord`. > IR checking of the corresponding test `TestCastX2NotProcessedIGVN.java` is also enabled. > > Tested with `TestCastX2NotProcessedIGVN.java` > > The change passed our CI testing: > Tier 1-4 of hotspot and jdk. All of langtools and jaxp. Renaissance Suite and SAP specific tests. > Testing was done on the main platforms and also on Linux/PPC64le and AIX. This pull request has now been integrated. 
Changeset: 1a8212e1 Author: Richard Reingruber URL: https://git.openjdk.org/jdk/commit/1a8212e1018744b360df310e85fc29f8c41f5072 Stats: 12 lines in 3 files changed: 5 ins; 0 del; 7 mod 8348678: [PPC64] C2: unaligned vector load/store is ok 8343906: test2 of compiler/c2/TestCastX2NotProcessedIGVN.java fails on some platforms Reviewed-by: mdoerr, amitkumar ------------- PR: https://git.openjdk.org/jdk/pull/23318 From rrich at openjdk.org Tue Feb 11 07:48:15 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 11 Feb 2025 07:48:15 GMT Subject: RFR: 8348678: [PPC64] C2: unaligned vector load/store is ok [v4] In-Reply-To: References: Message-ID: On Wed, 29 Jan 2025 22:05:04 GMT, Richard Reingruber wrote: >> This pr changes `Matcher::misaligned_vectors_ok` to return `true` on PPC64 for better vectorization during `SuperWord`. >> IR checking of the corresponding test `TestCastX2NotProcessedIGVN.java` is also enabled. >> >> Tested with `TestCastX2NotProcessedIGVN.java` >> >> The change passed our CI testing: >> Tier 1-4 of hotspot and jdk. All of langtools and jaxp. Renaissance Suite and SAP specific tests. >> Testing was done on the main platforms and also on Linux/PPC64le and AIX. > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Copyright header Thanks for the reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23318#issuecomment-2650037850 From jbhateja at openjdk.org Tue Feb 11 07:53:09 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 11 Feb 2025 07:53:09 GMT Subject: RFR: 8349579: jsvml.dll incorrect RDATA SEGMENT specification [v2] In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 03:44:27 GMT, Mohamed Issa wrote: >> A fix for incorrectly defined program segments in Windows SVML assembly. 
>>
>> - Changes _READ_ to _READONLY_ in all math functions
>> - Now compliant with MASM x86 and x86_64 program segment [specification](https://learn.microsoft.com/en-us/cpp/assembler/masm/segment?view=msvc-170)
>>
>> The tier1 tests show the changes didn't introduce new failures.
>
> Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision:
>
>   Update full name

dumpbin shows no difference in section attributes with and without the patch, but it's good to comply with the specification.

```
dumpbin /nologo /SECTION:_RDATA jsvml.dll
40000040 flags
         Initialized Data
         Read Only
```

-------------

Marked as reviewed by jbhateja (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/23503#pullrequestreview-2607929875

From jwaters at openjdk.org Tue Feb 11 08:07:11 2025
From: jwaters at openjdk.org (Julian Waters)
Date: Tue, 11 Feb 2025 08:07:11 GMT
Subject: RFR: 8349579: jsvml.dll incorrect RDATA SEGMENT specification [v2]
In-Reply-To:
References:
Message-ID:

On Tue, 11 Feb 2025 03:44:27 GMT, Mohamed Issa wrote:

>> A fix for incorrectly defined program segments in Windows SVML assembly.
>>
>> - Changes _READ_ to _READONLY_ in all math functions
>> - Now compliant with MASM x86 and x86_64 program segment [specification](https://learn.microsoft.com/en-us/cpp/assembler/masm/segment?view=msvc-170)
>>
>> The tier1 tests show the changes didn't introduce new failures.
>
> Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision:
>
>   Update full name

The Microsoft toolchain still creates a literal section named _RDATA even with this change? Guess it's a fault in the toolchain and nothing to do with not conforming to the Microsoft docs then.
Still, like you said, it's good to follow what Microsoft says in their docs

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23503#issuecomment-2650067731

From rcastanedalo at openjdk.org Tue Feb 11 08:20:14 2025
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Tue, 11 Feb 2025 08:20:14 GMT
Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v13]
In-Reply-To: <-KC83AQUZqpnDH0lG1yvKaVUV3H5kSN8cQLU1x4e06o=.dbe58228-89fe-4128-9cdb-3953d9215d59@github.com>
References: <-KC83AQUZqpnDH0lG1yvKaVUV3H5kSN8cQLU1x4e06o=.dbe58228-89fe-4128-9cdb-3953d9215d59@github.com>
Message-ID:

On Mon, 10 Feb 2025 15:53:22 GMT, Daniel Lundén wrote:

> It does look like the problematic memory subgraph results due to loop peeling

OK, that sounds promising! Maybe it is indeed possible to make peeling/cloning maintain our invariant right from the start, and hope (and verify) it is not broken by other transformations. Up to you whether to integrate this point fix and continue your investigation separately or wait until you have explored along this line before integration.

> Sounds like a great idea, but I think we need to discuss the details further first. It is not quite clear to me yet what it is we want to assert.

Right, the details are not obvious to me either; it would probably require some exploration before we can formalize what it is exactly that we want to verify, since there is no specification (as far as I know) of what is expected for the memory subgraph in terms of liveness and interference.
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/22852#discussion_r1950405760

From rcastanedalo at openjdk.org Tue Feb 11 10:30:29 2025
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Tue, 11 Feb 2025 10:30:29 GMT
Subject: RFR: 8348645: IGV: visualize live ranges
Message-ID:

This changeset extends IGV with live range visualization. It introduces live ranges as first-class IGV entities and displays them along with the control-flow graph in the CFG view. Visualizing liveness information should hopefully make C2's register allocator easier to understand, diagnose, debug, and enhance.

Live ranges are visible in C2 phases where liveness information is available, that is, phases `Initial liveness` to `Fix up spills` at IGV print level 4 or greater. For example, running a debug build of the JVM as follows:

    java -Xbatch -XX:CompileCommand=IGVPrintLevel,java.util.HashMap::newNode,4

produces the following visualization for the `Initial spilling` phase:

![initial-spilling](https://github.com/user-attachments/assets/1ecf74f5-92a8-4866-b1ec-2323bb0c428e)

Live ranges are first-class IGV entities, meaning that the user can:

- search, select, and extract them;

![search-extract](https://github.com/user-attachments/assets/8e0dfa59-457f-49cb-b2b5-1d202301c79d)

- examine their properties in the `Properties` window or via tooltips;

![properties](https://github.com/user-attachments/assets/68d2d23b-b986-4d2e-835c-b661bce0de23)

- navigate to related IGV entities via a pop-up menu; and

![popup](https://github.com/user-attachments/assets/21de2fef-d36a-42d5-b828-2696d87a18ea)

- program filters that act on them according to their properties.

![filters](https://github.com/user-attachments/assets/e993b067-d0b8-452c-a885-c4e601e31e1c)

Live ranges are connected to nodes by a use-def relation: a node can define zero or one live ranges, and use multiple live ranges; a live range can be defined and used by multiple nodes. Consequently, a live range in IGV is visible if and only if all its related nodes are visible (fully or semi-transparently). Generally, the start and end of a live range are vertically aligned with the nodes that first define and last use the live range. To reflect accurately the semantics of Phi nodes w.r.t. liveness, the visualization treats live ranges related by Phi nodes specially: live ranges used by a Phi node end at the bottom of the corresponding predecessor basic blocks, whereas live ranges defined by a Phi node start at the top of the node's basic block. The following screenshot shows an example of a Phi node (`48 Phi`) joining live ranges `L8` and `L13` into `L15`:

![phi](https://github.com/user-attachments/assets/0ef8aa1d-523d-4391-982e-6b74c2016a3c)

The changeset extends the IGV graph printing logic in HotSpot to emit basic block-level liveness information and live range properties such as associated register mask, score, and type. IGV then propagates the block-level liveness information down to individual nodes. Passing only basic block-level liveness information makes the graph serialization compact, limiting the size increase of the corresponding graphs to around 25%.

The IGV changes do not affect layout performance significantly in the sea-of-nodes view, and only introduce a moderate overhead (of around 10%) when displaying graphs with associated live ranges in the control-flow graph view.

Thanks to Damon Fenacci and Daniel Lundén for providing valuable feedback!

#### Testing

- tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode).
- Tested IGV manually on a few selected graphs. Tested automatically that displaying thousands of graphs using different views and filter combinations does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`).
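
The description above says that IGV receives liveness per basic block and propagates it down to individual nodes. A minimal sketch of one way such a propagation could work is a standard backward walk over the block's schedule. This is an assumption for illustration only and is not taken from the IGV sources; `SimpleNode` and `live_after_each_node` are hypothetical names, and Phi nodes, which the changeset treats specially, are ignored here.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical, simplified node: at most one defined live range
// (-1 means "defines nothing") and any number of used live ranges.
struct SimpleNode {
  int def;
  std::vector<int> uses;
};

// Given the nodes of one basic block in schedule order and the set of live
// ranges that are live at the end of the block, returns for each node the
// set of live ranges that are live immediately after that node.
std::vector<std::set<int>> live_after_each_node(
    const std::vector<SimpleNode>& block, const std::set<int>& live_out) {
  std::vector<std::set<int>> live_after(block.size());
  std::set<int> live = live_out;
  // Walk the block backwards, from the last node to the first.
  for (std::size_t i = block.size(); i-- > 0;) {
    live_after[i] = live;
    const SimpleNode& n = block[i];
    if (n.def >= 0) {
      live.erase(n.def);  // a live range is not live before its definition
    }
    for (int u : n.uses) {
      assert(u >= 0);     // live range ids are non-negative in this sketch
      live.insert(u);     // a used live range must be live before the use
    }
  }
  return live_after;
}
```

For a block where node 0 defines `L1`, node 1 uses `L1` and defines `L2`, and node 2 uses `L2`, with nothing live at the block's end, the sketch reports `L1` live after node 0, `L2` live after node 1, and nothing live after node 2.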
------------- Commit messages: - Fix def live range printing in 'Show liveness information' filter - Ensure basic blocks with live ranges are always visible - Ensure live ranges to go to from pop-up menu become visible - Select all live range sements and not just a representative one - Remove comments that are no longer needed - Remove unused option to use definer node ids instead of live range ids - Update copyright headers - Let phi-defined live ranges start at the top of the basic block - Draw segments in empty blocks - Complete dump of live range properties - ... and 32 more: https://git.openjdk.org/jdk/compare/30f71622...c5e48e46 Changes: https://git.openjdk.org/jdk/pull/23558/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23558&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8348645 Stats: 2430 lines in 52 files changed: 2236 ins; 108 del; 86 mod Patch: https://git.openjdk.org/jdk/pull/23558.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23558/head:pull/23558 PR: https://git.openjdk.org/jdk/pull/23558 From thartmann at openjdk.org Tue Feb 11 12:43:21 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 11 Feb 2025 12:43:21 GMT Subject: RFR: 8349820: Temporarily increase MemLimit for tests until JDK-8349772 and JDK-8337821 are fixed Message-ID: Let's increase the MemLimit for the following tests until [JDK-8349772](https://bugs.openjdk.org/browse/JDK-8349772) and [JDK-8337821](https://bugs.openjdk.org/browse/JDK-8337821) are fixed: test/hotspot/jtreg/vmTestbase/jit/t/t105/t105.java test/hotspot/jtreg/vmTestbase/vm/mlvm/meth/stress/compiler/i2c_c2i/Test.java Thanks, Tobias ------------- Commit messages: - 8349820: Temporarily increase MemLimit for tests until JDK-8349772 and JDK-8337821 are fixed Changes: https://git.openjdk.org/jdk/pull/23561/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23561&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349820 Stats: 4 lines in 2 files changed: 0 ins; 0 del; 4 mod Patch: 
https://git.openjdk.org/jdk/pull/23561.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23561/head:pull/23561

PR: https://git.openjdk.org/jdk/pull/23561

From jsjolen at openjdk.org Tue Feb 11 12:45:25 2025
From: jsjolen at openjdk.org (Johan Sjölen)
Date: Tue, 11 Feb 2025 12:45:25 GMT
Subject: RFR: 8348645: IGV: visualize live ranges
In-Reply-To:
References:
Message-ID:

On Tue, 11 Feb 2025 09:59:48 GMT, Roberto Castañeda Lozano wrote:

> This changeset extends IGV with live range visualization. It introduces live ranges as first-class IGV entities and displays them along with the control-flow graph in the CFG view. Visualizing liveness information should hopefully make C2's register allocator easier to understand, diagnose, debug, and enhance.
>
> Live ranges are visible in C2 phases where liveness information is available, that is, phases `Initial liveness` to `Fix up spills` at IGV print level 4 or greater. For example, running a debug build of the JVM as follows:
>
>
> java -Xbatch -XX:CompileCommand=IGVPrintLevel,java.util.HashMap::newNode,4
>
>
> produces the following visualization for the `Initial spilling` phase:
>
> ![initial-spilling](https://github.com/user-attachments/assets/1ecf74f5-92a8-4866-b1ec-2323bb0c428e)
>
> Live ranges are first-class IGV entities, meaning that the user can:
>
> - search, select, and extract them;
>
> ![search-extract](https://github.com/user-attachments/assets/8e0dfa59-457f-49cb-b2b5-1d202301c79d)
>
> - examine their properties in the `Properties` window or via tooltips;
>
> ![properties](https://github.com/user-attachments/assets/68d2d23b-b986-4d2e-835c-b661bce0de23)
>
> - navigate to related IGV entities via a pop-up menu; and
>
> ![popup](https://github.com/user-attachments/assets/21de2fef-d36a-42d5-b828-2696d87a18ea)
>
> - program filters that act on them according to their properties.
>
> ![filters](https://github.com/user-attachments/assets/e993b067-d0b8-452c-a885-c4e601e31e1c)
>
> Live ranges are connected to nodes by a use-def relation: a node can define zero or one live ranges, and use multiple live ranges; a live range can be defined and used by multiple nodes. Consequently, a live range in IGV is visible if and only if all its related nodes are visible (fully or semi-transparently). Generally, the start and end of a live range are vertically aligned with the nodes that first define and last use the live range. To reflect accurately the semantics of Phi nodes w.r.t. liveness, the visualization treats live ranges related by Phi nodes specially: live ranges used by a Phi node end at the bottom of the corresponding predecessor basic blocks, whereas live ranges defined by a Phi node start at the top of the node's basic block. The following screenshot shows an example of a Phi node (`48 Phi`) joining live ranges `L8` and `L13` into `L15`:
>
> ![phi](https://github.com/user-attachments/assets/0ef8aa1d-523d-4391-982e-6b74c2016a3c)
>
> The changeset extends the IGV graph printing logic in HotSpot t...

src/hotspot/share/opto/idealGraphPrinter.cpp line 1004:

> 1002:     if (lrg._msize_valid && lrg._degree_valid && lrg.lo_degree()) {
> 1003:       print_prop("trivial", TRUE_VALUE);
> 1004:     }

Hi Roberto,

This is just a drive-by comment and I know that this style is standard in the IGP source code. However, have you considered re-writing this in the style of setting up the data and then looping over the data in order to print it?
Here's an example transformation I did from some other code in IGP:

```c++
// Before
if (flags & Node::Flag_is_Copy) {
  print_prop("is_copy", "true");
}
if (flags & Node::Flag_rematerialize) {
  print_prop("rematerialize", "true");
}
if (flags & Node::Flag_needs_anti_dependence_check) {
  print_prop("needs_anti_dependence_check", "true");
}
if (flags & Node::Flag_is_macro) {
  print_prop("is_macro", "true");
}
if (flags & Node::Flag_is_Con) {
  print_prop("is_con", "true");
}
if (flags & Node::Flag_is_cisc_alternate) {
  print_prop("is_cisc_alternate", "true");
}
if (flags & Node::Flag_is_dead_loop_safe) {
  print_prop("is_dead_loop_safe", "true");
}
if (flags & Node::Flag_may_be_short_branch) {
  print_prop("may_be_short_branch", "true");
}
if (flags & Node::Flag_has_call) {
  print_prop("has_call", "true");
}
if (flags & Node::Flag_has_swapped_edges) {
  print_prop("has_swapped_edges", "true");
}
```

```c++
// After
struct BKV { int r; const char *name, *v; };
const BKV r[] = {
  { flags & Node::Flag_is_Copy,                     "is_copy",                     "true" },
  { flags & Node::Flag_rematerialize,               "rematerialize",               "true" },
  { flags & Node::Flag_needs_anti_dependence_check, "needs_anti_dependence_check", "true" },
  { flags & Node::Flag_is_macro,                    "is_macro",                    "true" },
  { flags & Node::Flag_is_Con,                      "is_con",                      "true" },
  { flags & Node::Flag_is_cisc_alternate,           "is_cisc_alternate",           "true" },
  { flags & Node::Flag_is_dead_loop_safe,           "is_dead_loop_safe",           "true" },
  { flags & Node::Flag_may_be_short_branch,         "may_be_short_branch",         "true" },
  { flags & Node::Flag_has_call,                    "has_call",                    "true" },
  { flags & Node::Flag_has_swapped_edges,           "has_swapped_edges",           "true" }
};
for (size_t i = 0; i < sizeof(r) / sizeof(BKV); i++) {
  if (r[i].r != 0) {
    print_prop(r[i].name, r[i].v);
  }
}
```

You save a lot of lines of code :-).
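
As a self-contained illustration of the table-driven style suggested above, the following sketch can be compiled outside HotSpot. Everything in it is hypothetical: the `Flag` enum stands in for `Node::Flag_*`, and `set_flag_names` collects property names where the real printer would call `print_prop`. Unlike the snippet above, the table stores bit masks rather than pre-evaluated expressions, so it does not depend on `flags` and can be a static constant.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical flag bits, standing in for Node::Flag_* in HotSpot.
enum Flag : std::uint32_t {
  Flag_is_Copy       = 1u << 0,
  Flag_rematerialize = 1u << 1,
  Flag_is_macro      = 1u << 2,
};

// One table row: which bit to test and which property name to emit.
struct FlagName {
  std::uint32_t mask;
  const char*   name;
};

// The table does not depend on any particular flags value.
static const FlagName kFlagNames[] = {
  { Flag_is_Copy,       "is_copy"       },
  { Flag_rematerialize, "rematerialize" },
  { Flag_is_macro,      "is_macro"      },
};

// Returns the names of all properties whose bit is set in `flags`, in table
// order. A real printer would emit each name as a property here instead of
// collecting it.
std::vector<std::string> set_flag_names(std::uint32_t flags) {
  std::vector<std::string> names;
  for (const FlagName& row : kFlagNames) {
    assert(row.mask != 0);  // every row must test at least one bit
    if ((flags & row.mask) != 0) {
      names.push_back(row.name);
    }
  }
  return names;
}
```

Adding a new flag then means adding one table row, and the loop stays unchanged.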
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23558#discussion_r1950785708 From rcastanedalo at openjdk.org Tue Feb 11 12:59:11 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 11 Feb 2025 12:59:11 GMT Subject: RFR: 8349820: Temporarily increase MemLimit for tests until JDK-8349772 and JDK-8337821 are fixed In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 12:39:00 GMT, Tobias Hartmann wrote: > Let's increase the MemLimit for the following tests until [JDK-8349772](https://bugs.openjdk.org/browse/JDK-8349772) and [JDK-8337821](https://bugs.openjdk.org/browse/JDK-8337821) are fixed: > > test/hotspot/jtreg/vmTestbase/jit/t/t105/t105.java > test/hotspot/jtreg/vmTestbase/vm/mlvm/meth/stress/compiler/i2c_c2i/Test.java > > > Thanks, > Tobias Looks good. ------------- Marked as reviewed by rcastanedalo (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23561#pullrequestreview-2608666914 From epeter at openjdk.org Tue Feb 11 13:02:10 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 11 Feb 2025 13:02:10 GMT Subject: RFR: 8349820: Temporarily increase MemLimit for tests until JDK-8349772 and JDK-8337821 are fixed In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 12:39:00 GMT, Tobias Hartmann wrote: > Let's increase the MemLimit for the following tests until [JDK-8349772](https://bugs.openjdk.org/browse/JDK-8349772) and [JDK-8337821](https://bugs.openjdk.org/browse/JDK-8337821) are fixed: > > test/hotspot/jtreg/vmTestbase/jit/t/t105/t105.java > test/hotspot/jtreg/vmTestbase/vm/mlvm/meth/stress/compiler/i2c_c2i/Test.java > > > Thanks, > Tobias Marked as reviewed by epeter (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/23561#pullrequestreview-2608673835 From duke at openjdk.org Tue Feb 11 13:28:18 2025 From: duke at openjdk.org (duke) Date: Tue, 11 Feb 2025 13:28:18 GMT Subject: Withdrawn: 8342676: Unsigned Vector Min / Max transforms In-Reply-To: <21riF_Q0FMyzOh_sakTclKfYa-nJm4klfkyHEYi4ctI=.76933a14-fb5e-447e-873a-59a2b870b842@github.com> References: <21riF_Q0FMyzOh_sakTclKfYa-nJm4klfkyHEYi4ctI=.76933a14-fb5e-447e-873a-59a2b870b842@github.com> Message-ID: On Mon, 21 Oct 2024 11:01:04 GMT, Jatin Bhateja wrote: > Adding following IR transforms for unsigned vector Min / Max nodes. > > => UMinV (UMinV(a, b), UMaxV(a, b)) => UMinV(a, b) > => UMinV (UMinV(a, b), UMaxV(b, a)) => UMinV(a, b) > => UMaxV (UMinV(a, b), UMaxV(a, b)) => UMaxV(a, b) > => UMaxV (UMinV(a, b), UMaxV(b, a)) => UMaxV(a, b) > => UMaxV (a, a) => a > => UMinV (a, a) => a > > New IR validation test accompanies the patch. > > This is a follow-up PR for https://github.com/openjdk/jdk/pull/20507 > > Best Regards, > Jatin This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/21604 From thartmann at openjdk.org Tue Feb 11 13:45:12 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 11 Feb 2025 13:45:12 GMT Subject: RFR: 8349820: Temporarily increase MemLimit for tests until JDK-8349772 and JDK-8337821 are fixed In-Reply-To: References: Message-ID: <6xw6Gfttt6lbNgBtM3O0NiIxcCp0U9TnxqA-OmwhYGk=.8401fc5f-f426-42a2-88d9-34449292bbc5@github.com> On Tue, 11 Feb 2025 12:39:00 GMT, Tobias Hartmann wrote: > Let's increase the MemLimit for the following tests until [JDK-8349772](https://bugs.openjdk.org/browse/JDK-8349772) and [JDK-8337821](https://bugs.openjdk.org/browse/JDK-8337821) are fixed: > > test/hotspot/jtreg/vmTestbase/jit/t/t105/t105.java > test/hotspot/jtreg/vmTestbase/vm/mlvm/meth/stress/compiler/i2c_c2i/Test.java > > > Thanks, > Tobias Thanks Roberto and Daniel! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23561#issuecomment-2650865547 From thartmann at openjdk.org Tue Feb 11 14:00:22 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 11 Feb 2025 14:00:22 GMT Subject: Integrated: 8349820: Temporarily increase MemLimit for tests until JDK-8349772 and JDK-8337821 are fixed In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 12:39:00 GMT, Tobias Hartmann wrote: > Let's increase the MemLimit for the following tests until [JDK-8349772](https://bugs.openjdk.org/browse/JDK-8349772) and [JDK-8337821](https://bugs.openjdk.org/browse/JDK-8337821) are fixed: > > test/hotspot/jtreg/vmTestbase/jit/t/t105/t105.java > test/hotspot/jtreg/vmTestbase/vm/mlvm/meth/stress/compiler/i2c_c2i/Test.java > > > Thanks, > Tobias This pull request has now been integrated. Changeset: ee079fdb Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/ee079fdbf1c513a4c57ef86a803eb0add651c539 Stats: 4 lines in 2 files changed: 0 ins; 0 del; 4 mod 8349820: Temporarily increase MemLimit for tests until JDK-8349772 and JDK-8337821 are fixed Reviewed-by: rcastanedalo, epeter ------------- PR: https://git.openjdk.org/jdk/pull/23561 From rcastanedalo at openjdk.org Tue Feb 11 14:10:12 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 11 Feb 2025 14:10:12 GMT Subject: RFR: 8348645: IGV: visualize live ranges In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 12:42:15 GMT, Johan Sj?len wrote: >> This changeset extends IGV with live range visualization. It introduces live ranges as first-class IGV entities and displays them along with the control-flow graph in the CFG view. Visualizing liveness information should hopefully make C2's register allocator easier to understand, diagnose, debug, and enhance. >> >> Live ranges are visible in C2 phases where liveness information is available, that is, phases `Initial liveness` to `Fix up spills` at IGV print level 4 or greater. 
For example, running a debug build of the JVM as follows: >> >> >> java -Xbatch -XX:CompileCommand=IGVPrintLevel,java.util.HashMap::newNode,4 >> >> >> produces the following visualization for the `Initial spilling` phase: >> >> ![initial-spilling](https://github.com/user-attachments/assets/1ecf74f5-92a8-4866-b1ec-2323bb0c428e) >> >> Live ranges are first-class IGV entities, meaning that the user can: >> >> - search, select, and extract them; >> >> ![search-extract](https://github.com/user-attachments/assets/8e0dfa59-457f-49cb-b2b5-1d202301c79d) >> >> - examine their properties in the `Properties` window or via tooltips; >> >> ![properties](https://github.com/user-attachments/assets/68d2d23b-b986-4d2e-835c-b661bce0de23) >> >> - navigate to related IGV entities via a pop-up menu; and >> >> ![popup](https://github.com/user-attachments/assets/21de2fef-d36a-42d5-b828-2696d87a18ea) >> >> - program filters that act om them according to their properties. >> >> ![filters](https://github.com/user-attachments/assets/e993b067-d0b8-452c-a885-c4e601e31e1c) >> >> Live ranges are connected to nodes by a use-def relation: a node can define zero or one live ranges, and use multiple live ranges; a live range can be defined and used by multiple nodes. Consequently, a live range in IGV is visible if and only if all its related nodes are visible (fully or semi-transparently). Generally, the start and end of a live range are vertically aligned with the nodes that first define and last use the live range. To reflect accurately the semantics of Phi nodes w.r.t. liveness, the visualization treats live ranges related by Phi nodes specially: live ranges used by a Phi node end at the bottom of the corresponding predecessor basic blocks, whereas live ranges defined by a Phi node start at the top of the node's basic block. 
The following screenshot shows an example of a Phi node (`48 Phi`) joining live ranges `L8` and `L13` into `L15`: >> >> ![phi](https://github.com/user-attachments/assets/0ef8aa1d-523d-4391-982e-6b74c2016a3c... > > src/hotspot/share/opto/idealGraphPrinter.cpp line 1004: > >> 1002: if (lrg._msize_valid && lrg._degree_valid && lrg.lo_degree()) { >> 1003: print_prop("trivial", TRUE_VALUE); >> 1004: } > > Hi Roberto, > > Ihis is just a drive-by comment and I know that this style is standard in the IGP source code. However, have you considered re-writing this in the style of setting up the data and then looping over the data in order to print it? > > Here's an example transformation I did from some other code in IGP: > > ```c++ > // Before > if (flags & Node::Flag_is_Copy) { > print_prop("is_copy", "true"); > } > if (flags & Node::Flag_rematerialize) { > print_prop("rematerialize", "true"); > } > if (flags & Node::Flag_needs_anti_dependence_check) { > print_prop("needs_anti_dependence_check", "true"); > } > if (flags & Node::Flag_is_macro) { > print_prop("is_macro", "true"); > } > if (flags & Node::Flag_is_Con) { > print_prop("is_con", "true"); > } > if (flags & Node::Flag_is_cisc_alternate) { > print_prop("is_cisc_alternate", "true"); > } > if (flags & Node::Flag_is_dead_loop_safe) { > print_prop("is_dead_loop_safe", "true"); > } > if (flags & Node::Flag_may_be_short_branch) { > print_prop("may_be_short_branch", "true"); > } > if (flags & Node::Flag_has_call) { > print_prop("has_call", "true"); > } > if (flags & Node::Flag_has_swapped_edges) { > print_prop("has_swapped_edges", "true"); > } > > ```c++ > // After > struct BKV { int r; const char *name, *v; }; > const BKV r[] = { > { flags & Node::Flag_is_Copy, "is_copy", "true"}, > { flags & Node::Flag_rematerialize, "rematerialize", "true"}, > {flags & Node::Flag_needs_anti_dependence_check, "needs_anti_dependence_check", "true"}, > { flags & Node::Flag_is_macro, "is_macro", "true"}, > { flags & Node::Flag_is_Con, 
"is_con", "true"}, > { flags & Node::Flag_is_cisc_alternate, "is_cisc_alternate", "true"}, > { flags & Node::Flag_is_dead_loop_safe, "is_dead_loop_safe", "true"}, > { flags & Node::Flag_may_be_short_branch, "may_be_short_branch", "true"}, > { flags & Node::Flag_has_call, "has_call", "true"}, > { flags & Node::Flag_has_swapped_edges, ... Thanks for the suggestion, Johan! I like the proposal but I think it is best left out as a separate RFE, to make sure it is applied consistently to the entire IGV graph printing code. I created [JDK-8349835](https://bugs.openjdk.org/browse/JDK-8349835) for that. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23558#discussion_r1950919420 From duke at openjdk.org Tue Feb 11 15:33:18 2025 From: duke at openjdk.org (duke) Date: Tue, 11 Feb 2025 15:33:18 GMT Subject: RFR: 8349579: jsvml.dll incorrect RDATA SEGMENT specification [v2] In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 03:44:27 GMT, Mohamed Issa wrote: >> A fix for incorrectly defined program segments in Windows SVML assembly. >> >> - Changes _READ_ to _READONLY_ in all math functions >> - Now compliant with MASM x86 and x86_64 program segment [specification](https://learn.microsoft.com/en-us/cpp/assembler/masm/segment?view=msvc-170) >> >> The tier1 tests show the changes didn't introduce new failures. > > Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: > > Update full name @missa-prime Your change (at version f29de5c1c3f4f7f62f20d37d4f02b0a37aa357f4) is now ready to be sponsored by a Committer. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23503#issuecomment-2651173928 From duke at openjdk.org Tue Feb 11 15:36:25 2025 From: duke at openjdk.org (Mohamed Issa) Date: Tue, 11 Feb 2025 15:36:25 GMT Subject: Integrated: 8349579: jsvml.dll incorrect RDATA SEGMENT specification In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 21:34:39 GMT, Mohamed Issa wrote: > A fix for incorrectly defined program segments in Windows SVML assembly. > > - Changes _READ_ to _READONLY_ in all math functions > - Now compliant with MASM x86 and x86_64 program segment [specification](https://learn.microsoft.com/en-us/cpp/assembler/masm/segment?view=msvc-170) > > The tier1 tests show the changes didn't introduce new failures. This pull request has now been integrated. Changeset: a1bcda24 Author: Mohamed Issa Committer: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/a1bcda247617a839cd797bdd8bd3bf3216dff8a8 Stats: 39 lines in 36 files changed: 0 ins; 0 del; 39 mod 8349579: jsvml.dll incorrect RDATA SEGMENT specification Reviewed-by: sviswanathan, jbhateja ------------- PR: https://git.openjdk.org/jdk/pull/23503 From lucy at openjdk.org Tue Feb 11 17:15:28 2025 From: lucy at openjdk.org (Lutz Schmidt) Date: Tue, 11 Feb 2025 17:15:28 GMT Subject: RFR: 8341908: CodeHeapAnalytics: Output Imperfections and unwanted vm termination [v2] In-Reply-To: References: Message-ID: > Output is properly aligned again now. Was messed up when method hotness was removed (part of method sweeper). > Assertions have been replaced by printing an error message and gracefully returning. Avoids vm crashes caused by diagnostic actions. > Some code restructuring, removal of redundancies. > > Reviews are highly welcomed. Lutz Schmidt has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains one commit: 8341908: Resolve merge conflict ------------- Changes: https://git.openjdk.org/jdk/pull/21452/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21452&range=01 Stats: 147 lines in 2 files changed: 72 ins; 29 del; 46 mod Patch: https://git.openjdk.org/jdk/pull/21452.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21452/head:pull/21452 PR: https://git.openjdk.org/jdk/pull/21452 From lucy at openjdk.org Tue Feb 11 17:15:28 2025 From: lucy at openjdk.org (Lutz Schmidt) Date: Tue, 11 Feb 2025 17:15:28 GMT Subject: RFR: 8341908: CodeHeapAnalytics: Output Imperfections and unwanted vm termination In-Reply-To: References: Message-ID: On Thu, 10 Oct 2024 14:45:55 GMT, Lutz Schmidt wrote: > Output is properly aligned again now. Was messed up when method hotness was removed (part of method sweeper). > Assertions have been replaced by printing an error message and gracefully returning. Avoids vm crashes caused by diagnostic actions. > Some code restructuring, removal of redundancies. > > Reviews are highly welcomed. Sorry for force-pushing. Switching to private account and equipment messed up my local repositories. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21452#issuecomment-2651497237 From dlunden at openjdk.org Tue Feb 11 18:54:27 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 11 Feb 2025 18:54:27 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v13] In-Reply-To: References: <-KC83AQUZqpnDH0lG1yvKaVUV3H5kSN8cQLU1x4e06o=.dbe58228-89fe-4128-9cdb-3953d9215d59@github.com> Message-ID: On Tue, 11 Feb 2025 08:17:23 GMT, Roberto Castañeda Lozano wrote: > Up to you whether to integrate this point fix and continue your investigation separately or wait until you have explored along this line before integration.
I have considered and implemented a couple of alternative fixes today, but they are not really more elegant than the fix in this PR. If I want to fix the memory graph at loop cloning, what I'm really doing is duplicating the Phi idealization that we already have. So, then I think it would make most sense to work out the combinatorial issues with option 2 that I posted above (for making the Phi idealization less restrictive). I'm leaning towards integrating this for now, but will explore a bit further first. > Right, the details are not obvious to me either, it would probably require some exploration before we can formalize what it is exactly that we want to verify, since there is no specification (as far as I know) of what is expected for the memory subgraph in terms of liveness and interference. Let's discuss this offline! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22852#discussion_r1951406631 From bkilambi at openjdk.org Tue Feb 11 20:27:44 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 11 Feb 2025 20:27:44 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector Message-ID: This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. 
This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor.

Performance numbers with this patch on a 128-bit, SVE2 supporting machine are shown below -

Benchmark                                   (size)   Mode  Cnt   Gain
SelectFromBenchmark.selectFromByteVector      1024  thrpt    9   1.43
SelectFromBenchmark.selectFromByteVector      2048  thrpt    9   1.48
SelectFromBenchmark.selectFromDoubleVector    1024  thrpt    9  68.55
SelectFromBenchmark.selectFromDoubleVector    2048  thrpt    9  72.07
SelectFromBenchmark.selectFromFloatVector     1024  thrpt    9   1.69
SelectFromBenchmark.selectFromFloatVector     2048  thrpt    9   1.52
SelectFromBenchmark.selectFromIntVector       1024  thrpt    9   1.50
SelectFromBenchmark.selectFromIntVector       2048  thrpt    9   1.52
SelectFromBenchmark.selectFromLongVector      1024  thrpt    9  85.38
SelectFromBenchmark.selectFromLongVector      2048  thrpt    9  80.93
SelectFromBenchmark.selectFromShortVector     1024  thrpt    9   1.48
SelectFromBenchmark.selectFromShortVector     2048  thrpt    9   1.49

Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander.
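For readers unfamiliar with the operation, here is a minimal scalar sketch of what a two-table select computes and of the "two rearrange and one blend" lowering mentioned above. It is not the HotSpot implementation: the names `Vec`, `kLanes`, `select_from_two`, `rearrange` and `select_lowered` are hypothetical, and the index wrap-around (masking with `2 * kLanes - 1`) is an assumption about the Vector API semantics rather than something stated in this thread.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 4;        // illustrative lane count (power of two)
using Vec = std::array<int, kLanes>;

// Direct form: each index picks a lane from the concatenation [a | b].
// This is the shape a two-table lookup instruction can do in one step.
Vec select_from_two(const Vec& idx, const Vec& a, const Vec& b) {
  Vec out{};
  for (std::size_t i = 0; i < kLanes; ++i) {
    // Assumed semantics: indexes wrap into [0, 2 * kLanes).
    std::size_t w = static_cast<std::size_t>(idx[i]) & (2 * kLanes - 1);
    out[i] = (w < kLanes) ? a[w] : b[w - kLanes];
  }
  return out;
}

// Single-table rearrange with indexes wrapped into [0, kLanes).
Vec rearrange(const Vec& idx, const Vec& src) {
  Vec out{};
  for (std::size_t i = 0; i < kLanes; ++i) {
    out[i] = src[static_cast<std::size_t>(idx[i]) & (kLanes - 1)];
  }
  return out;
}

// Lowered form: two rearranges followed by one blend. The blend mask is
// "wrapped index falls in the second table".
Vec select_lowered(const Vec& idx, const Vec& a, const Vec& b) {
  Vec from_a = rearrange(idx, a);
  Vec from_b = rearrange(idx, b);
  Vec out{};
  for (std::size_t i = 0; i < kLanes; ++i) {
    std::size_t w = static_cast<std::size_t>(idx[i]) & (2 * kLanes - 1);
    out[i] = (w < kLanes) ? from_a[i] : from_b[i];
  }
  return out;
}
```

Under these assumptions both functions return the same result for any index vector; the direct form simply needs one lookup where the lowered form needs three vector operations.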
------------- Commit messages: - 8348868: AArch64: Add backend support for SelectFromTwoVector Changes: https://git.openjdk.org/jdk/pull/23570/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8348868 Stats: 317 lines in 9 files changed: 221 ins; 0 del; 96 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From lucy at openjdk.org Tue Feb 11 21:06:48 2025 From: lucy at openjdk.org (Lutz Schmidt) Date: Tue, 11 Feb 2025 21:06:48 GMT Subject: RFR: 8341908: CodeHeapAnalytics: Output Imperfections and unwanted vm termination [v3] In-Reply-To: References: Message-ID: > Output is properly aligned again now. Was messed up when method hotness was removed (part of method sweeper). > Assertions have been replaced by printing an error message and gracefully returning. Avoids vm crashes caused by diagnostic actions. > Some code restructuring, removal of redundancies. > > Reviews are highly welcomed. 
Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision: 8341908: fix make error ------------- Changes: - all: https://git.openjdk.org/jdk/pull/21452/files - new: https://git.openjdk.org/jdk/pull/21452/files/81719a36..a2a6fd75 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=21452&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21452&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/21452.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21452/head:pull/21452 PR: https://git.openjdk.org/jdk/pull/21452 From lmesnik at openjdk.org Tue Feb 11 22:43:35 2025 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Tue, 11 Feb 2025 22:43:35 GMT Subject: RFR: 8339889: Several compiler tests ignore vm flags and not marked as flagless [v2] In-Reply-To: <5acZ_FmW23VeDgOFMiEuUa60TLxaOcC3wWZVwHFh8EU=.95188fc9-2f54-47af-a91c-4855db76f399@github.com> References: <5acZ_FmW23VeDgOFMiEuUa60TLxaOcC3wWZVwHFh8EU=.95188fc9-2f54-47af-a91c-4855db76f399@github.com> Message-ID: > Tests > compiler/c2/TestReduceAllocationAndHeapDump.java > compiler/calls/NativeCalls.java > compiler/debug/TestStress.java > compiler/inlining/TestDuplicatedLateInliningOutput.java > ignore vm flags using limited process builder and not marked as flagless. > > Please note that test > compiler/inlining/TestDuplicatedLateInliningOutput.java > is failing with some VM flags. See > https://bugs.openjdk.org/browse/JDK-8348214 > > I haven't excluded test, since it fail with certain non-common flags only. Leonid Mesnik has updated the pull request incrementally with one additional commit since the last revision: test updated as suggested. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/23224/files - new: https://git.openjdk.org/jdk/pull/23224/files/554f00b3..80796e2a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23224&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23224&range=00-01 Stats: 3 lines in 1 file changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23224.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23224/head:pull/23224 PR: https://git.openjdk.org/jdk/pull/23224 From kvn at openjdk.org Tue Feb 11 23:12:29 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Feb 2025 23:12:29 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: <8Qdk6YuPBt5zXcv5DX0UaeOHHGRoNDFzysLurnZ7hsY=.62db16a3-fae0-4934-ac2c-91de1f95b689@github.com> On Fri, 24 Jan 2025 20:37:32 GMT, Boris Ulasevich wrote: >> This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache. >> >> The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density. >> >> Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1-2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark. >> >> The numbers. Immutable data constitutes **~30%** of the nmethod. Mutable data constitutes **~8%** of the nmethod. 
Example (statistics collected on the CodeCacheStress benchmark): >> - nmethod_count:134000, total_compilation_time: 510460ms >> - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms, >> - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB >> >> Functional testing: jtreg on arm/aarch/x86. >> Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks. >> >> Alternative solution (see comments): In the future, relocations can be moved to _immutable_data. > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > Force the use of movk in combination with adrp and ldr instructions to address scenarios > where os::malloc allocates buffers beyond the typical ±4GB range accessible with adrp Good work. Here are my comments. And I did not look at AArch64 changes. src/hotspot/share/code/codeBlob.cpp line 72: > 70: size += align_up(cb->total_content_size(), oopSize); > 71: size += align_up(cb->total_oop_size(), oopSize); > 72: size += align_up(cb->total_metadata_size(), oopSize); Please add assert that this is `nmethod` or `cb->total_oop_size()` + `cb->total_metadata_size()` is 0. To make sure we have them only in `nmethod`. src/hotspot/share/code/codeBlob.cpp line 88: > 86: _caller_must_gc_arguments(caller_must_gc_arguments), > 87: _mutable_data(nullptr), > 88: _mutable_data_size(mutable_data_size) Have to be moved up for my suggestion in `codeBlob.hpp` src/hotspot/share/code/codeBlob.cpp line 102: > 100: > 101: if (_mutable_data_size > 0) { > 102: _mutable_data = (address)os::malloc(_mutable_data_size, mtCode); Missing `os::free` in `CodeBlob::purge()`. Should be moved from nmethod. 
src/hotspot/share/code/codeBlob.hpp line 177: > 175: address mutable_data_begin() const { return _mutable_data; } > 176: address mutable_data_end() const { return _mutable_data + _mutable_data_size; } > 177: int mutable_data_size() const { return _mutable_data_size; } May be move down after `blob_end()`. And `relocation_*()` after it, to follow new layout. src/hotspot/share/code/codeBlob.hpp line 181: > 179: address content_end() const { return (address) header_begin() + _data_offset; } > 180: address code_begin() const { return (address) header_begin() + _code_offset; } > 181: // code_end == content_end is true for all types of blobs for now, it is also checked in the constructor Leave this comment. src/hotspot/share/code/nmethod.cpp line 1080: > 1078: #endif > 1079: > 1080: static int required_mutable_data_space(CodeBuffer* code_buffer, `space` -> `size` src/hotspot/share/code/nmethod.cpp line 2148: > 2146: if (_mutable_data != blob_end()) { > 2147: os::free(_mutable_data); > 2148: _mutable_data = blob_end(); // Valid not null address This should be done in `CodeBlob::purge();` src/hotspot/share/code/nmethod.cpp line 3096: > 3094: p2i(mutable_data_begin()), > 3095: p2i(mutable_data_end()), > 3096: immutable_data_size()); `immutable_data_size()` is incorrect here. src/hotspot/share/code/nmethod.cpp line 3142: > 3140: p2i(mutable_data_begin()), > 3141: p2i(mutable_data_end()), > 3142: mutable_data_size()); Duplicated output. Remove. src/hotspot/share/code/nmethod.hpp line 139: > 137: // - header (the nmethod structure) > 138: // [Relocation] > 139: // - relocation information Whole this comment have to be re-written to reflect new layout and additional data sections outside CodeCache. 
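The review comments above about `os::free` ask that the class which allocates the mutable side buffer also be the one that releases it. A minimal sketch of that ownership pattern follows, assuming plain `std::malloc`/`std::free` in place of `os::malloc`/`os::free`. The classes `Blob` and `MethodBlob` are hypothetical stand-ins, not the HotSpot `CodeBlob` and `nmethod`, and the sketch uses `nullptr` for "no buffer" where the patch under review uses `blob_end()` as a non-null sentinel.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical base class: owns an out-of-line mutable data buffer.
class Blob {
 public:
  explicit Blob(std::size_t mutable_data_size)
      : _mutable_data(nullptr), _mutable_data_size(mutable_data_size) {
    if (_mutable_data_size > 0) {
      _mutable_data =
          static_cast<unsigned char*>(std::malloc(_mutable_data_size));
    }
  }
  Blob(const Blob&) = delete;
  Blob& operator=(const Blob&) = delete;
  ~Blob() { purge(); }

  // The allocation happened in this class, so the release lives here too.
  // Calling purge() more than once is harmless.
  void purge() {
    std::free(_mutable_data);  // free(nullptr) is a no-op
    _mutable_data = nullptr;
    _mutable_data_size = 0;
  }

  bool has_mutable_data() const { return _mutable_data != nullptr; }
  std::size_t mutable_data_size() const { return _mutable_data_size; }

 private:
  unsigned char* _mutable_data;
  std::size_t _mutable_data_size;
};

// Hypothetical subclass: releases only what it allocated itself and
// delegates the shared buffer to the base class.
class MethodBlob : public Blob {
 public:
  explicit MethodBlob(std::size_t mutable_data_size) : Blob(mutable_data_size) {}
  void purge() {
    // subclass-specific cleanup would go here
    Blob::purge();
  }
};
```

Keeping allocation and release in the same class means any blob type that gains mutable data later gets the cleanup without each subclass having to remember it.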
------------- PR Review: https://git.openjdk.org/jdk/pull/21276#pullrequestreview-2610156853 PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1951711125 PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1951695961 PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1951697949 PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1951688083 PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1951692337 PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1951713931 PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1951720695 PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1951723676 PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1951726173 PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1951727564 From kvn at openjdk.org Tue Feb 11 23:12:31 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Feb 2025 23:12:31 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: On Thu, 21 Nov 2024 14:12:23 GMT, Boris Ulasevich wrote: >> src/hotspot/share/code/codeBlob.hpp line 129: >> >>> 127: address _mutable_data; >>> 128: int _mutable_data_size; >>> 129: >> >> Should we add special CodeBlob subclass for nmethod to avoid increase of size for all blobs and stubs? > > I am not sure. All CodeBlobs with relocation info needs a mutable data. Let me know if you think it must be a separate subclass. Please, move `_mutable_data` after `_oop_maps` and `_mutable_data_size` after `_size` to avoid padding. Also update comment at line 70 to describe new CodeBlob layout. 
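The request to place `_mutable_data` next to the other pointer and `_mutable_data_size` next to `_size` is about alignment padding. The effect can be seen in a small sketch. The two structs below are hypothetical and contain only four fields, not the real `CodeBlob` layout, and the exact sizes depend on the platform ABI.

```cpp
#include <cstddef>

// Pointers and ints interleaved: on a typical 64-bit ABI each int is
// followed by 4 bytes of padding so the next pointer is 8-byte aligned.
struct Interleaved {
  int   size;
  void* oop_maps;
  int   mutable_data_size;
  void* mutable_data;
};

// Same fields, grouped by type: the two ints share one 8-byte slot.
struct Grouped {
  void* oop_maps;
  void* mutable_data;
  int   size;
  int   mutable_data_size;
};

// Grouping can only help or be neutral, never hurt.
static_assert(sizeof(Grouped) <= sizeof(Interleaved),
              "grouping fields by size should not increase the struct size");

// Bytes saved per object by the grouped layout on the current platform.
constexpr std::size_t padding_saved() {
  return sizeof(Interleaved) - sizeof(Grouped);
}
```

On a common LP64 target this is 32 bytes versus 24 bytes, which matters when the header is replicated for every blob in the code cache.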
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1951678096 From kvn at openjdk.org Tue Feb 11 23:12:31 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Feb 2025 23:12:31 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> <5qBcX1j2O16hvCKyLjknxQqH50qdfwhlQf2P1FEUqEU=.451f7d8b-2bf2-4f28-8c8c-79e7a7b8d613@github.com> Message-ID: On Tue, 10 Dec 2024 22:39:40 GMT, Boris Ulasevich wrote: >> src/hotspot/share/code/codeBlob.hpp line 108: >> >>> 106: >>> 107: int _size; // total size of CodeBlob in bytes >>> 108: int _relocation_size; // size of relocation (could be bigger than 64Kb) >> >> For offsets into the external mutable/immutable data, we could reduce codecache footprint further by moving these into a a header section of the external data block. That also allows those blocks to be self-describing, which could help with error reporting or debugging. > > Sounds reasonable. But the downside is that in this case the oops iterator needs an additional load (nmethod->mutable_data->oop_size) to check if the oops list is empty. Lets not do that. All offsets should be kept in CodeBlob (or nmethod) in CodeCache. No additional loads and simple code for caching code for Leyden. >> src/hotspot/share/code/nmethod.cpp line 2152: >> >>> 2150: delete[] _compiled_ic_data; >>> 2151: >>> 2152: if (_immutable_data != blob_end()) { >> >> Is this just a name change, or a semantic change? > > In several places _immutable_data is set to data_end() "valid not null address" address which was actually the end of the code blob. With this change I remove the _data part of the code blob as well as the data_begin() and data_end() functions. I think blob_end() is a good replacement for data_end() for empty _immutable_data cases. Yes, `blob_end()` (or `code_end()` which is the same) is fine to use here. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1951673194 PR Review Comment: https://git.openjdk.org/jdk/pull/21276#discussion_r1951719410 From kvn at openjdk.org Tue Feb 11 23:58:14 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Feb 2025 23:58:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 10 Feb 2025 03:11:22 GMT, Chris Plummer wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero and Minimal VM builds > > I almost wished I hadn't looked because there is a lot of SA CodeBlob support that could use some cleanup. Most notably I think most of the wrapper subclasses are not needed by SA, and could be served by one common class. See what I'm doing in #23456 for JavaThread subclasses. Wrapper classes don't need to be 1-to-1 with the class type they are wrapping. A single wrapper class type can handle any number of hotspot types. It usually just make more sense for them to be 1-to-1, but when they are trivial and the implementation is replicated across subtypes, just having one wrapper class implement them all can simplify things. > > The other thing I noticed is a lot of the subtypes seem to be doing some unnecessary things like registering Observers, and they all do something like the following: > > 44 Type type = db.lookupType("BufferBlob"); > > Even when it never references "type". > > I'm not suggesting you clean up any of this now, but just pointed it out. I might file an issue and try to clean it up myself at some point. > > I still need to take a closer look at the SA changes. Before I forgot to answer you, @plummercj I completely agree with your comment about cleaning up wrapper subclasses which do nothing. 
I think some wrapper subclasses for CodeBlob were kept because of `is*()` which were used only in `PStack` to print name. Why not use `getName()` for this purpose without big `if/else` there? Another purpose could be a placeholder for additional information in the future, which never came. Other wrappers provide information available in `CodeBlob`. Like `RuntimeStub.callerMustGCArguments()`. `_caller_must_gc_arguments` field has been part of VM's `CodeBlob` class for some time now. Looks like I missed the change in SA when I did the change in VM. So yes, feel free to clean this up. I will help with review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2652321179 From kvn at openjdk.org Wed Feb 12 00:11:28 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 00:11:28 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v3] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. 
> > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Add CodeBlob proxy vtable ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/dda20f0b..43ae0ed2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=01-02 Stats: 322 lines in 13 files changed: 175 ins; 90 del; 57 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Wed Feb 12 00:11:28 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 00:11:28 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 19:43:29 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero and Minimal VM builds I adopted Stefan's suggestion. I agree that it is more "future-proof". I also remove underscore `_` from `CodeBlobKind` names. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2652333587 PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2652335723 From kvn at openjdk.org Wed Feb 12 00:14:31 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 00:14:31 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v4] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Fix Minimal and Zero VM builds again ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/43ae0ed2..7d3dce0e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=02-03 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Wed Feb 12 00:22:31 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 00:22:31 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v5] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> 
References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Fix Minimal and Zero VM builds once more ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/7d3dce0e..1d108349 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=03-04 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From dlong at openjdk.org Wed Feb 12 01:18:03 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 12 Feb 2025 01:18:03 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash Message-ID: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. 
There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. I need help testing those asserts for arm32, ppc, riscv, and s390. ------------- Commit messages: - fix - tighten upper-bound on locals assert - s390 build - update bug id, copyright, in test - s390 build - ppc build - wip Changes: https://git.openjdk.org/jdk/pull/23557/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23557&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8336042 Stats: 142 lines in 8 files changed: 133 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23557.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23557/head:pull/23557 PR: https://git.openjdk.org/jdk/pull/23557 From dlong at openjdk.org Wed Feb 12 01:18:04 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 12 Feb 2025 01:18:04 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash In-Reply-To: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: <1VtIizP7DYsEPervTMwvNJxv0UTKHj5vR8x48Sq43ks=.46017686-9b49-4f16-afb8-a83564bcdb2f@github.com> On Tue, 11 Feb 2025 07:59:01 GMT, Dean Long wrote: > When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. 
> > In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. > > Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. I need help testing those asserts for arm32, ppc, riscv, and s390. src/hotspot/share/runtime/deoptimization.cpp line 754: > 752: int caller_actual_parameters = -1; // value not used except for interpreted frames, see below > 753: if (deopt_sender.is_interpreted_frame()) { > 754: caller_actual_parameters = callee_parameters + (caller_was_method_handle ? 1 : 0); Previously, if caller_was_method_handle was set, we would pass in 0 below, which was wrong for the has_member_arg case, and I suspect it broke JVMTI PopFrame for platforms that don't use popframe_move_outgoing_args, but I don't have a test for this suspicion. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23557#discussion_r1951795243 From xgong at openjdk.org Wed Feb 12 03:05:14 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 12 Feb 2025 03:05:14 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:20:54 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. 
> > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. > > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Good job @Bhavana-Kilambi ! Generally looks good to me. Just some minor issues that I have left the comments. Besides, could you please add some IR tests for this optimization? Thanks! 
src/hotspot/cpu/aarch64/aarch64_vector.ad line 6729: > 6727: // --------------------------------SelectFromTwoVector ----------------------------- > 6728: > 6729: instruct vselect_from_two_vectors_SIFNeon(vReg dst, vReg_V17 src1, vReg_V18 src2, We have a similar rule for `VectorRearrange` such as `rearrange_HS_neon`. To unify, can we use the similar name style for this rule? Suggestion: instruct vselect_from_two_vectors_HS_neon(vReg dst, vReg_V17 src1, vReg_V18 src2, src/hotspot/cpu/aarch64/aarch64_vector.ad line 6736: > 6734: match(Set dst (SelectFromTwoVector (Binary index src1) src2)); > 6735: effect(TEMP_DEF dst, TEMP tmp1, TEMP tmp2); > 6736: format %{ "vselect_from_two_vectors_SIF $dst, $src1, $src2, $index\t# vector (4S/8S/2I/4I/2F/4F). KILL $tmp1, $tmp2" %} Please use the same match rule name in the format. Thanks! src/hotspot/cpu/aarch64/aarch64_vector.ad line 6748: > 6746: %} > 6747: > 6748: instruct vselect_from_two_vectors(vReg dst, vReg_V17 src1, vReg_V18 src2, vReg index) %{ Could you please add comment before the rule why `v17` and `v18` are used explicitly here? ------------- PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-2610660542 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1951891631 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1951891964 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1951908421 From cjplummer at openjdk.org Wed Feb 12 03:06:15 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Wed, 12 Feb 2025 03:06:15 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Tue, 11 Feb 2025 23:55:46 GMT, Vladimir Kozlov wrote: > I think some wrapper subclasses for CodeBlob were kept because of `is*()` which were used only in `PStack` to print name. 
Why not use `getName()` for this purpose without big `if/else` there?

Possibly getName() didn't exist when PStack was first written. It would be good if PStack not only included the type name as it does now, but also the actual name of the blob, which getName() would return.

> An other purpose could be a place holder for additional information in a future which never come.

Yes, and you also see that with the Observer registration and the `Type type = db.lookupType()` code, which are only needed if you are going to lookup fields of the subtypes, which most don't ever do, yet they all have this code.

> Other wrapper provides information available in `CodeBlob`. Like `RuntimeStub.callerMustGCArguments()`. `_caller_must_gc_arguments` field is part of VM's `CodeBlob` class for some time now. Looks like I missed change in SA when did change in VM.

Yeah, that's not working right for CodeBlob subtypes that are not RuntimeStubs. Easy to fix though.

> So yes, feel free to clean this up. I will help with review.

Ok. Let me see where things are at after you are done with the PR.

------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2652549878 From amitkumar at openjdk.org Wed Feb 12 03:08:21 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Wed, 12 Feb 2025 03:08:21 GMT Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v4] In-Reply-To: References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> <-aHCYC9iVc4eMZ3pMfiDpqaW-wGM_s3zRMiVBWoadCM=.910336cd-3be2-45b5-9874-63b71abf38f8@github.com> Message-ID: On Fri, 7 Feb 2025 14:52:42 GMT, Roberto Castañeda Lozano wrote: >>> I see TestG1BarrierGeneration.java failure :( >>> >>> [TestG1BarrierGeneration_jtr.log](https://github.com/user-attachments/files/18676532/TestG1BarrierGeneration_jtr.log) >> >> @offamitkumar thanks for the report!
Most likely the test failures are only due to missing optimizations (because of limitations in the barrier elision pattern matching analysis), but if you want me to confirm please send the entire jtreg log, without truncation. You can disable output truncation running the test like this: >> `make run-test TEST="compiler/gcbarriers/TestG1BarrierGeneration.java" JTREG="MAX_OUTPUT=999999999"` >> Please double-check that the output log file does not contain any `Output overflow` message. > >> @robcasloz Sure: >> >> I can spend time on it, maybe on weekend, for now I am overloaded with some other tasks. >> >> [TestG1BarrierGeneration_jtr_no_overflow.log](https://github.com/user-attachments/files/18706090/TestG1BarrierGeneration_jtr_no_overflow.log) > > Thanks Amit, I had a look and the failures are indeed due to missing barrier elisions for atomic operations on newly created objects, which is suboptimal but safe (and in practice unlikely to make a noticeable performance difference). I just disabled IR checks for the two affected tests on s390 by now (commit 956e0ac5). The issue is likely due to limitations in the pattern matching logic of barrier elision, but I do not have the proper means to debug it on s390. If you find a solution before this changeset is fully reviewed, feel free to propose a patch and I will merge it into the changeset. Otherwise, it can always be done as follow-up work. Hope this works for you! > > @robcasloz Sure: > > I can spend time on it, maybe on weekend, for now I am overloaded with some other tasks. > > [TestG1BarrierGeneration_jtr_no_overflow.log](https://github.com/user-attachments/files/18706090/TestG1BarrierGeneration_jtr_no_overflow.log) > > Thanks Amit, I had a look and the failures are indeed due to missing barrier elisions for atomic operations on newly created objects, which is suboptimal but safe (and in practice unlikely to make a noticeable performance difference). 
I just disabled IR checks for the two affected tests on s390 by now (commit [956e0ac](https://github.com/openjdk/jdk/commit/956e0ac5a7d580ad0e8850cfc4497da77cdb525c)). The issue is likely due to limitations in the pattern matching logic of barrier elision, but I do not have the proper means to debug it on s390. If you find a solution before this changeset is fully reviewed, feel free to propose a patch and I will merge it into the changeset. Otherwise, it can always be done as follow-up work. Hope this works for you! Thanks @robcasloz. Yes sure, that's works totally for us. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23235#issuecomment-2652551761 From amitkumar at openjdk.org Wed Feb 12 03:48:09 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Wed, 12 Feb 2025 03:48:09 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash In-Reply-To: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Tue, 11 Feb 2025 07:59:01 GMT, Dean Long wrote: > When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. > > In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. > > Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. 
I need help testing those asserts for arm32, ppc, riscv, and s390. Hi @dean-long, I got build failure on s390: # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/home/amit/jdk/src/hotspot/cpu/s390/abstractInterpreter_s390.cpp:190), pid=3885713, tid=3885721 # assert(l2 >= locals_base) failed: bad placement # # JRE version: OpenJDK Runtime Environment (25.0) (fastdebug build 25-internal-adhoc.amit.jdk) # Java VM: OpenJDK 64-Bit Server VM (fastdebug 25-internal-adhoc.amit.jdk, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-s390x) # Problematic frame: # V [libjvm.so+0x22c07a] AbstractInterpreter::layout_activation(Method*, int, int, int, int, int, int, frame*, frame*, bool, bool)+0x4b2 # ------------- PR Comment: https://git.openjdk.org/jdk/pull/23557#issuecomment-2652591361 From jkarthikeyan at openjdk.org Wed Feb 12 05:52:51 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 12 Feb 2025 05:52:51 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases Message-ID: Hi all, This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. 
I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. Reviews would be appreciated! ------------- Commit messages: - Fix CountLeadingZerosV miscompile on AVX2 Changes: https://git.openjdk.org/jdk/pull/23579/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23579&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349637 Stats: 49 lines in 2 files changed: 48 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23579.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23579/head:pull/23579 PR: https://git.openjdk.org/jdk/pull/23579 From darcy at openjdk.org Wed Feb 12 06:04:21 2025 From: darcy at openjdk.org (Joe Darcy) Date: Wed, 12 Feb 2025 06:04:21 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 05:47:52 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. > > This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. > > Reviews would be appreciated! 
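The rounding described in the quoted PR text can be reproduced in plain Java. The following is a minimal scalar sketch, not the AVX2 code from the patch: the class and method names (`FloatLzcntDemo`, `lzcntViaFloat`) are made up for illustration, and the exponent-based count is only meant for positive inputs.

```java
final class FloatLzcntDemo {
    // Scalar model of the exponent trick: convert to float, read the unbiased
    // exponent, and derive the leading-zero count from it.
    // Illustration only; assumes x > 0.
    static int lzcntViaFloat(int x) {
        int exponent = (Float.floatToRawIntBits((float) x) >>> 23) - 127;
        return 31 - exponent;
    }

    public static void main(String[] args) {
        int below = 0x01FFFFFF; // 2^25 - 1, needs 25 significant bits
        int above = 0x02000000; // 2^25
        // Both ints convert to the same float, because a float carries only 24 bits.
        System.out.println("same float: " + ((float) below == (float) above));
        System.out.println("below: library=" + Integer.numberOfLeadingZeros(below)
                + " viaFloat=" + lzcntViaFloat(below));
        System.out.println("above: library=" + Integer.numberOfLeadingZeros(above)
                + " viaFloat=" + lzcntViaFloat(above));
    }
}
```

For `0x01FFFFFF` the library result and the exponent-based result differ by one, which is the off-by-one the PR text describes.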
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6237: > 6235: vpsrld(xtmp1, xtmp1, 24, vec_enc); > 6236: > 6237: // As 2^24 is the largest possible integer that can be exactly represented by a float value, special handling has to be More exactly, [-2^24, 2^24] is the range in which all contiguous integers can be represented exactly as floats. All sufficiently large finite floating-point numbers are integers, but beyond that range the distance between adjacent floating-point numbers is more than 1. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952028450 From darcy at openjdk.org Wed Feb 12 06:07:14 2025 From: darcy at openjdk.org (Joe Darcy) Date: Wed, 12 Feb 2025 06:07:14 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 05:47:52 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. > > This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. > > Reviews would be appreciated!
test/hotspot/jtreg/compiler/vectorization/TestNumberOfContinuousZeros.java line 1: > 1: /* As a one-off test, in other words not a test to be checked in and run continuously, it would be reassuring to test all the int values to make sure the intrinsic computes the same result as the Java library version. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952030543 From qamai at openjdk.org Wed Feb 12 06:36:21 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 12 Feb 2025 06:36:21 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v22] In-Reply-To: References: Message-ID: On Sat, 8 Feb 2025 18:30:56 GMT, Matthias Ernst wrote:
>> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization.
>>
>> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`:
>>
>>
>> (base + (index + 1) << 8) & 255
>> => MulNode
>> (base + (index << 8 + 256)) & 255
>> => AddNode
>> ((base + index << 8) + 256) & 255
>>
>>
>> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction:
>>
>>
>> ((base + index << 8) + 256) & 255
>> => MulNode (this PR)
>> (base + index << 8) & 255
>> => MulNode (PR #6697)
>> base & 255 (loop invariant)
>>
>>
>> Implementation notes:
>> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR.
>> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~
>> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~
Let me know if would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > Reword correctness (fixes). src/hotspot/share/opto/mulnode.cpp line 2113: > 2111: // > 2112: // expr % mod == 0 (multiple of power of two) > 2113: // => (a + expr) % mod == a % mod (zero element in modular arithmetic) The use of the `%` operator here is pretty misleading, what we want is the unsigned remainder (a.k.a the floor remainder or the mathematical congruent notation). You also need the modular of the int or long arithmetic to be divisible by `mod` for this to work. I suggest writing it in pure mathematical form like this. Better formatting is encouraged: This section is concerned with pure mathematics values, not programming arithmetic values. For unsigned integers x, y; let's denote x mod y be the unsigned remainder of x when divided by y. It satisfies: - (x mod y) < y - There exists integer q such that x == q * y + (x mod y) According to the basic properties of division, this value is unique We then have: 1. a mod 2**w == (a mod 2**W) mod 2**w Proof: Call a mod 2**w = r1 and a mod 2**W = r2, we have: a == q1 * 2**w + r1 a == q2 * 2**W + r2 -> q1 * 2**w + r1 == q2 * 2**W + r2 -> r2 == (q1 - q2 * 2**(W - w)) * 2**w + r1 -> r2 mod 2**w == r1 -> (a mod 2**W) mod 2**w == a mod 2**w (qed) 2. expr mod 2**w == 0 -> ((a + expr) mod 2**W) mod 2**w == a mod 2**w Proof: Since expr mod 2**w == 0, we have expr == q1 * 2**w Call a mod 2**w == r, we have a == q2 * 2**w + r We have a + expr == q2 * 2**w + r + q1 * 2**w == (q2 + q1) * 2**w + r Which means that (a + expr) mod 2**w == a mod 2**w Furthermore, since (a + expr) mod 2**w == ((a + expr) mod 2**W) mod 2**w (according to 1) We have ((a + expr) mod 2**W) mod 2**w == a mod 2**w (qed) Back to the programming arithmetic domain, ((a + expr) mod 2**W) is the sum of a and expr in the int/long domain, a mod 2**w == a & (2**w - 1). 
This leads to us having: (a + expr) & (2**w - 1) == a & (2**w - 1) Furthermore, mask & (2**w - 1) == mask (a + expr) & (2**w - 1) & mask == a & (2**w - 1) & mask -> (a + expr) & mask == a & mask (qed) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22856#discussion_r1952057285 From qamai at openjdk.org Wed Feb 12 06:36:21 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 12 Feb 2025 06:36:21 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v22] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 06:31:08 GMT, Quan Anh Mai wrote: >> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: >> >> Reword correctness (fixes). > > src/hotspot/share/opto/mulnode.cpp line 2113: > >> 2111: // >> 2112: // expr % mod == 0 (multiple of power of two) >> 2113: // => (a + expr) % mod == a % mod (zero element in modular arithmetic) > > The use of the `%` operator here is pretty misleading, what we want is the unsigned remainder (a.k.a the floor remainder or the mathematical congruent notation). You also need the modular of the int or long arithmetic to be divisible by `mod` for this to work. I suggest writing it in pure mathematical form like this. Better formatting is encouraged: > > This section is concerned with pure mathematics values, not programming arithmetic values. For unsigned integers x, y; let's denote x mod y be the unsigned remainder of x when divided by y. It satisfies: > - (x mod y) < y > - There exists integer q such that x == q * y + (x mod y) > According to the basic properties of division, this value is unique > We then have: > 1. a mod 2**w == (a mod 2**W) mod 2**w > Proof: Call a mod 2**w = r1 and a mod 2**W = r2, we have: > a == q1 * 2**w + r1 > a == q2 * 2**W + r2 > -> q1 * 2**w + r1 == q2 * 2**W + r2 > -> r2 == (q1 - q2 * 2**(W - w)) * 2**w + r1 > -> r2 mod 2**w == r1 > -> (a mod 2**W) mod 2**w == a mod 2**w (qed) > 2. 
expr mod 2**w == 0 -> ((a + expr) mod 2**W) mod 2**w == a mod 2**w > Proof: Since expr mod 2**w == 0, we have expr == q1 * 2**w > Call a mod 2**w == r, we have a == q2 * 2**w + r > We have a + expr == q2 * 2**w + r + q1 * 2**w == (q2 + q1) * 2**w + r > Which means that (a + expr) mod 2**w == a mod 2**w > Furthermore, since (a + expr) mod 2**w == ((a + expr) mod 2**W) mod 2**w (according to 1) > We have ((a + expr) mod 2**W) mod 2**w == a mod 2**w (qed) > > Back to the programming arithmetic domain, ((a + expr) mod 2**W) is the sum of a and expr in the int/long domain, a mod 2**w == a & (2**w - 1). This leads to us having: > (a + expr) & (2**w - 1) == a & (2**w - 1) > Furthermore, mask & (2**w - 1) == mask > (a + expr) & (2**w - 1) & mask == a & (2**w - 1) & mask > -> (a + expr) & mask == a & mask (qed) Jumping from the mathematical world to the programming arithmetic world is always tricky and we should always be cautious, in this case, the use of `a + expr` needs extra cautions because it is really `(a + expr) mod 2**W` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22856#discussion_r1952059259 From qamai at openjdk.org Wed Feb 12 06:49:10 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 12 Feb 2025 06:49:10 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases In-Reply-To: References: Message-ID: <8UZrKXluhEQfQK1rRu3OFGqmiEINjKZ1TQcYaRqRLSU=.7f131b8f-b715-433a-b46d-7e327360daa1@github.com> On Wed, 12 Feb 2025 05:47:52 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. 
The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. > > This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. > > Reviews would be appreciated! src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6242: > 6240: vpxor(xtmp2, xtmp2, xtmp2, vec_enc); > 6241: vpsrld(xtmp2, src, 24, vec_enc); > 6242: vpandn(src, xtmp2, src, vec_enc); Are you sure this will work? I don't see how `x &~ (x >> 24)` can zero out all lower bits. Also, we should not kill `src` here. I think a better solution is to do: `x > 0xFF ? x &~ 0xFF : x`. I chose this value because we have `0xFF` in `xtmp1` here. 
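The `x > 0xFF ? x &~ 0xFF : x` idea can be sanity-checked with a scalar model. This is a sketch under stated assumptions rather than the vector code under review: `MaskedLzcntSketch` and its methods are hypothetical names, inputs are restricted to positive ints, and only a handful of boundary patterns per bit position are checked rather than the full int range.

```java
final class MaskedLzcntSketch {
    // Exponent-based leading-zero count for x > 0 (scalar model, illustration only).
    static int lzcntViaFloat(int x) {
        return 31 - ((Float.floatToRawIntBits((float) x) >>> 23) - 127);
    }

    // Clear the low 8 bits when any higher bit is set. For a positive int at most
    // 23 significant bits remain, so the int-to-float conversion is exact and the
    // exponent still reflects the highest set bit.
    static int lzcntMasked(int x) {
        int masked = (x > 0xFF) ? (x & ~0xFF) : x;
        return lzcntViaFloat(masked);
    }

    // Counts positive boundary inputs where the masked variant disagrees with
    // Integer.numberOfLeadingZeros.
    static int countMismatches() {
        int mismatches = 0;
        for (int bit = 0; bit <= 30; bit++) {
            int pow = 1 << bit;
            int[] candidates = { pow, pow | (pow - 1), pow | (pow >>> 1), pow + 1 };
            for (int x : candidates) {
                if (x > 0 && lzcntMasked(x) != Integer.numberOfLeadingZeros(x)) {
                    mismatches++;
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("unmasked 0x01FFFFFF: " + lzcntViaFloat(0x01FFFFFF));
        System.out.println("masked   0x01FFFFFF: " + lzcntMasked(0x01FFFFFF));
        System.out.println("mismatches on boundary inputs: " + countMismatches());
    }
}
```

Clearing only the low 8 bits is enough because a 32-bit value then has at most 24 significant bits left, which matches the precision of a float.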
test/hotspot/jtreg/compiler/vectorization/TestNumberOfContinuousZeros.java line 62: > 60: inputInt = new int[LEN]; > 61: outputInt = new int[LEN]; > 62: rng = new Random(); Please use `Utils.getRandomInstance()` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952070033 PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952070879 From fyang at openjdk.org Wed Feb 12 06:59:14 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 12 Feb 2025 06:59:14 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash In-Reply-To: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Tue, 11 Feb 2025 07:59:01 GMT, Dean Long wrote: > When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. > > In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. > > Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. I need help testing those asserts for arm32, ppc, riscv, and s390. FYI: `test/hotspot/jtreg/compiler/jsr292/MHDeoptTest.java` and hs-tier1 both pass on linux-riscv64 with a fastdebug build.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23557#issuecomment-2652818321 From fyang at openjdk.org Wed Feb 12 07:20:11 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 12 Feb 2025 07:20:11 GMT Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic In-Reply-To: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> Message-ID: <1Es7J69UDLKVT0FLpdKSTw8XZqmPH1QmJDlxQwlurwQ=.967352fd-6824-428c-b533-027bc352290c@github.com> On Tue, 11 Feb 2025 03:15:47 GMT, Gui Cao wrote: > Follow this patch https://github.com/openjdk/jdk/pull/22491, RISC-V implementation for Class.isInstance intrinsic. > > > ### JMH numbers (tested on milkv megrez with hotspot client build): > > #### before this patch:
>
> Benchmark                             Mode  Cnt    Score   Error  Units
> SecondarySupersLookup.testNegative00  avgt   15   48.589 ± 0.981  ns/op
> SecondarySupersLookup.testNegative01  avgt   15   48.577 ± 0.297  ns/op
> SecondarySupersLookup.testNegative02  avgt   15   48.760 ± 0.740  ns/op
> SecondarySupersLookup.testNegative03  avgt   15   48.442 ± 0.029  ns/op
> SecondarySupersLookup.testNegative04  avgt   15   48.453 ± 0.095  ns/op
> SecondarySupersLookup.testNegative05  avgt   15   48.435 ± 0.025  ns/op
> SecondarySupersLookup.testNegative06  avgt   15   48.540 ± 0.476  ns/op
> SecondarySupersLookup.testNegative07  avgt   15   48.452 ± 0.032  ns/op
> SecondarySupersLookup.testNegative08  avgt   15   48.466 ± 0.034  ns/op
> SecondarySupersLookup.testNegative09  avgt   15   48.478 ± 0.132  ns/op
> SecondarySupersLookup.testNegative10  avgt   15   48.435 ± 0.032  ns/op
> SecondarySupersLookup.testNegative16  avgt   15   48.440 ± 0.027  ns/op
> SecondarySupersLookup.testNegative20  avgt   15   47.977 ± 0.989  ns/op
> SecondarySupersLookup.testNegative30  avgt   15   48.655 ± 0.487  ns/op
> SecondarySupersLookup.testNegative32  avgt   15   48.566 ± 0.251  ns/op
> SecondarySupersLookup.testNegative40  avgt   15   48.513 ± 0.196  ns/op
> SecondarySupersLookup.testNegative50  avgt   15   48.454 ± 0.075  ns/op
> SecondarySupersLookup.testNegative55  avgt   15   71.670 ± 1.632  ns/op
> SecondarySupersLookup.testNegative56  avgt   15   70.923 ± 1.679  ns/op
> SecondarySupersLookup.testNegative57  avgt   15   70.140 ± 0.048  ns/op
> SecondarySupersLookup.testNegative58  avgt   15   70.473 ± 0.726  ns/op
> SecondarySupersLookup.testNegative59  avgt   15   70.127 ± 0.022  ns/op
> SecondarySupersLookup.testNegative60  avgt   15   82.525 ± 1.178  ns/op
> SecondarySupersLookup.testNegative61  avgt   15   81.647 ± 0.758  ns/op
> SecondarySupersLookup.testNegative62  avgt   15   82.347 ± 1.943  ns/op
> SecondarySupersLookup.testNegative63  avgt   15  129.188 ± 1.550  ns/op
> SecondarySupersLookup.testNegative64  avgt   15  130.274 ± 1.668  ns/op
> SecondarySupersLookup.testPositive01  avgt   15   63.390 ± 0.222  ns/op
> SecondarySupersLookup.testPositive02  avgt   15   63.435 ± 0.027  ns/op
> SecondarySupersLookup.testPositive03  avgt   15   63.469 ± 0.080  ns/op
> SecondarySupersLookup.testPositive04  avgt   15   63.896 ...

Looks fine to me. Thanks. ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23551#pullrequestreview-2611013226 From aph at openjdk.org Wed Feb 12 09:01:35 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 12 Feb 2025 09:01:35 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:20:54 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2.
For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. > > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below -
>
> Benchmark                                   (size)   Mode  Cnt   Gain
> SelectFromBenchmark.selectFromByteVector      1024  thrpt    9   1.43
> SelectFromBenchmark.selectFromByteVector      2048  thrpt    9   1.48
> SelectFromBenchmark.selectFromDoubleVector    1024  thrpt    9  68.55
> SelectFromBenchmark.selectFromDoubleVector    2048  thrpt    9  72.07
> SelectFromBenchmark.selectFromFloatVector     1024  thrpt    9   1.69
> SelectFromBenchmark.selectFromFloatVector     2048  thrpt    9   1.52
> SelectFromBenchmark.selectFromIntVector       1024  thrpt    9   1.50
> SelectFromBenchmark.selectFromIntVector       2048  thrpt    9   1.52
> SelectFromBenchmark.selectFromLongVector      1024  thrpt    9  85.38
> SelectFromBenchmark.selectFromLongVector      2048  thrpt    9  80.93
> SelectFromBenchmark.selectFromShortVector     1024  thrpt    9   1.48
> SelectFromBenchmark.selectFromShortVector     2048  thrpt    9   1.49
>
> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander.

src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 4955:
> 4953: match(Set dst (SelectFromTwoVector (Binary index src1) src2));
> 4954: effect(TEMP_DEF dst, TEMP tmp1, TEMP tmp2);
> 4955: format %{ "vselect_from_two_vectors_SIF $dst, $src1, $src2, $index\t# vector (4S/8S/2I/4I/2F/4F). KILL $tmp1, $tmp2" %}

Be careful here. `select_from_two_vectors_SIFNeon` seems to alter `src1`, so you need a `USE_KILL` effect.
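The fallback mentioned in the quoted PR description (two rearranges plus one blend) can be modeled on plain arrays. This is a scalar sketch of the semantics as I understand them, not the C2 lowering itself: the class and method names are hypothetical, indices are assumed to be already in the range `[0, 2 * lanes)`, and the lane count is assumed to be a power of two.

```java
final class SelectFromTwoSketch {
    // Direct definition: each index lane picks from the concatenation of v1 and v2.
    static int[] selectFromTwo(int[] index, int[] v1, int[] v2) {
        int lanes = v1.length;
        int[] out = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            int idx = index[i];
            out[i] = idx < lanes ? v1[idx] : v2[idx - lanes];
        }
        return out;
    }

    // Rearrange a single vector: out[i] = v[index[i] mod lanes].
    static int[] rearrange(int[] index, int[] v) {
        int lanes = v.length;
        int[] out = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            out[i] = v[index[i] & (lanes - 1)];
        }
        return out;
    }

    // Fallback shape: two rearranges, then one blend keyed on index >= lanes.
    static int[] selectFromTwoLowered(int[] index, int[] v1, int[] v2) {
        int lanes = v1.length;
        int[] r1 = rearrange(index, v1);
        int[] r2 = rearrange(index, v2);
        int[] out = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            out[i] = index[i] >= lanes ? r2[i] : r1[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        int[] index = {0, 5, 2, 7};
        System.out.println(java.util.Arrays.toString(selectFromTwo(index, v1, v2)));
        System.out.println(java.util.Arrays.toString(selectFromTwoLowered(index, v1, v2)));
    }
}
```

Both methods produce the same lanes for in-range indices, which is why a single two-table lookup can stand in for the three-operation fallback.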
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1952234760 From jbhateja at openjdk.org Wed Feb 12 09:11:15 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Feb 2025 09:11:15 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 05:47:52 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculate the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. > > This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to `TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. > > Reviews would be appreciated! src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6238: > 6236: > 6237: // As 2^24 is the largest possible integer that can be exactly represented by a float value, special handling has to be > 6238: // done to avoid losing precision by potentially rounding up.
To avoid that, we construct a mask to remove low set bits IEEE single precision floating point format has a fixed precision of 24 bits; to accommodate a number that needs more precision, the exponent is incremented accordingly. This loses precision, but gives a floating point number a larger dynamic range than a fixed point integral format. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6241: > 6239: // when the number has the upper 8 bits set. This is a valid transformation as it only removes low bits, and keeps the high bits intact. > 6240: vpxor(xtmp2, xtmp2, xtmp2, vec_enc); > 6241: vpsrld(xtmp2, src, 24, vec_enc); There is an output dependency here since both instructions have the same destination operand. I agree with @merykitty's suggestion to chop off the lower 8 bits to neglect any side-effects in the exponent on account of precision, e.g. jshell> (Float.floatToRawIntBits((float)(0x01FFFFFE)) >>> 23) - 127 $68 ==> 24 jshell> (Float.floatToRawIntBits((float)(0x01FFFFFF)) >>> 23) - 127 $69 ==> 25 In both of the above cases we should return 24. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6242: > 6240: vpxor(xtmp2, xtmp2, xtmp2, vec_enc); > 6241: vpsrld(xtmp2, src, 24, vec_enc); > 6242: vpandn(src, xtmp2, src, vec_enc); We should not be modifying 'src' as it may have a later use. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6243: > 6241: vpsrld(xtmp2, src, 24, vec_enc); > 6242: vpandn(src, xtmp2, src, vec_enc); > 6243: There are two occurrences of vblendvps in this sequence; it would be worth using the following version of blend, which emulates it using a cheaper instruction sequence and shows better performance on E-core (AVX2) targets. https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L3553 We could do it separately and not as a part of this bug-fix PR.
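To make the rounding problem discussed in this review concrete, here is a minimal scalar sketch in Java. It is an illustration only, not the HotSpot AVX2 code: the class and method names are hypothetical, and `maskedLzcnt` merely mimics, one lane at a time, the `vpsrld` / `vpandn` pair quoted above. It assumes strictly positive inputs.

```java
// Scalar illustration only; the class and method names are hypothetical
// and this is not the HotSpot AVX2 code under review.
class LzcntViaFloat {
    // Unbiased exponent of (float) x, computed as in the jshell session above.
    static int floatExponent(int x) {
        return (Float.floatToRawIntBits((float) x) >>> 23) - 127;
    }

    // Naive count for x > 0: wrong whenever the int-to-float conversion
    // rounds up across a power-of-two boundary.
    static int naiveLzcnt(int x) {
        return 31 - floatExponent(x);
    }

    // Scalar analogue of vpsrld(tmp, src, 24) followed by vpandn(src, tmp, src):
    // clear the low bits selected by (x >>> 24) before converting. Valid for x > 0.
    static int maskedLzcnt(int x) {
        int masked = x & ~(x >>> 24);
        return 31 - floatExponent(masked);
    }

    public static void main(String[] args) {
        // Same values as the SPECIAL array in TestNumberOfContinuousZeros.
        int[] special = { 0x01FFFFFF, 0x03FFFFFE, 0x07FFFFFC, 0x0FFFFFF8, 0x1FFFFFF0, 0x3FFFFFE0 };
        for (int x : special) {
            System.out.printf("0x%08X naive=%d masked=%d expected=%d%n",
                    x, naiveLzcnt(x), maskedLzcnt(x), Integer.numberOfLeadingZeros(x));
        }
    }
}
```

The mask works in this sketch because, for an input whose highest set bit is at position p >= 24, `x >>> 24` has bit p - 24 set, which is the first bit below the 24-bit significand; once that bit is cleared, the conversion can no longer round up to the next power of two.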
test/hotspot/jtreg/compiler/vectorization/TestNumberOfContinuousZeros.java line 45: > 43: > 44: public class TestNumberOfContinuousZeros { > 45: private static final int[] SPECIAL = { 0x01FFFFFF, 0x03FFFFFE, 0x07FFFFFC, 0x0FFFFFF8, 0x1FFFFFF0, 0x3FFFFFE0 }; Can you also check for 0xFFFFFFFF, even though we have special handling for -ve signed numbers not affected by this patch. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952222539 PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952239292 PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952217092 PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952232091 PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952249864 From jbhateja at openjdk.org Wed Feb 12 09:13:17 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Feb 2025 09:13:17 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: <1xQeG8IO8aJNUluyWTaz9cm2xmTKSNsZJMNhnicnm5s=.304de8b6-9bba-44db-9982-eddaf950a415@github.com> On Mon, 10 Feb 2025 21:23:28 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Fixing typos > > An impressive and substantial change. I focused on the Java code, there are some small tweaks, presented in comments, we can make to the intrinsics to improve the expression of code, and it has no impact on the intrinsic implementation. Hi @PaulSandoz , Your comments have been addressed. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2653071755 From mli at openjdk.org Wed Feb 12 09:56:22 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 12 Feb 2025 09:56:22 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI Message-ID: Hi, Can you help to review this patch to implement MulReductionVI/MulReductionVL/MulReductionVF/MulReductionVD? This optimization is mainly for the vector API. On riscv, there are no straightforward instructions to do it, but we can do it with a reduction tree, which could reduce the time complexity to lg(N). Thanks ## Test ### jtreg test/jdk/jdk/incubator/vector/ ### Performance master vs patch

Benchmark | (size) | Mode | Cnt | Score - master | Error - master | Score - patch | Error - patch | Units | Improvement
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
ByteMaxVector.MULLanes | 1024 | avgt | 10 | 11170.052 | 499.676 | 1294.424 | 8.346 | ns/op | 88.40%
ByteMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 12514.587 | 55.06 | 1413.028 | 0.259 | ns/op | 88.70%
DoubleMaxVector.MULLanes | 1024 | avgt | 10 | 57672.51 | 1750.417 | 4775.633 | 1.454 | ns/op | 91.70%
DoubleMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63656.523 | 1063.048 | 5899.259 | 1.692 | ns/op | 90.70%
FloatMaxVector.MULLanes | 1024 | avgt | 10 | 30997.218 | 728.73 | 2473.069 | 5.84 | ns/op | 92.00%
FloatMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 35515.329 | 227.873 | 3284.17 | 0.608 | ns/op | 90.80%
IntMaxVector.MULLanes | 1024 | avgt | 10 | 31130.453 | 878.261 | 3304.118 | 5.96 | ns/op | 89.40%
IntMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 36851.976 | 394.001 | 3969.407 | 0.511 | ns/op | 89.20%
LongMaxVector.MULLanes | 1024 | avgt | 10 | 58795.752 | 1030.985 | 6883.995 | 3.15 | ns/op | 88.30%
LongMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63642.904 | 386.521 | 7892.735 | 9.359 | ns/op | 87.60%
ShortMaxVector.MULLanes | 1024 | avgt | 10 | 16857.441 | 762.428 | 2287.141 | 0.186 | ns/op | 86.40%
ShortMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 21171.375 | 74.684 | 2532.913 | 0.274 | ns/op | 88.00%

------------- Commit messages: - fix masked; clean - minors - Merge branch 'master' into mul-reduction-vx - add types, fix tests - enable it - Merge branch 'master' into mul-reduction-vx - merge master - fix issues - Initial commit Changes: https://git.openjdk.org/jdk/pull/23580/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23580&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8321003 Stats: 249 lines in 8 files changed: 246 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23580.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23580/head:pull/23580 PR: https://git.openjdk.org/jdk/pull/23580 From mli at openjdk.org Wed Feb 12 10:02:52 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 12 Feb 2025 10:02:52 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v2] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch to implement MulReductionVI/MulReductionVL/MulReductionVF/MulReductionVD? > This optimization is mainly for the vector API. > On riscv, there are no straightforward instructions to do it, but we can do it with a reduction tree, which could reduce the time complexity to lg(N).
> > > Thanks > > ## Test > > ### jtreg > test/jdk/jdk/incubator/vector/ > > ### Performance > > master vs patch > > Benchmark | (size) | Mode | Cnt | Score - master | Error - patch | Score - patch | Error - patch | Units | Improvement > -- | -- | -- | -- | -- | -- | -- | -- | -- | -- > ByteMaxVector.MULLanes | 1024 | avgt | 10 | 11170.052 | 499.676 | 1294.424 | 8.346 | ns/op | 88.40% > ByteMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 12514.587 | 55.06 | 1413.028 | 0.259 | ns/op | 88.70% > DoubleMaxVector.MULLanes | 1024 | avgt | 10 | 57672.51 | 1750.417 | 4775.633 | 1.454 | ns/op | 91.70% > DoubleMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63656.523 | 1063.048 | 5899.259 | 1.692 | ns/op | 90.70% > FloatMaxVector.MULLanes | 1024 | avgt | 10 | 30997.218 | 728.73 | 2473.069 | 5.84 | ns/op | 92.00% > FloatMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 35515.329 | 227.873 | 3284.17 | 0.608 | ns/op | 90.80% > IntMaxVector.MULLanes | 1024 | avgt | 10 | 31130.453 | 878.261 | 3304.118 | 5.96 | ns/op | 89.40% > IntMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 36851.976 | 394.001 | 3969.407 | 0.511 | ns/op | 89.20% > LongMaxVector.MULLanes | 1024 | avgt | 10 | 58795.752 | 1030.985 | 6883.995 | 3.15 | ns/op | 88.30% > LongMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63642.904 | 386.521 | 7892.735 | 9.359 | ns/op | 87.60% > ShortMaxVector.MULLanes | 1024 | avgt | 10 | 16857.441 | 762.428 | 2287.141 | 0.186 | ns/op | 86.40% > ShortMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 21171.375 | 74.684 | 2532.913 | 0.274 | ns/op | 88.00% > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23580/files - new: https://git.openjdk.org/jdk/pull/23580/files/1af9ed89..5d460b1c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23580&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23580&range=00-01 Stats: 6 lines in 1 file changed: 6 ins; 0 del; 0 
mod Patch: https://git.openjdk.org/jdk/pull/23580.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23580/head:pull/23580 PR: https://git.openjdk.org/jdk/pull/23580 From epeter at openjdk.org Wed Feb 12 11:10:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 12 Feb 2025 11:10:17 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v24] In-Reply-To: References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> Message-ID: On Fri, 7 Feb 2025 14:09:31 GMT, Jatin Bhateja wrote: >> Patch promotes the sharing of commutative vector IR with the same inputs but different input ordering. >> Similar to scalar IR where we perform edge swapping by [sorting inputs](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L122) based on node indices during IR idealization. >> >> Following are the performance stats for JMH micro included with the patch. >> >> >> Granite Rapids (P-core Xeon Server) >> Baseline : >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 8982.549 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 6072.773 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2368.856 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 15215.087 ops/ms >> >> Withopt: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 11963.554 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 7036.088 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2906.731 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 17148.131 ops/ms >> >> Sierra Forest (E-core Xeon Server) >> 
Baseline: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 2444.359 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 1710.256 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 308.766 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 3902.179 ops/ms >> >> Withopt: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.com... > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review resolutions src/hotspot/share/opto/vectornode.cpp line 1086: > 1084: // increasing order of node indices. > 1085: if (in(1)->_idx > in(2)->_idx) { > 1086: return true; Ah, I see you now removed the condition above: // Must be a binary operation. if (req() != 3) { return false; } That's probably correct. But can we still have an assert somehow that `req() == 3`, please ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1952441011 From qamai at openjdk.org Wed Feb 12 11:21:14 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 12 Feb 2025 11:21:14 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v26] In-Reply-To: <6sWBRolcCZXOe1pXDSfyBUvtfEuzV1MdMXUVpji42_4=.6c5f7da0-ee1d-485f-a99d-d8c520002dbd@github.com> References: <6sWBRolcCZXOe1pXDSfyBUvtfEuzV1MdMXUVpji42_4=.6c5f7da0-ee1d-485f-a99d-d8c520002dbd@github.com> Message-ID: On Sat, 1 Feb 2025 19:27:01 GMT, Johannes Graham wrote: >> C2 does not eliminate XOR nodes with constant arguments. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. >> >> This patch demonstrates a potential fix to the problem, but there might well be better ways to do it. 
> > Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: > > add IR tests for long, simplify tests for int Otherwise LGTM, very nice tests, thanks very much! src/hotspot/share/opto/addnode.cpp line 1028: > 1026: if (r0->is_con() && r1->is_con()) { > 1027: // Constant fold: (c1 ^ c2) -> c3 > 1028: return TypeInt::make( r0->get_con() ^ r1->get_con() ); Please fix the formatting here. src/hotspot/share/opto/addnode.cpp line 1056: > 1054: if (r0->is_con() && r1->is_con()) { > 1055: // Constant fold: (c1 ^ c2) -> c3 > 1056: return TypeLong::make( r0->get_con() ^ r1->get_con() ); Please fix the formatting here, too. test/hotspot/jtreg/compiler/c2/irTests/XorINodeIdealizationTests.java line 313: > 311: } > 312: > 313: /* Please remove this commented-out code. ------------- Marked as reviewed by qamai (Committer). PR Review: https://git.openjdk.org/jdk/pull/23089#pullrequestreview-2611597506 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1952453683 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1952453982 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1952456217 From jbhateja at openjdk.org Wed Feb 12 11:40:09 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Feb 2025 11:40:09 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 09:08:14 GMT, Jatin Bhateja wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculate the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value.
The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to `TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > test/hotspot/jtreg/compiler/vectorization/TestNumberOfContinuousZeros.java line 45: > >> 43: >> 44: public class TestNumberOfContinuousZeros { >> 45: private static final int[] SPECIAL = { 0x01FFFFFF, 0x03FFFFFE, 0x07FFFFFC, 0x0FFFFFF8, 0x1FFFFFF0, 0x3FFFFFE0 }; > > Can you also check for 0xFFFFFFFF, even though we have special handling for -ve signed numbers not affected by this patch. Can you kindly write an exhaustive functional correctness test that covers the entire positive integral value range? The idea here is to test all the cases where an unintended exponent increment can lead to incorrect results. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952484248 From jbhateja at openjdk.org Wed Feb 12 11:59:02 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Feb 2025 11:59:02 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v25] In-Reply-To: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> Message-ID: > Patch promotes the sharing of commutative vector IR with the same inputs but different input ordering.
> Similar to scalar IR where we perform edge swapping by [sorting inputs](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L122) based on node indices during IR idealization. > > Following are the performance stats for JMH micro included with the patch. > > > Granite Rapids (P-core Xeon Server) > Baseline : > Benchmark (size) Mode Cnt Score Error Units > VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 8982.549 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 6072.773 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2368.856 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 15215.087 ops/ms > > Withopt: > Benchmark (size) Mode Cnt Score Error Units > VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 11963.554 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 7036.088 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2906.731 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 17148.131 ops/ms > > Sierra Forest (E-core Xeon Server) > Baseline: > Benchmark (size) Mode Cnt Score Error Units > VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 2444.359 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 1710.256 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 308.766 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 3902.179 ops/ms > > Withopt: > Benchmark (size) Mode Cnt Score Error Units > VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 3352.839 ... 
Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Safety assertion added ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22863/files - new: https://git.openjdk.org/jdk/pull/22863/files/fd39a429..316507cb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22863&range=24 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22863&range=23-24 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/22863.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22863/head:pull/22863 PR: https://git.openjdk.org/jdk/pull/22863 From mli at openjdk.org Wed Feb 12 12:04:16 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 12 Feb 2025 12:04:16 GMT Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic In-Reply-To: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> Message-ID: On Tue, 11 Feb 2025 03:15:47 GMT, Gui Cao wrote: > Follow this patch https://github.com/openjdk/jdk/pull/22491, RISC-V implementation for Class.isInstance intrinsic. > > > ### JMH numbers (tested on milkv megrez with hotspot client build): > > #### before this patch: > > Benchmark Mode Cnt Score Error Units > SecondarySupersLookup.testNegative00 avgt 15 48.589 ? 0.981 ns/op > SecondarySupersLookup.testNegative01 avgt 15 48.577 ? 0.297 ns/op > SecondarySupersLookup.testNegative02 avgt 15 48.760 ? 0.740 ns/op > SecondarySupersLookup.testNegative03 avgt 15 48.442 ? 0.029 ns/op > SecondarySupersLookup.testNegative04 avgt 15 48.453 ? 0.095 ns/op > SecondarySupersLookup.testNegative05 avgt 15 48.435 ? 0.025 ns/op > SecondarySupersLookup.testNegative06 avgt 15 48.540 ? 0.476 ns/op > SecondarySupersLookup.testNegative07 avgt 15 48.452 ? 0.032 ns/op > SecondarySupersLookup.testNegative08 avgt 15 48.466 ? 
0.034 ns/op > SecondarySupersLookup.testNegative09 avgt 15 48.478 ? 0.132 ns/op > SecondarySupersLookup.testNegative10 avgt 15 48.435 ? 0.032 ns/op > SecondarySupersLookup.testNegative16 avgt 15 48.440 ? 0.027 ns/op > SecondarySupersLookup.testNegative20 avgt 15 47.977 ? 0.989 ns/op > SecondarySupersLookup.testNegative30 avgt 15 48.655 ? 0.487 ns/op > SecondarySupersLookup.testNegative32 avgt 15 48.566 ? 0.251 ns/op > SecondarySupersLookup.testNegative40 avgt 15 48.513 ? 0.196 ns/op > SecondarySupersLookup.testNegative50 avgt 15 48.454 ? 0.075 ns/op > SecondarySupersLookup.testNegative55 avgt 15 71.670 ? 1.632 ns/op > SecondarySupersLookup.testNegative56 avgt 15 70.923 ? 1.679 ns/op > SecondarySupersLookup.testNegative57 avgt 15 70.140 ? 0.048 ns/op > SecondarySupersLookup.testNegative58 avgt 15 70.473 ? 0.726 ns/op > SecondarySupersLookup.testNegative59 avgt 15 70.127 ? 0.022 ns/op > SecondarySupersLookup.testNegative60 avgt 15 82.525 ? 1.178 ns/op > SecondarySupersLookup.testNegative61 avgt 15 81.647 ? 0.758 ns/op > SecondarySupersLookup.testNegative62 avgt 15 82.347 ? 1.943 ns/op > SecondarySupersLookup.testNegative63 avgt 15 129.188 ? 1.550 ns/op > SecondarySupersLookup.testNegative64 avgt 15 130.274 ? 1.668 ns/op > SecondarySupersLookup.testPositive01 avgt 15 63.390 ? 0.222 ns/op > SecondarySupersLookup.testPositive02 avgt 15 63.435 ? 0.027 ns/op > SecondarySupersLookup.testPositive03 avgt 15 63.469 ? 0.080 ns/op > SecondarySupersLookup.testPositive04 avgt 15 63.896 ... Some minor comments. src/hotspot/cpu/riscv/c1_Runtime1_riscv.cpp line 932: > 930: // Mirror: c_rarg0 > 931: // Object: c_rarg1 > 932: // Temps: x13, x14, x15, x16, x17 in this patch, maybe we should consistently use `c_rarg`n or `x1`n rather than mix these 2 types of register names? src/hotspot/cpu/riscv/c1_Runtime1_riscv.cpp line 953: > 951: __ ld(x17, Address(x17)); > 952: __ beq(klass, x17, success); > 953: __ mv(result, 0); This is also a failure case? 
Could it bring some benefit to jump to `fail` below or put the `fail` target here (i.e. remove the `fail` below)? src/hotspot/cpu/riscv/c1_Runtime1_riscv.cpp line 963: > 961: > 962: __ lookup_secondary_supers_table_var(obj, klass, result, x13, x14, x15, x17, &success); > 963: __ bind(fail); An extra empty line above this would help read code. ------------- PR Review: https://git.openjdk.org/jdk/pull/23551#pullrequestreview-2611700435 PR Review Comment: https://git.openjdk.org/jdk/pull/23551#discussion_r1952516809 PR Review Comment: https://git.openjdk.org/jdk/pull/23551#discussion_r1952516723 PR Review Comment: https://git.openjdk.org/jdk/pull/23551#discussion_r1952516965 From jbhateja at openjdk.org Wed Feb 12 12:10:13 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Feb 2025 12:10:13 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets [v2] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 17:41:34 GMT, Jatin Bhateja wrote: > > @jatin-bhateja Doing the transformation to `AndF` would be a more general solution and thus better. > > > Introducing another new IR "AndF" will again need changes in auto-vectorizer. > > > > > > But currently, `CopySign` and `MoveF2I` are not vectorized anyway so we can do the vectorization of `AndF` in a separate patch without much hassle. `AndF` is vectorized into existing `AndV` nicely so it is not a too complicated work. > > Yes, I have a follow-up patch to auto-vectorized CopySign. > > > > this patch does not break existing IR invariants > > > > > > Also, what invariant can be broken by transforming `AndI(MoveF2I(x), MoveF2I(y)` into `MoveF2I(AndF(x, y))`? > > Hi @merykitty , I meant that in the context of CopySign, targets emit efficient instruction sequences for existing IR (CopySignF/D), this patch simply tuned x86 backend implementation to improve performance. 
Also currently, logical And mask is a long value, in case we opt-in for new AndF/D node creation, to preserve the IR semantics we would also need to perform an integral to floating point constant conversion, this will incur additional memory load penalty since floating-point constants are emitted into the constant table before native method body. For the time being, taking CopySign intrinsic route looks reasonable. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23386#issuecomment-2653523697 From jbhateja at openjdk.org Wed Feb 12 12:20:18 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Feb 2025 12:20:18 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v18] In-Reply-To: <90MwDac7Q83dK8KDagHOst15xV-quGZKVE8n2tP9dsk=.351ed042-9a69-4186-b134-8c3cb6fef6cd@github.com> References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> <90MwDac7Q83dK8KDagHOst15xV-quGZKVE8n2tP9dsk=.351ed042-9a69-4186-b134-8c3cb6fef6cd@github.com> Message-ID: On Tue, 4 Feb 2025 19:20:05 GMT, Emanuel Peter wrote: >> Hi @eme64 , Kindly share the results of your test runs. > > @jatin-bhateja Tests look all good on my side. I'll make another pass in the next few days, and hopefully approve. 
Hi @eme64 , All comments addressed, looking forward to your approval ------------- PR Comment: https://git.openjdk.org/jdk/pull/22863#issuecomment-2653549254 From rrich at openjdk.org Wed Feb 12 12:38:10 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 12 Feb 2025 12:38:10 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash In-Reply-To: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Tue, 11 Feb 2025 07:59:01 GMT, Dean Long wrote: > When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. > > In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. > > Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. I need help testing those asserts for arm32, ppc, riscv, and s390. src/hotspot/cpu/ppc/abstractInterpreter_ppc.cpp line 136: > 134: // Test caller-aligned placement vs callee-aligned > 135: intptr_t* l2 = caller->sp() + method->max_locals() - 1 + (frame::java_abi_size / Interpreter::stackElementSize); > 136: assert(l2 >= locals_base, "bad placement"); The assertion at L136 fails on ppc64 (similar to what @offamitkumar reported for s390x). 
I don't understand the assertion because it is just a stricter version of the first one. On ppc64 the sp of `caller` is aligned down because it needs to be 16 byte aligned. `locals_base` is only 8 byte aligned. But from what I saw the difference was larger than just one word. Maybe `caller` has got a c2i extension? I guess this would be problematic. On x86_64 `l2` depends on the last expression stack pointer, not on the `caller`'s sp. If you try to translate this to ppc64 then you'll get the expression used to initialize `locals_base` at L128. I think you can remove the 2nd assertion. Even the first one looks redundant. Besides that, I've tested `MHDeoptTest.java` successfully on ppc64. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23557#discussion_r1952565563 From rrich at openjdk.org Wed Feb 12 12:47:10 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 12 Feb 2025 12:47:10 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: <00HHPN1Q9xrNf8Ps_9S7hOOHHmw2mNocFrQzqxzYhRA=.bb2f9c11-12c5-4efa-8314-4415e22e31f8@github.com> On Wed, 12 Feb 2025 12:35:07 GMT, Richard Reingruber wrote: > Maybe `caller` has got a c2i extension? I guess this would be problematic. I meant an i2c extension. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23557#discussion_r1952578551 From shade at openjdk.org Wed Feb 12 13:02:09 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Feb 2025 13:02:09 GMT Subject: RFR: 8349858: Print compilation task before blocking compiler thread for shutdown Message-ID: JIT compilers in current HotSpot are compiling the code while being in native state. So if there is a running compilation, it does not block shutdown naturally. The shutdown code has a cooperative mechanism to coordinate shutdown of compiler threads.
Shutdown code sets the `CompilerBroker::should_block`, and compilers are regularly checking it with `CompilerBroker::maybe_block`. When shutdown is pending, the running compiler threads would eventually hit that `maybe_block`, block at transition to VM state, and that would allow shutdown to proceed. One of the problems with this mechanism is observability: if compiler thread was running a long-running compilation, nothing would be written in the compilation logs about it. The compilation would just -- poof! -- disappear without a trace. This is arguably against the user expectation: we print _something_ whether the compilation succeeded or failed. This kind of shutdown-during-heavy-compilation regularly happens in short runs in Leyden benchmarks. It made me scratch my head for quite a while before I understood where the compilation task went. I would like to add some sort of diagnostics for these cases. Example `-XX:+PrintCompilation` output in Leyden after the patch (includes richer compile-task timings): ... 
430 W3.4 Q2.7 C0.3 4397 com.sun.tools.javac.comp.Check::checkProfile (40 bytes) 447 W0.0 Q0.0 C10.3 4398 java.util.StringJoiner::toString (53 bytes) 456 W0.0 Q10.4 C9.7 4399 java.lang.System$1::join (11 bytes) Generated source code for 51 classes and compiled them in 403 ms (1 iterations) 476 W36.6 Q11.6 C72.1 4393 com.sun.tools.javac.jvm.PoolWriter$WriteablePoolHelper::writeConstant (843 bytes) blocked 481 W0.0 Q0.0 C157.6 4390 com.sun.tools.javac.comp.TransTypes::visitIdent (129 bytes) blocked ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/23586/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23586&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349858 Stats: 14 lines in 1 file changed: 14 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23586.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23586/head:pull/23586 PR: https://git.openjdk.org/jdk/pull/23586 From bkilambi at openjdk.org Wed Feb 12 13:55:12 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 12 Feb 2025 13:55:12 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 03:02:30 GMT, Xiaohong Gong wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. 
>> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Good job @Bhavana-Kilambi ! Generally looks good to me. Just some minor issues that I have left the comments. Besides, could you please add some IR tests for this optimization? Thanks! Hi @XiaohongGong Thanks for your review comments :) I will get back soon with a new patchset addressing your comments. 
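For readers following the thread, the operation described in the quoted patch summary can be modelled in scalar Java. This is only a sketch under stated assumptions, not the backend code from the PR: the class and method names are hypothetical, plain `int[]` arrays stand in for vectors, and index lanes are assumed to wrap into the range [0, 2 * length). The second method mirrors the fallback shape mentioned in the summary (two rearranges followed by one blend), and both are expected to agree.

```java
import java.util.Arrays;

// Hypothetical scalar model of a "select from two vectors" operation.
public class SelectFromTwoVectorsModel {

    // Direct model: each index lane picks one element from the
    // concatenation [src1 | src2], similar in spirit to a two-table lookup.
    static int[] selectFrom(int[] index, int[] src1, int[] src2) {
        int n = src1.length;
        int[] dst = new int[n];
        for (int i = 0; i < n; i++) {
            // Assumption: out-of-range indices wrap into [0, 2n).
            int idx = Math.floorMod(index[i], 2 * n);
            dst[i] = idx < n ? src1[idx] : src2[idx - n];
        }
        return dst;
    }

    // Fallback shape: rearrange each source by (index mod n), then blend
    // the two rearranged results lane by lane.
    static int[] viaRearrangeAndBlend(int[] index, int[] src1, int[] src2) {
        int n = src1.length;
        int[] r1 = new int[n];
        int[] r2 = new int[n];
        int[] dst = new int[n];
        for (int i = 0; i < n; i++) {
            int lane = Math.floorMod(index[i], n);
            r1[i] = src1[lane]; // first rearrange
            r2[i] = src2[lane]; // second rearrange
        }
        for (int i = 0; i < n; i++) {
            // Blend mask: true when the wrapped index refers to src2.
            boolean fromSecond = Math.floorMod(index[i], 2 * n) >= n;
            dst[i] = fromSecond ? r2[i] : r1[i];
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] index = {0, 5, 2, 7};
        int[] src1 = {10, 11, 12, 13};
        int[] src2 = {20, 21, 22, 23};
        System.out.println(Arrays.toString(selectFrom(index, src1, src2)));
        System.out.println(Arrays.toString(viaRearrangeAndBlend(index, src1, src2)));
    }
}
```

With four lanes, indices 0 and 2 read from the first source and indices 5 and 7 read lanes 1 and 3 of the second source. The fallback does roughly three lane-wise passes where the direct form does one, which is consistent with the motivation given for a single table-lookup instruction.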
------------- PR Comment: https://git.openjdk.org/jdk/pull/23570#issuecomment-2653771107 From bkilambi at openjdk.org Wed Feb 12 13:55:13 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 12 Feb 2025 13:55:13 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 08:58:20 GMT, Andrew Haley wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
>> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 4955: > >> 4953: match(Set dst (SelectFromTwoVector (Binary index src1) src2)); >> 4954: effect(TEMP_DEF dst, TEMP tmp1, TEMP tmp2); >> 4955: format %{ "vselect_from_two_vectors_SIF $dst, $src1, $src2, $index\t# vector (4S/8S/2I/4I/2F/4F). KILL $tmp1, $tmp2" %} > > Be careful here. `select_from_two_vectors_SIFNeon` seems to alter `src1`, so you need a `USE_KILL` effect. @theRealAph Thanks for the suggestion! makes sense to add USE_KILL for the src1 usage here. I am getting into some errors when I do that. I'll resolve them and get back soon. Thanks! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1952687830 From psandoz at openjdk.org Wed Feb 12 14:49:27 2025 From: psandoz at openjdk.org (Paul Sandoz) Date: Wed, 12 Feb 2025 14:49:27 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 06:32:56 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review comments resolutions Looks good. I merged this PR with master, successfully (at the time) with no conflicts, and ran it through tier 1 to 3 testing and there were no failures. ------------- Marked as reviewed by psandoz (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22754#pullrequestreview-2612181239 From shade at openjdk.org Wed Feb 12 14:52:46 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Feb 2025 14:52:46 GMT Subject: RFR: 8349915: CTW: Lots of level 3 compiles are done at level 2 after JDK-8348570 Message-ID: Noticed this in manual CTW runs after [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) that lots and lots of methods are compiled at level 2 instead of requested level 3: ... [97] javax.enterprise.deploy.shared.ActionType::getValue() WARNING compilation level = 2, but not 3 [97] javax.enterprise.deploy.shared.ActionType::getOffset() WARNING compilation level = 2, but not 3 [97] javax.enterprise.deploy.shared.ActionType::getEnumValueTable() WARNING compilation level = 2, but not 3 [97] javax.enterprise.deploy.shared.ActionType::getStringTable() WARNING compilation level = 2, but not 3 [97] javax.enterprise.deploy.shared.ActionType::getActionType(int) WARNING compilation level = 2, but not 3 [97] javax.enterprise.deploy.shared.ActionType::toString() WARNING compilation level = 2, but not 3 [99] javax.enterprise.deploy.shared.DConfigBeanVersionType [98] javax.enterprise.deploy.shared.CommandType::toString() WARNING compilation level = 2, but not 3 [98] javax.enterprise.deploy.shared.CommandType::getOffset() WARNING compilation level = 2, but not 3 ... 
I narrowed it down to the level downgrade in compilation policy here: https://github.com/openjdk/jdk/blob/ed17c55ea34b3b6009dab11d64f21e0b7af3d701/src/hotspot/share/compiler/compilationPolicy.cpp#L677 [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) enters here, because we mark all methods as having profiles to extend the CTW scope. So now `is_method_profiled(max_method_h)` is `true` and downgrade happens. There is already a check for `!Arguments::is_compiler_only()` there, so I think we had better exclude CTW from this downgrade as well. I looked at possibly making this kind of downgrade fatal in the CTW runner, but the error propagation there is not simple. I filed [JDK-8349917](https://bugs.openjdk.org/browse/JDK-8349917) if anyone wants to take a stab at it. I looked at other `set_comp_level()` uses in Hotspot, and this is the only place where it is called. So I presume we have caught all places where this downgrade can happen. Additional testing: - [x] Linux x86_64 server fastdebug, eyeballing some manual CTW run results - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` passes ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/23589/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23589&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349915 Stats: 6 lines in 2 files changed: 5 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23589.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23589/head:pull/23589 PR: https://git.openjdk.org/jdk/pull/23589 From kvn at openjdk.org Wed Feb 12 15:51:15 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 15:51:15 GMT Subject: RFR: 8349858: Print compilation task before blocking compiler thread for shutdown In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 12:56:17 GMT, Aleksey Shipilev wrote: > JIT compilers in current Hotspot are compiling the code while being in native state.
So if there is a running compilation, it does not block shutdown naturally. The shutdown code has a cooperative mechanism to coordinate shutdown of compiler threads. Shutdown code sets `CompileBroker::should_block`, and compilers regularly check it with `CompileBroker::maybe_block`. When shutdown is pending, the running compiler threads would eventually hit that `maybe_block`, block at transition to VM state, and that would allow shutdown to proceed. > > One of the problems with this mechanism is observability: if a compiler thread was running a long-running compilation, nothing would be written in the compilation logs about it. The compilation would just -- poof! -- disappear without a trace. This is arguably against user expectations: we print _something_ whether the compilation succeeded or failed. > > This kind of shutdown-during-heavy-compilation regularly happens in short runs in Leyden benchmarks. It made me scratch my head for quite a while before I understood where the compilation task went. I would like to add some sort of diagnostics for these cases. > > Example `-XX:+PrintCompilation` output in Leyden after the patch (includes richer compile-task timings): > > > ... > 430 W3.4 Q2.7 C0.3 4397 com.sun.tools.javac.comp.Check::checkProfile (40 bytes) > 447 W0.0 Q0.0 C10.3 4398 java.util.StringJoiner::toString (53 bytes) > 456 W0.0 Q10.4 C9.7 4399 java.lang.System$1::join (11 bytes) > Generated source code for 51 classes and compiled them in 403 ms (1 iterations) > 476 W36.6 Q11.6 C72.1 4393 com.sun.tools.javac.jvm.PoolWriter$WriteablePoolHelper::writeConstant (843 bytes) blocked > 481 W0.0 Q0.0 C157.6 4390 com.sun.tools.javac.comp.TransTypes::visitIdent (129 bytes) blocked > Good. ------------- Marked as reviewed by kvn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23586#pullrequestreview-2612397392 From jkarthikeyan at openjdk.org Wed Feb 12 15:55:11 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 12 Feb 2025 15:55:11 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 06:02:05 GMT, Joe Darcy wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6237: > >> 6235: vpsrld(xtmp1, xtmp1, 24, vec_enc); >> 6236: >> 6237: // As 2^24 is the largest possible integer that can be exactly represented by a float value, special handling has to be > > More exactly, +/- 2^24 is the region were all contiguous integers can be represented as floats. All sufficiently large finite floating-point numbers are integers, but the distance between adjacent floating-point numbers is more than 1. Thanks for the clarification! 
I've adjusted the wording to be more clear. > test/hotspot/jtreg/compiler/vectorization/TestNumberOfContinuousZeros.java line 1: > >> 1: /* > > As a one-off test, in other words not a test to be checked in and run continuously, it would be reassuring to test al the int values are make sure the intrinsic is computing the same result as the Java library version. I ended up writing this test while debugging the patch to exhaustively check the int range: https://gist.github.com/jaskarth/6b05352c3007a2650bf084fcb4c50c13 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952926301 PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952927124 From jkarthikeyan at openjdk.org Wed Feb 12 16:18:12 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 12 Feb 2025 16:18:12 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases In-Reply-To: <8UZrKXluhEQfQK1rRu3OFGqmiEINjKZ1TQcYaRqRLSU=.7f131b8f-b715-433a-b46d-7e327360daa1@github.com> References: <8UZrKXluhEQfQK1rRu3OFGqmiEINjKZ1TQcYaRqRLSU=.7f131b8f-b715-433a-b46d-7e327360daa1@github.com> Message-ID: On Wed, 12 Feb 2025 06:45:33 GMT, Quan Anh Mai wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. 
>> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6242: > >> 6240: vpxor(xtmp2, xtmp2, xtmp2, vec_enc); >> 6241: vpsrld(xtmp2, src, 24, vec_enc); >> 6242: vpandn(src, xtmp2, src, vec_enc); > > Are you sure this will work? I don't see how `x &~ (x >> 24)` can zero out all lower bits. Also, we should not kill `src` here. > > I think a better solution is to do: `x > 0xFF ? x &~ 0xFF : x`. I chose this value because we have `0xFF` in `xtmp1` here. I believe the solution should work since my idea was to get rid of low bits when the high bits past 24 are all `1`, as that is the case where rounding behavior can incorrectly bump up the value. In the other cases, not removing all the low bits shouldn't have an impact on rounding and the exponent. At least, when running an [exhaustive test](https://gist.github.com/jaskarth/6b05352c3007a2650bf084fcb4c50c13) I wrote this patch fixes the output discrepancy. I can change it to the compare and blend you suggested, but I think that will be slower than this approach. Killing `src` is my mistake though, I will fix it in the next commit. > test/hotspot/jtreg/compiler/vectorization/TestNumberOfContinuousZeros.java line 62: > >> 60: inputInt = new int[LEN]; >> 61: outputInt = new int[LEN]; >> 62: rng = new Random(); > > Please use `Util.getRandomInstance()` Nice catch! I've updated it. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952983061 PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952983344 From jkarthikeyan at openjdk.org Wed Feb 12 16:18:13 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 12 Feb 2025 16:18:13 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 08:47:01 GMT, Jatin Bhateja wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6242: > >> 6240: vpxor(xtmp2, xtmp2, xtmp2, vec_enc); >> 6241: vpsrld(xtmp2, src, 24, vec_enc); >> 6242: vpandn(src, xtmp2, src, vec_enc); > > We should not be modifying the 'src' as it may have later use, also > there is an output dependency here as both instruction writes to xtmp2, we can remove the zeroing instruction. Thanks for noticing that, I've fixed both points. 
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6243: > >> 6241: vpsrld(xtmp2, src, 24, vec_enc); >> 6242: vpandn(src, xtmp2, src, vec_enc); >> 6243: > > There are two occurrences of vblendvps in this sequence, it would be worth using following version of blend which emulates it using cheaper instruction sequence and shows better performance on E-core (avx2) targets. > > https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L3553 > > We could do it separately and not as a part of this bug-fix PR. I agree that it might be best to look at this as a followup patch. I want to do a followup RFE to explore whether the same floating point trick can be used to vectorize `Long.numberOfLeadingZeros`, and I can make the change there. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952983648 PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952984763 From jkarthikeyan at openjdk.org Wed Feb 12 16:18:14 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 12 Feb 2025 16:18:14 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases In-Reply-To: References: Message-ID: <64BvSp9v_TocTn-sOdQCEnMiMV67kTTNirU7ZmOHEN4=.4fefa383-4545-4842-9cf2-5efab16f712c@github.com> On Wed, 12 Feb 2025 11:36:42 GMT, Jatin Bhateja wrote: >> test/hotspot/jtreg/compiler/vectorization/TestNumberOfContinuousZeros.java line 45: >> >>> 43: >>> 44: public class TestNumberOfContinuousZeros { >>> 45: private static final int[] SPECIAL = { 0x01FFFFFF, 0x03FFFFFE, 0x07FFFFFC, 0x0FFFFFF8, 0x1FFFFFF0, 0x3FFFFFE0 }; >> >> Can you also check for 0xFFFFFFFF, even though we have special handling for -ve signed numbers not affected by this patch. > > Can you kindly write an exhaustive functional correctness test that covers entier positive integral value range. Idea here is to test all the cases where unintended exponent increment can lead to incorrect results. 
I wrote this exhaustive test while I was working on the patch: https://gist.github.com/jaskarth/6b05352c3007a2650bf084fcb4c50c13 I've also updated the array to include 0xFFFFFFFF. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1952985631 From jkarthikeyan at openjdk.org Wed Feb 12 16:25:35 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 12 Feb 2025 16:25:35 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v2] In-Reply-To: References: Message-ID: <00Wg16mf96UwjZ51EdAR0LayWBvcbgEVzJcc33TZ5v4=.8c60904d-7c42-44cb-821a-14858b281f7c@github.com> > Hi all, > This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. > > This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. > > Reviews would be appreciated! 
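The rounding issue and the masking fix described in the quoted summary can be reproduced in scalar Java. The following is a minimal sketch, not the AVX2 sequence from the patch: the class and method names are hypothetical, `Math.getExponent` stands in for extracting the exponent field of the converted float, and only positive inputs are covered (zero and negative values need the separate handling mentioned in the thread).

```java
// Hypothetical scalar model of "leading zeros via the float exponent".
public class LzcntViaFloat {

    // Naive variant: convert to float and read the unbiased exponent.
    // Only meaningful for x > 0. Goes wrong when the int-to-float
    // conversion rounds up across a power-of-two boundary.
    static int naive(int x) {
        return 31 - Math.getExponent((float) x);
    }

    // Masked variant: clear the low bits selected by (x >>> 24) before
    // converting. For x < 2^24 the mask is zero and nothing changes;
    // for larger x this clears the bit just below the 24-bit
    // significand, so the conversion cannot round up to the next
    // power of two.
    static int masked(int x) {
        int y = x & ~(x >>> 24);
        return 31 - Math.getExponent((float) y);
    }

    public static void main(String[] args) {
        // Boundary values taken from the SPECIAL array quoted in the review.
        int[] special = {
            0x01FFFFFF, 0x03FFFFFE, 0x07FFFFFC,
            0x0FFFFFF8, 0x1FFFFFF0, 0x3FFFFFE0
        };
        for (int x : special) {
            System.out.printf("0x%08X expected=%d naive=%d masked=%d%n",
                    x, Integer.numberOfLeadingZeros(x), naive(x), masked(x));
        }
    }
}
```

For each of these boundary values the naive result should be one less than `Integer.numberOfLeadingZeros`, because the conversion rounds up to the next power of two, while the masked result should match. A spot check like this complements the exhaustive gist linked above; it does not replace it.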
Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Comments from code review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23579/files - new: https://git.openjdk.org/jdk/pull/23579/files/92d50de3..29070d7a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23579&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23579&range=00-01 Stats: 9 lines in 2 files changed: 1 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/23579.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23579/head:pull/23579 PR: https://git.openjdk.org/jdk/pull/23579 From shade at openjdk.org Wed Feb 12 16:26:17 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Feb 2025 16:26:17 GMT Subject: RFR: 8349915: CTW: Lots of level 3 compiles are done at level 2 after JDK-8348570 In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 15:57:22 GMT, Vladimir Kozlov wrote: > Should we just ignore this task selection for CTW? And just use FIFO. I assume we not go here for ciReplay and JVMCI bootstrap. Which are blocking too. Actually, yes! We should probably just accept the first task with `compile_reason() == Whitebox`. Would probably make CTW marginally faster, like we have seen with SCC tasks in Leyden. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23589#issuecomment-2654226140 From kvn at openjdk.org Wed Feb 12 16:28:32 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 16:28:32 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. 
> > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Fix Zero VM build ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/1d108349..b09ddce6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=04-05 Stats: 11 lines in 2 files changed: 7 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From qamai at openjdk.org Wed Feb 12 16:36:11 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 12 Feb 2025 16:36:11 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v2] In-Reply-To: References: <8UZrKXluhEQfQK1rRu3OFGqmiEINjKZ1TQcYaRqRLSU=.7f131b8f-b715-433a-b46d-7e327360daa1@github.com> Message-ID: On Wed, 12 Feb 2025 16:14:05 GMT, Jasmine Karthikeyan wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6242: >> >>> 6240: vpxor(xtmp2, xtmp2, xtmp2, vec_enc); >>> 6241: vpsrld(xtmp2, src, 24, vec_enc); >>> 6242: vpandn(src, xtmp2, src, vec_enc); >> >> Are you sure this will work? I don't see how `x &~ (x >> 24)` can zero out all lower bits. Also, we should not kill `src` here. >> >> I think a better solution is to do: `x > 0xFF ? x &~ 0xFF : x`. I chose this value because we have `0xFF` in `xtmp1` here. > > I believe the solution should work since my idea was to get rid of low bits when the high bits past 24 are all `1`, as that is the case where rounding behavior can incorrectly bump up the value. 
In the other cases, not removing all the low bits shouldn't have an impact on rounding and the exponent. At least, when running an [exhaustive test](https://gist.github.com/jaskarth/6b05352c3007a2650bf084fcb4c50c13) I wrote this patch fixes the output discrepancy. I can change it to the compare and blend you suggested, but I think that will be slower than this approach. Killing `src` is my mistake though, I will fix it in the next commit. Yes you are right, that was my mistake, please make it clearer in the comment. I also think that you don't need to zero `xtmp2` before the right shift, am I right? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1953021165 From gcao at openjdk.org Wed Feb 12 16:38:48 2025 From: gcao at openjdk.org (Gui Cao) Date: Wed, 12 Feb 2025 16:38:48 GMT Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic [v2] In-Reply-To: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> Message-ID: > Follow this patch https://github.com/openjdk/jdk/pull/22491, RISC-V implementation for Class.isInstance intrinsic. > > > ### JMH numbers (tested on milkv megrez with hotspot client build): > > #### before this patch: > > Benchmark Mode Cnt Score Error Units > SecondarySupersLookup.testNegative00 avgt 15 48.589 ± 0.981 ns/op > SecondarySupersLookup.testNegative01 avgt 15 48.577 ± 0.297 ns/op > SecondarySupersLookup.testNegative02 avgt 15 48.760 ± 0.740 ns/op > SecondarySupersLookup.testNegative03 avgt 15 48.442 ± 0.029 ns/op > SecondarySupersLookup.testNegative04 avgt 15 48.453 ± 0.095 ns/op > SecondarySupersLookup.testNegative05 avgt 15 48.435 ± 0.025 ns/op > SecondarySupersLookup.testNegative06 avgt 15 48.540 ± 0.476 ns/op > SecondarySupersLookup.testNegative07 avgt 15 48.452 ± 0.032 ns/op > SecondarySupersLookup.testNegative08 avgt 15 48.466 ±
0.034 ns/op > SecondarySupersLookup.testNegative09 avgt 15 48.478 ± 0.132 ns/op > SecondarySupersLookup.testNegative10 avgt 15 48.435 ± 0.032 ns/op > SecondarySupersLookup.testNegative16 avgt 15 48.440 ± 0.027 ns/op > SecondarySupersLookup.testNegative20 avgt 15 47.977 ± 0.989 ns/op > SecondarySupersLookup.testNegative30 avgt 15 48.655 ± 0.487 ns/op > SecondarySupersLookup.testNegative32 avgt 15 48.566 ± 0.251 ns/op > SecondarySupersLookup.testNegative40 avgt 15 48.513 ± 0.196 ns/op > SecondarySupersLookup.testNegative50 avgt 15 48.454 ± 0.075 ns/op > SecondarySupersLookup.testNegative55 avgt 15 71.670 ± 1.632 ns/op > SecondarySupersLookup.testNegative56 avgt 15 70.923 ± 1.679 ns/op > SecondarySupersLookup.testNegative57 avgt 15 70.140 ± 0.048 ns/op > SecondarySupersLookup.testNegative58 avgt 15 70.473 ± 0.726 ns/op > SecondarySupersLookup.testNegative59 avgt 15 70.127 ± 0.022 ns/op > SecondarySupersLookup.testNegative60 avgt 15 82.525 ± 1.178 ns/op > SecondarySupersLookup.testNegative61 avgt 15 81.647 ± 0.758 ns/op > SecondarySupersLookup.testNegative62 avgt 15 82.347 ± 1.943 ns/op > SecondarySupersLookup.testNegative63 avgt 15 129.188 ± 1.550 ns/op > SecondarySupersLookup.testNegative64 avgt 15 130.274 ± 1.668 ns/op > SecondarySupersLookup.testPositive01 avgt 15 63.390 ± 0.222 ns/op > SecondarySupersLookup.testPositive02 avgt 15 63.435 ± 0.027 ns/op > SecondarySupersLookup.testPositive03 avgt 15 63.469 ± 0.080 ns/op > SecondarySupersLookup.testPositive04 avgt 15 63.896 ...
Gui Cao has updated the pull request incrementally with one additional commit since the last revision: Update for Hamlin's comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23551/files - new: https://git.openjdk.org/jdk/pull/23551/files/c2a6cffe..2ec9dac1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23551&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23551&range=00-01 Stats: 8 lines in 1 file changed: 1 ins; 1 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/23551.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23551/head:pull/23551 PR: https://git.openjdk.org/jdk/pull/23551 From gcao at openjdk.org Wed Feb 12 16:38:48 2025 From: gcao at openjdk.org (Gui Cao) Date: Wed, 12 Feb 2025 16:38:48 GMT Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic [v2] In-Reply-To: References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> Message-ID: On Wed, 12 Feb 2025 12:01:46 GMT, Hamlin Li wrote: >> Gui Cao has updated the pull request incrementally with one additional commit since the last revision: >> >> Update for Hamlin's comment > > src/hotspot/cpu/riscv/c1_Runtime1_riscv.cpp line 932: > >> 930: // Mirror: c_rarg0 >> 931: // Object: c_rarg1 >> 932: // Temps: x13, x14, x15, x16, x17 > > in this patch, maybe we should consistently use `c_rarg`n or `x1`n rather than mix these 2 types of register names? Fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23551#discussion_r1953025047 From qamai at openjdk.org Wed Feb 12 16:44:02 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 12 Feb 2025 16:44:02 GMT Subject: RFR: 8315066: Add unsigned bounds and known bits to TypeInt/Long [v42] In-Reply-To: References: Message-ID: > Hi, > > This patch adds unsigned bounds and known bits constraints to TypeInt and TypeLong. This opens more transformation opportunities in an elegant manner as well as helps avoid some ad-hoc rules in Hotspot. 
> > In general, a `TypeInt/Long` represents a set of values `x` that satisfies: `x s>= lo && x s<= hi && x u>= ulo && x u<= uhi && (x & zeros) == 0 && (x & ones) == ones`. These constraints are not independent, e.g. an int that lies in [0, 3] in signed domain must also lie in [0, 3] in unsigned domain and have all bits but the last 2 being unset. As a result, we must canonicalize the constraints (tighten the constraints so that they are optimal) before constructing a `TypeInt/Long` instance. > > This is extracted from #15440 , node value transformations are left for later PRs. I have also added unit tests to verify the soundness of constraint normalization. > > Please kindly review, thanks a lot. > > Testing > > - [x] GHA > - [x] Linux x64, tier 1-4 Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 55 commits: - Merge branch 'master' into unsignedbounds - harden SimpleCanonicalResult - number lemmas - include - clean up intn_t - refine first_violation - assignment operator - exhaustive tests - make con - Emmanuel's review - ... and 45 more: https://git.openjdk.org/jdk/compare/332d87cc...0f347a53 ------------- Changes: https://git.openjdk.org/jdk/pull/17508/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17508&range=41 Stats: 2348 lines in 13 files changed: 1789 ins; 325 del; 234 mod Patch: https://git.openjdk.org/jdk/pull/17508.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17508/head:pull/17508 PR: https://git.openjdk.org/jdk/pull/17508 From qamai at openjdk.org Wed Feb 12 16:44:02 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 12 Feb 2025 16:44:02 GMT Subject: RFR: 8315066: Add unsigned bounds and known bits to TypeInt/Long [v7] In-Reply-To: References: Message-ID: On Tue, 24 Sep 2024 16:08:52 GMT, Emanuel Peter wrote: >> @eme64 Thanks to your suggestions, I have managed to come up with a (fairly) formal proof for the algorithm here! 
> > @merykitty FYI: I'm going on vacation for 3 weeks, so I'll hope to come back to this afterward. @eme64 Ping ------------- PR Comment: https://git.openjdk.org/jdk/pull/17508#issuecomment-2654272971 From gcao at openjdk.org Wed Feb 12 16:51:20 2025 From: gcao at openjdk.org (Gui Cao) Date: Wed, 12 Feb 2025 16:51:20 GMT Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic [v2] In-Reply-To: References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> Message-ID: <0m_JPmJGU24Zr9Mpf-hDYzEtuKajcVBhaL4qaINnfIo=.de050cde-a135-4481-a4ff-ade339444b2d@github.com> On Wed, 12 Feb 2025 12:01:46 GMT, Hamlin Li wrote: > in this patch, maybe we should consistently use `c_rarg`n or `x1`n rather than mix these 2 types of register names? Fixed. > This is also a failure case? Could it bring some benefit to jump to `fail` below or put the `fail` target here (i.e. remove the `fail` below)? Thanks for your review. In fact I don't know if this is a failure case here, I tried jumping to success or failure first(beq(klass, x17, success)/bne(klass, x17, fail)) and the performance results had little impact. > An extra empty line above this would help read code. Fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23551#discussion_r1953046428 PR Review Comment: https://git.openjdk.org/jdk/pull/23551#discussion_r1953046073 PR Review Comment: https://git.openjdk.org/jdk/pull/23551#discussion_r1953046699 From jbhateja at openjdk.org Wed Feb 12 17:08:25 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Feb 2025 17:08:25 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 14:46:49 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Review comments resolutions > > Looks good. 
I merged this PR with master, successfully (at the time) with no conflicts, and ran it through tier 1 to 3 testing and there were no failures. Thanks @PaulSandoz , @eme64 and @sviswa7 for your valuable feedback. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2654337191 From jbhateja at openjdk.org Wed Feb 12 17:08:28 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Feb 2025 17:08:28 GMT Subject: Integrated: 8342103: C2 compiler support for Float16 type and associated scalar operations In-Reply-To: References: Message-ID: <0jFE4E2Aewb7aCN5nZrmV3Lz3SSsNSmhhUEiL9JQjMA=.c202afcf-340c-4fca-8a2a-778c7677fe1f@github.com> On Sun, 15 Dec 2024 18:05:02 GMT, Jatin Bhateja wrote: > Hi All, > > This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) > > Following is the summary of changes included with this patch:- > > 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. > 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. > 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. > - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. > 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. > 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. > 7. 
Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. > 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF > 9. X86 backend implementation for all supported intrinsics. > 10. Functional and Performance validation tests. > > Kindly review the patch and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated. Changeset: 4b463ee7 Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/4b463ee70eceb94fdfbffa5c49dd58dcc6a6c890 Stats: 2855 lines in 56 files changed: 2788 ins; 0 del; 67 mod 8342103: C2 compiler support for Float16 type and associated scalar operations Co-authored-by: Paul Sandoz Co-authored-by: Bhavana Kilambi Co-authored-by: Joe Darcy Co-authored-by: Raffaello Giulietti Reviewed-by: psandoz, epeter, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/22754 From jkarthikeyan at openjdk.org Wed Feb 12 17:48:10 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 12 Feb 2025 17:48:10 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v2] In-Reply-To: References: <8UZrKXluhEQfQK1rRu3OFGqmiEINjKZ1TQcYaRqRLSU=.7f131b8f-b715-433a-b46d-7e327360daa1@github.com> Message-ID: On Wed, 12 Feb 2025 16:33:08 GMT, Quan Anh Mai wrote: >> I believe the solution should work since my idea was to get rid of low bits when the high bits past 24 are all `1`, as that is the case where rounding behavior can incorrectly bump up the value. In the other cases, not removing all the low bits shouldn't have an impact on rounding and the exponent. 
At least, when running an [exhaustive test](https://gist.github.com/jaskarth/6b05352c3007a2650bf084fcb4c50c13) I wrote, this patch fixes the output discrepancy. I can change it to the compare and blend you suggested, but I think that will be slower than this approach. Killing `src` is my mistake though, I will fix it in the next commit. > > Yes you are right, that was my mistake, please make it clearer in the comment. I also think that you don't need to zero `xtmp2` before the right shift, am I right? I'll make sure to reword the comment to describe the behavior more accurately. You are correct that we don't need to zero xtmp2, Jatin mentioned the same thing and I removed it in the last commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1953136949 From jkarthikeyan at openjdk.org Wed Feb 12 19:45:34 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 12 Feb 2025 19:45:34 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: > Hi all, > This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculate the leading zeros based on the exponent part of the float. Unfortunately, floats can only exactly represent every integer up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. > > This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior.
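The rounding problem described above can be reproduced in scalar Java. The following is only a sketch under stated assumptions: `LzcntViaFloat`, `naiveLzcnt` and `maskedLzcnt` are hypothetical names, and the fixed `0xFFFFFF00` mask is one simple way to drop low bits for inputs of at least 2^24; it is not the actual AVX2 instruction sequence from the patch.

```java
// Hypothetical scalar model of "count leading zeros via the float exponent".
class LzcntViaFloat {
    // Naive version: convert to float and derive the count from the biased exponent.
    // The int-to-float conversion rounds to nearest, which can bump the exponent.
    static int naiveLzcnt(int x) {
        if (x == 0) return 32;   // no bit set at all
        if (x < 0) return 0;     // sign bit set, so no leading zeros
        int exponent = (Float.floatToRawIntBits((float) x) >>> 23) - 127;
        return 31 - exponent;
    }

    // Assumed fix for illustration: for inputs >= 2^24, clear the low 8 bits so the
    // remaining value fits in the 24-bit significand and converts exactly.
    static int maskedLzcnt(int x) {
        int masked = (x >= (1 << 24)) ? (x & 0xFFFFFF00) : x;
        return naiveLzcnt(masked);
    }

    public static void main(String[] args) {
        int x = 0x01FFFFFF;
        System.out.println(Integer.numberOfLeadingZeros(x)); // 7
        System.out.println(naiveLzcnt(x));                   // 6, rounded up to 2^25
        System.out.println(maskedLzcnt(x));                  // 7
    }
}
```

Clearing the low bits never changes the position of the highest set bit, which is the only thing the exponent is used for here.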
I've added these cases to `TestNumberOfContinuousZeros`, and removed the fixed random seed so that it can produce random inputs to test with. > > Reviews would be appreciated! Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Improve explanation of logic ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23579/files - new: https://git.openjdk.org/jdk/pull/23579/files/29070d7a..36228aea Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23579&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23579&range=01-02 Stats: 5 lines in 1 file changed: 2 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23579.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23579/head:pull/23579 PR: https://git.openjdk.org/jdk/pull/23579 From duke at openjdk.org Wed Feb 12 19:48:29 2025 From: duke at openjdk.org (Johannes Graham) Date: Wed, 12 Feb 2025 19:48:29 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v27] In-Reply-To: References: Message-ID: > C2 does not eliminate XOR nodes with constant arguments. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This patch demonstrates a potential fix to the problem, but there might well be better ways to do it.
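To make the `Long.expand` remark above more concrete, here is a minimal reference sketch of what an expand (bit deposit) operation computes when no `PDEP`-like instruction is available. `ExpandSketch` and `expandReference` are hypothetical names, and this simple loop is an assumption chosen for readability; it is not the JDK's fallback implementation. It only illustrates that, with a constant mask, every value derived purely from the mask (including the XOR results) is a compile-time constant that a compiler could fold.

```java
// Hypothetical reference model of a bit-deposit ("expand") operation.
class ExpandSketch {
    // Deposits the low bits of x, in order, into the positions of the set bits of mask.
    static long expandReference(long x, long mask) {
        long result = 0L;
        while (mask != 0L) {
            long lowest = mask & -mask;   // isolate the lowest set bit of the mask
            if ((x & 1L) != 0L) {
                result |= lowest;         // copy the current low bit of x to that position
            }
            x >>>= 1;                     // move on to the next source bit
            mask ^= lowest;               // XOR clears the bit we just handled
        }
        return result;
    }

    public static void main(String[] args) {
        // With a constant mask, 'lowest' and the updated 'mask' do not depend on x.
        long deposited = expandReference(0b101L, 0b11010L);
        System.out.println(Long.toBinaryString(deposited)); // 10010
    }
}
```

In this sketch only the `x & 1L` test and the shift of `x` depend on the runtime input; the rest is a function of the mask alone.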
Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: formatting, remove commented tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/cf779497..4a291202 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=26 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=25-26 Stats: 79 lines in 2 files changed: 0 ins; 77 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From shade at openjdk.org Wed Feb 12 19:50:51 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Feb 2025 19:50:51 GMT Subject: RFR: 8349915: CTW: Lots of level 3 compiles are done at level 2 after JDK-8348570 [v2] In-Reply-To: References: Message-ID: > Noticed this in manual CTW runs after [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) that lots and lots of methods are compiled at level 2 instead of requested level 3: >
> ...
> [97] javax.enterprise.deploy.shared.ActionType::getValue() WARNING compilation level = 2, but not 3
> [97] javax.enterprise.deploy.shared.ActionType::getOffset() WARNING compilation level = 2, but not 3
> [97] javax.enterprise.deploy.shared.ActionType::getEnumValueTable() WARNING compilation level = 2, but not 3
> [97] javax.enterprise.deploy.shared.ActionType::getStringTable() WARNING compilation level = 2, but not 3
> [97] javax.enterprise.deploy.shared.ActionType::getActionType(int) WARNING compilation level = 2, but not 3
> [97] javax.enterprise.deploy.shared.ActionType::toString() WARNING compilation level = 2, but not 3
> [99] javax.enterprise.deploy.shared.DConfigBeanVersionType
> [98] javax.enterprise.deploy.shared.CommandType::toString() WARNING compilation level = 2, but not 3
> [98] javax.enterprise.deploy.shared.CommandType::getOffset() WARNING compilation level = 2, but not 3
> ...
>
> I narrowed it down to level downgrade in compilation policy here: > https://github.com/openjdk/jdk/blob/ed17c55ea34b3b6009dab11d64f21e0b7af3d701/src/hotspot/share/compiler/compilationPolicy.cpp#L677 > > [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) enters here, because we mark all methods as having profiles to extend the CTW scope. So now `is_method_profiled(max_method_h)` is `true` and downgrade happens. There is already a check for `!Arguments::is_compiler_only()` there, so I think we had better exclude CTW from this downgrade as well. > > I looked at possibly making this kind of downgrade fatal in CTW runner, but the error propagation there is not simple. I filed [JDK-8349917](https://bugs.openjdk.org/browse/JDK-8349917) if anyone wants to take a stab at it. > > I looked at other `set_comp_level()` uses in Hotspot, and this is the only place where it is called. So I presume we have caught all places where this downgrade can happen.
> > Additional testing: > - [x] Linux x86_64 server fastdebug, eyeballing some manual CTW run results > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` passes Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Shortcut CTW tasks directly ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23589/files - new: https://git.openjdk.org/jdk/pull/23589/files/1c6783ff..602dcfc4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23589&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23589&range=00-01 Stats: 9 lines in 1 file changed: 5 ins; 3 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23589.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23589/head:pull/23589 PR: https://git.openjdk.org/jdk/pull/23589 From shade at openjdk.org Wed Feb 12 19:57:11 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Feb 2025 19:57:11 GMT Subject: RFR: 8349915: CTW: Lots of level 3 compiles are done at level 2 after JDK-8348570 [v2] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 16:23:14 GMT, Aleksey Shipilev wrote: > Actually, yes! We should probably just accept the first task with `compile_reason() == Whitebox`. Would probably make CTW marginally faster, like we have seen with SCC tasks in Leyden. Done in new commit. Still fixes the issue, and the fix is arguably cleaner than overloading the already complicated predicate with another check. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23589#issuecomment-2654710554 From kvn at openjdk.org Wed Feb 12 20:10:11 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 20:10:11 GMT Subject: RFR: 8349915: CTW: Lots of level 3 compiles are done at level 2 after JDK-8348570 [v2] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:50:51 GMT, Aleksey Shipilev wrote: >> Noticed this in manual CTW runs after [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) that lots and lots of methods are compiled at level 2 instead of requested level 3: >>
>> ...
>> [97] javax.enterprise.deploy.shared.ActionType::getValue() WARNING compilation level = 2, but not 3
>> [97] javax.enterprise.deploy.shared.ActionType::getOffset() WARNING compilation level = 2, but not 3
>> [97] javax.enterprise.deploy.shared.ActionType::getEnumValueTable() WARNING compilation level = 2, but not 3
>> [97] javax.enterprise.deploy.shared.ActionType::getStringTable() WARNING compilation level = 2, but not 3
>> [97] javax.enterprise.deploy.shared.ActionType::getActionType(int) WARNING compilation level = 2, but not 3
>> [97] javax.enterprise.deploy.shared.ActionType::toString() WARNING compilation level = 2, but not 3
>> [99] javax.enterprise.deploy.shared.DConfigBeanVersionType
>> [98] javax.enterprise.deploy.shared.CommandType::toString() WARNING compilation level = 2, but not 3
>> [98] javax.enterprise.deploy.shared.CommandType::getOffset() WARNING compilation level = 2, but not 3
>> ...
>>
>> I narrowed it down to level downgrade in compilation policy here: >> https://github.com/openjdk/jdk/blob/ed17c55ea34b3b6009dab11d64f21e0b7af3d701/src/hotspot/share/compiler/compilationPolicy.cpp#L677 >> >> [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) enters here, because we mark all methods as having profiles to extend the CTW scope. So now `is_method_profiled(max_method_h)` is `true` and downgrade happens. There is already a check for `!Arguments::is_compiler_only()` there, so I think we had better exclude CTW from this downgrade as well. >> >> I looked at possibly making this kind of downgrade fatal in CTW runner, but the error propagation there is not simple. I filed [JDK-8349917](https://bugs.openjdk.org/browse/JDK-8349917) if anyone wants to take a stab at it. >> >> I looked at other `set_comp_level()` uses in Hotspot, and this is the only place where it is called. So I presume we have caught all places where this downgrade can happen. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, eyeballing some manual CTW run results >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` passes > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Shortcut CTW tasks directly Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23589#pullrequestreview-2613083209 From kvn at openjdk.org Wed Feb 12 20:21:13 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 20:21:13 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Wed, 12 Feb 2025 16:28:32 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch the hidden VPTR pointer to the class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in the future. >> >> Fixed/cleaned SA code which processes CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob.
>> > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero VM build It is ready for re-review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2654754643 From dlong at openjdk.org Wed Feb 12 20:27:12 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 12 Feb 2025 20:27:12 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Wed, 12 Feb 2025 06:56:40 GMT, Fei Yang wrote: > FYI: `test/hotspot/jtreg/compiler/jsr292/MHDeoptTest.java` and hs-tier1 test good on linux-riscv64 with fastdebug build. A good sanity check is to remove the fix in deoptimization.cpp and see if the new test triggers the new asserts. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23557#issuecomment-2654764709 From mli at openjdk.org Wed Feb 12 20:34:13 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 12 Feb 2025 20:34:13 GMT Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic [v2] In-Reply-To: References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> Message-ID: On Wed, 12 Feb 2025 16:38:48 GMT, Gui Cao wrote: >> Follow this patch https://github.com/openjdk/jdk/pull/22491, RISC-V implementation for Class.isInstance intrinsic. >> >> >> ### JMH numbers (tested on milkv megrez with hotspot client build): >> >> #### before this patch: >>
>> Benchmark                             Mode  Cnt    Score   Error  Units
>> SecondarySupersLookup.testNegative00  avgt   15   48.589 ± 0.981  ns/op
>> SecondarySupersLookup.testNegative01  avgt   15   48.577 ± 0.297  ns/op
>> SecondarySupersLookup.testNegative02  avgt   15   48.760 ± 0.740  ns/op
>> SecondarySupersLookup.testNegative03  avgt   15   48.442 ± 0.029  ns/op
>> SecondarySupersLookup.testNegative04  avgt   15   48.453 ± 0.095  ns/op
>> SecondarySupersLookup.testNegative05  avgt   15   48.435 ± 0.025  ns/op
>> SecondarySupersLookup.testNegative06  avgt   15   48.540 ± 0.476  ns/op
>> SecondarySupersLookup.testNegative07  avgt   15   48.452 ± 0.032  ns/op
>> SecondarySupersLookup.testNegative08  avgt   15   48.466 ± 0.034  ns/op
>> SecondarySupersLookup.testNegative09  avgt   15   48.478 ± 0.132  ns/op
>> SecondarySupersLookup.testNegative10  avgt   15   48.435 ± 0.032  ns/op
>> SecondarySupersLookup.testNegative16  avgt   15   48.440 ± 0.027  ns/op
>> SecondarySupersLookup.testNegative20  avgt   15   47.977 ± 0.989  ns/op
>> SecondarySupersLookup.testNegative30  avgt   15   48.655 ± 0.487  ns/op
>> SecondarySupersLookup.testNegative32  avgt   15   48.566 ± 0.251  ns/op
>> SecondarySupersLookup.testNegative40  avgt   15   48.513 ± 0.196  ns/op
>> SecondarySupersLookup.testNegative50  avgt   15   48.454 ± 0.075  ns/op
>> SecondarySupersLookup.testNegative55  avgt   15   71.670 ± 1.632  ns/op
>> SecondarySupersLookup.testNegative56  avgt   15   70.923 ± 1.679  ns/op
>> SecondarySupersLookup.testNegative57  avgt   15   70.140 ± 0.048  ns/op
>> SecondarySupersLookup.testNegative58  avgt   15   70.473 ± 0.726  ns/op
>> SecondarySupersLookup.testNegative59  avgt   15   70.127 ± 0.022  ns/op
>> SecondarySupersLookup.testNegative60  avgt   15   82.525 ± 1.178  ns/op
>> SecondarySupersLookup.testNegative61  avgt   15   81.647 ± 0.758  ns/op
>> SecondarySupersLookup.testNegative62  avgt   15   82.347 ± 1.943  ns/op
>> SecondarySupersLookup.testNegative63  avgt   15  129.188 ± 1.550  ns/op
>> SecondarySupersLookup.testNegative64  avgt   15  130.274 ± 1.668  ns/op
>> SecondarySupersLookup.testPositive01  avgt   15   63.390 ± 0.222  ns/op
>> SecondarySupersLookup.testPositive02  avgt   15   63.435 ± 0.027  ns/op
>> SecondarySupersLookup.testPositive03  avgt   15   63....
> > Gui Cao has updated the pull request incrementally with one additional commit since the last revision: > > Update for Hamlin's comment Thanks for updating. Looks good.
------------- Marked as reviewed by mli (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23551#pullrequestreview-2613127320 From dlong at openjdk.org Wed Feb 12 21:01:15 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 12 Feb 2025 21:01:15 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash In-Reply-To: <00HHPN1Q9xrNf8Ps_9S7hOOHHmw2mNocFrQzqxzYhRA=.bb2f9c11-12c5-4efa-8314-4415e22e31f8@github.com> References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> <00HHPN1Q9xrNf8Ps_9S7hOOHHmw2mNocFrQzqxzYhRA=.bb2f9c11-12c5-4efa-8314-4415e22e31f8@github.com> Message-ID: On Wed, 12 Feb 2025 12:44:18 GMT, Richard Reingruber wrote:
>> src/hotspot/cpu/ppc/abstractInterpreter_ppc.cpp line 136:
>>
>>> 134: // Test caller-aligned placement vs callee-aligned
>>> 135: intptr_t* l2 = caller->sp() + method->max_locals() - 1 + (frame::java_abi_size / Interpreter::stackElementSize);
>>> 136: assert(l2 >= locals_base, "bad placement");
>>
>> The assertion at L136 fails on ppc64 (similar to what @offamitkumar reported for s390x). >> I don't understand the assertion because it is just a stricter version of the first one. >> On ppc64 the sp of `caller` is aligned down because it needs to be 16 byte aligned. `locals_base` is only 8 byte aligned. But from what I saw the difference was larger than just one word. Maybe `caller` has got an c2i extension? I guess this would be problematic. >> On x86_64 `l2` depends on the last expression stack pointer not on the `caller`'s sp. If you try to translate this to ppc64 then you'll get the expression used to initialize `locals_base` at L128. >> I think you can remove the 2nd assertion. Even the first one looks redundant. >> Besides that I've tested `MHDeoptTest.java` successfully on ppc64. > >> Maybe `caller` has got an c2i extension? I guess this would be problematic. > > I meant i2c extension.
The two asserts together are supposed to be an upper and lower bound. The first assert is a stricter version of the assert that was originally added by JDK-7090904. It looks like the 2nd assert should have been reversed, assuming l2 is correct. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23557#discussion_r1953390060 From dlong at openjdk.org Wed Feb 12 21:04:10 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 12 Feb 2025 21:04:10 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> <00HHPN1Q9xrNf8Ps_9S7hOOHHmw2mNocFrQzqxzYhRA=.bb2f9c11-12c5-4efa-8314-4415e22e31f8@github.com> Message-ID: <_qQKsbCLbRxjva6W92W8_k82ldOlqIkFnT2keBDKLlw=.320cd20b-e8b8-422c-86eb-6d2607870529@github.com> On Wed, 12 Feb 2025 20:58:34 GMT, Dean Long wrote: >>> Maybe `caller` has got an c2i extension? I guess this would be problematic. >> >> I meant i2c extension. > > The two asserts together are supposed to be an upper and lower bound. The first assert is a stricter version of the assert that was originally added by JDK-7090904. It looks like the 2nd assert should have been reversed, assuming l2 is correct. I was lazy about naming, so `l2` has a different meaning in the x64 asserts. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23557#discussion_r1953393697 From mdoerr at openjdk.org Wed Feb 12 21:06:42 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 12 Feb 2025 21:06:42 GMT Subject: RFR: 8349727: [PPC] C1: Improve Class.isInstance intrinsic Message-ID: PPC64 implementation of [JDK-8337251](https://bugs.openjdk.org/browse/JDK-8337251). The new runtime stub is called like a C function and therefore needs a `FunctionDescriptor` on PPC64 with ABIv1. The entry needs to be updated after relocation. I have used and enhanced the relocation code for that. 
------------- Commit messages: - 8349727: [PPC] C1: Improve Class.isInstance intrinsic Changes: https://git.openjdk.org/jdk/pull/23602/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23602&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349727 Stats: 102 lines in 3 files changed: 93 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23602.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23602/head:pull/23602 PR: https://git.openjdk.org/jdk/pull/23602 From dlong at openjdk.org Wed Feb 12 21:09:31 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 12 Feb 2025 21:09:31 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v2] In-Reply-To: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: > When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. > > In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. > > Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. I need help testing those asserts for arm32, ppc, riscv, and s390. 
Dean Long has updated the pull request incrementally with one additional commit since the last revision: fix bounds checks ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23557/files - new: https://git.openjdk.org/jdk/pull/23557/files/8734abd4..a7a0ed7a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23557&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23557&range=00-01 Stats: 6 lines in 4 files changed: 2 ins; 2 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23557.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23557/head:pull/23557 PR: https://git.openjdk.org/jdk/pull/23557 From dlong at openjdk.org Wed Feb 12 21:14:09 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 12 Feb 2025 21:14:09 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v2] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Wed, 12 Feb 2025 21:09:31 GMT, Dean Long wrote: >> When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. >> >> In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. >> >> Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. I need help testing those asserts for arm32, ppc, riscv, and s390. 
> > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > fix bounds checks I just pushed a fix for the s390 and ppc bounds check logic, but I'm still not sure if I am using the correct values for the end of the frame. The asserts should pass with the deoptimization.cpp fix. The 2nd assert should fail w/o the deoptimization.cpp fix when running the new test. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23557#issuecomment-2654853295 From dchuyko at openjdk.org Wed Feb 12 22:28:13 2025 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Wed, 12 Feb 2025 22:28:13 GMT Subject: RFR: 8347917: AArch64: Enable upper GPR registers in C1 In-Reply-To: <3uBx3CxAxNTsE4zuMhEbZ85eRq2BnrJFMjnNKSQdlPQ=.4f76b91c-c510-4e15-9334-2f008c714711@github.com> References: <3uBx3CxAxNTsE4zuMhEbZ85eRq2BnrJFMjnNKSQdlPQ=.4f76b91c-c510-4e15-9334-2f008c714711@github.com> Message-ID: On Sun, 26 Jan 2025 16:16:59 GMT, Andrew Haley wrote: > > > As for the different allocation order (to prefer platform callee-saved registers), do you think something simple like last->first order will work for all platforms? > > > > > > It might. It's certainly an interesting thing to try. I'm particularly interested because it potentially reduces the overhead for type checks. > > Let's do this in a separate patch. Just a few things to keep here: 1. Even for aarch64 just reversing allocation order is not enough (callee preserved regs are saved in a caller). 2. Register saving overhead for runtime calls is there, but making a call without saving is still expensive. 
Consider a benchmark that keeps few values alive and performs a runtime call:

    long[] arr;

    @Setup
    public void setup() {
        arr = new long[8];
    }

    @Benchmark
    public void test(Blackhole bh) {
        long v0 = arr[0];
        long v1 = arr[1];
        long v2 = arr[2];
        long v3 = arr[3];
        long v4 = arr[4];
        long v5 = arr[5];
        long v6 = arr[6];
        long v7 = arr[7];
        v1 += v0; v2 += v1; v3 += v2; v4 += v3; v5 += v4; v6 += v5; v7 += v6; v0 += v7;
        v1 *= v0; v2 *= v1; v3 *= v2; v4 *= v3; v5 *= v4; v6 *= v5; v7 *= v6; v0 *= v7;
        double d0 = Double.longBitsToDouble(v0);
        d0 = Math.sin(d0); // dsin is c1 runtime call
        v0 = Double.doubleToRawLongBits(d0);
        v1 += v0; v2 += v1; v3 += v2; v4 += v3; v5 += v4; v6 += v5; v7 += v6; v0 += v7;
        v1 *= v0; v2 *= v1; v3 *= v2; v4 *= v3; v5 *= v4; v6 *= v5; v7 *= v6; v0 *= v7;
        bh.consume(v0);
        bh.consume(v1);
        bh.consume(v2);
        bh.consume(v3);
        bh.consume(v4);
        bh.consume(v5);
        bh.consume(v6);
        bh.consume(v7);
    }

In '-XX:TieredStopAtLevel=1' mode I observe results like 28.337 ± 0.803 ns/op. If dsin is calculated and consumed in the end of the method, it's like 27.039 ± 0.182 ns/op. Without the call it's 22.595 ± 0.853 ns/op.

With the call hottest methods are distributed like

    89.12% c1, level 1 org.openjdk.bench.vm.compiler.jmh_generated.VMCall_baseline_jmhTest::baseline_avgt_jmhStub, version 2, compile id 798
    10.69% runtime stub StubRoutines::libmDsin

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23152#issuecomment-2654975579

From fyang at openjdk.org Thu Feb 13 01:17:10 2025
From: fyang at openjdk.org (Fei Yang)
Date: Thu, 13 Feb 2025 01:17:10 GMT
Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic [v2]
In-Reply-To:
References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com>
Message-ID:

On Wed, 12 Feb 2025 16:38:48 GMT, Gui Cao wrote:

>> Follow this patch https://github.com/openjdk/jdk/pull/22491, RISC-V implementation for Class.isInstance intrinsic.
>>
>>
>> ### JMH numbers (tested on milkv megrez with hotspot client build):
>>
>> #### before this patch:
>>
>> Benchmark                             Mode  Cnt    Score   Error  Units
>> SecondarySupersLookup.testNegative00  avgt   15   48.589 ± 0.981  ns/op
>> SecondarySupersLookup.testNegative01  avgt   15   48.577 ± 0.297  ns/op
>> SecondarySupersLookup.testNegative02  avgt   15   48.760 ± 0.740  ns/op
>> SecondarySupersLookup.testNegative03  avgt   15   48.442 ± 0.029  ns/op
>> SecondarySupersLookup.testNegative04  avgt   15   48.453 ± 0.095  ns/op
>> SecondarySupersLookup.testNegative05  avgt   15   48.435 ± 0.025  ns/op
>> SecondarySupersLookup.testNegative06  avgt   15   48.540 ± 0.476  ns/op
>> SecondarySupersLookup.testNegative07  avgt   15   48.452 ± 0.032  ns/op
>> SecondarySupersLookup.testNegative08  avgt   15   48.466 ± 0.034  ns/op
>> SecondarySupersLookup.testNegative09  avgt   15   48.478 ± 0.132  ns/op
>> SecondarySupersLookup.testNegative10  avgt   15   48.435 ± 0.032  ns/op
>> SecondarySupersLookup.testNegative16  avgt   15   48.440 ± 0.027  ns/op
>> SecondarySupersLookup.testNegative20  avgt   15   47.977 ± 0.989  ns/op
>> SecondarySupersLookup.testNegative30  avgt   15   48.655 ± 0.487  ns/op
>> SecondarySupersLookup.testNegative32  avgt   15   48.566 ± 0.251  ns/op
>> SecondarySupersLookup.testNegative40  avgt   15   48.513 ± 0.196  ns/op
>> SecondarySupersLookup.testNegative50  avgt   15   48.454 ± 0.075  ns/op
>> SecondarySupersLookup.testNegative55  avgt   15   71.670 ± 1.632  ns/op
>> SecondarySupersLookup.testNegative56  avgt   15   70.923 ± 1.679  ns/op
>> SecondarySupersLookup.testNegative57  avgt   15   70.140 ± 0.048  ns/op
>> SecondarySupersLookup.testNegative58  avgt   15   70.473 ± 0.726  ns/op
>> SecondarySupersLookup.testNegative59  avgt   15   70.127 ± 0.022  ns/op
>> SecondarySupersLookup.testNegative60  avgt   15   82.525 ± 1.178  ns/op
>> SecondarySupersLookup.testNegative61  avgt   15   81.647 ± 0.758  ns/op
>> SecondarySupersLookup.testNegative62  avgt   15   82.347 ± 1.943  ns/op
>> SecondarySupersLookup.testNegative63  avgt   15  129.188 ± 1.550  ns/op
>> SecondarySupersLookup.testNegative64  avgt   15  130.274 ± 1.668  ns/op
>> SecondarySupersLookup.testPositive01  avgt   15   63.390 ± 0.222  ns/op
>> SecondarySupersLookup.testPositive02  avgt   15   63.435 ± 0.027  ns/op
>> SecondarySupersLookup.testPositive03  avgt   15   63....
>
> Gui Cao has updated the pull request incrementally with one additional commit since the last revision:
>
>   Update for Hamlin's comment

src/hotspot/cpu/riscv/c1_Runtime1_riscv.cpp line 953:

> 951: __ ld(x17, Address(x17));
> 952: __ beq(klass, x17, success);
> 953: __ j(fail);

Hmm, I think this jump could be saved if we put a direct return here instead. Like:

    __ beq(klass, x17, success);
    __ mv(result, 0);
    __ ret();

What do you think? @Hamlin-Li @zifeihan

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23551#discussion_r1953629225

From bulasevich at openjdk.org Thu Feb 13 01:25:11 2025
From: bulasevich at openjdk.org (Boris Ulasevich)
Date: Thu, 13 Feb 2025 01:25:11 GMT
Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v10]
In-Reply-To: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com>
References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com>
Message-ID: <9sHQ2GZxt0TERM5ghWCA2hArWxsdIErWZIAEJ9e1N3I=.4928b81a-be09-43a8-94c6-75e7bd645ed9@github.com>

> This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache.
>
> The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density.
>
> Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1-2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark.
>
> The numbers. Immutable data constitutes **~30%** of the nmethod. Mutable data constitutes **~8%** of the nmethod. Example (statistics collected on the CodeCacheStress benchmark):
> - nmethod_count:134000, total_compilation_time: 510460ms
> - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms,
> - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB
>
> Functional testing: jtreg on arm/aarch/x86.
> Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks.
>
> Alternative solution (see comments): In the future, relocations can be moved to _immutable_data.

Boris Ulasevich has updated the pull request incrementally with two additional commits since the last revision:

 - Address review comments: cleanup, move fields to avoid padding, fix CodeBlob purge to call os::free, fix nmethod::print, update Layout description
 - add a separate adrp_movk function to support targets located more than 4GB away

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/21276/files
  - new: https://git.openjdk.org/jdk/pull/21276/files/04c1aa06..f2a9a7b2

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=21276&range=09
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21276&range=08-09

  Stats: 106 lines in 6 files changed: 43 ins; 28 del; 35 mod
  Patch: https://git.openjdk.org/jdk/pull/21276.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/21276/head:pull/21276

PR: https://git.openjdk.org/jdk/pull/21276

From bulasevich at openjdk.org Thu Feb 13 01:34:26 2025
From: bulasevich at openjdk.org (Boris Ulasevich)
Date: Thu, 13 Feb 2025 01:34:26 GMT
Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9]
In-Reply-To:
References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com>
Message-ID:

On Mon, 3 Feb 2025 15:26:23 GMT, Andrew Haley wrote:

> Do you want compressed OOPs to be moved out of CodeCache as well as uncompressed OOPs? If so, you should change `loadConNNode` in C2.

I have moved the OOPs table out of the CodeCache, but its contents remain unchanged - it still holds compressed or uncompressed pointers in the CodeHeap. As I understand it, I only need to adjust how OOPs are accessed without modifying anything else. With ShenandoahGC, I pass jtreg tests with UseCompressedOops both enabled and disabled, so there doesn't seem to be any issue. Please let me know if I'm mistaken.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2655228538

From bulasevich at openjdk.org Thu Feb 13 01:42:20 2025
From: bulasevich at openjdk.org (Boris Ulasevich)
Date: Thu, 13 Feb 2025 01:42:20 GMT
Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache
In-Reply-To: <96d307d2-136b-4cbd-9fd3-47e12e7afcc9@littlepinkcloud.com>
References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> <96d307d2-136b-4cbd-9fd3-47e12e7afcc9@littlepinkcloud.com>
Message-ID: <99SgFQfXKT1Gq2Gb4qcsxT3N0kz528qib-tZvX3itpY=.d25ead6c-b97f-4538-b2fe-c029edc4bf5f@github.com>

On Sat, 8 Feb 2025 10:40:04 GMT, Andrew Haley wrote:

> If we do decide to do this, please give the forced movk version of adrp() a new name, and have adrp() call it.

Right. Thanks for pointing that out. For now, I'll add an adrp_movk() variant to handle the extra movk.
------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2655236260 From bulasevich at openjdk.org Thu Feb 13 01:42:21 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 13 Feb 2025 01:42:21 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: <8Qdk6YuPBt5zXcv5DX0UaeOHHGRoNDFzysLurnZ7hsY=.62db16a3-fae0-4934-ac2c-91de1f95b689@github.com> References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> <8Qdk6YuPBt5zXcv5DX0UaeOHHGRoNDFzysLurnZ7hsY=.62db16a3-fae0-4934-ac2c-91de1f95b689@github.com> Message-ID: On Tue, 11 Feb 2025 23:09:29 GMT, Vladimir Kozlov wrote: > Good work. Here are my comments. I have addressed your comments. Thank you very much!! ------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2655238276 From xgong at openjdk.org Thu Feb 13 01:54:49 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 13 Feb 2025 01:54:49 GMT Subject: RFR: 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations Message-ID: Since PR [1] has added several new vector operations in VectorAPI and the X86 backend implementation for them, this patch adds the AArch64 backend part for NEON/SVE architectures. The performance of Vector API relative JMH micro benchmarks can improve about 70x ~ 95x on an AArch64 128-bit vector length sve2 architecture with different UseSVE options. 
Here are the gain details:

Benchmark                    (size)  Mode   Cnt  -XX:UseSVE=0  -XX:UseSVE=1  -XX:UseSVE=2
ByteMaxVector.SADD            1024   thrpt   30        80.69x        79.70x       80.534x
ByteMaxVector.SADDMasked      1024   thrpt   30        84.08x        85.72x       85.901x
ByteMaxVector.SSUB            1024   thrpt   30        80.46x        80.27x       81.063x
ByteMaxVector.SSUBMasked      1024   thrpt   30        83.96x        85.26x       85.887x
ByteMaxVector.SUADD           1024   thrpt   30        80.43x        80.36x       81.761x
ByteMaxVector.SUADDMasked     1024   thrpt   30        83.40x        84.62x       85.199x
ByteMaxVector.SUSUB           1024   thrpt   30        79.93x        79.22x       79.714x
ByteMaxVector.SUSUBMasked     1024   thrpt   30        82.93x        85.02x       84.726x
ByteMaxVector.UMAX            1024   thrpt   30        78.73x        77.39x       78.220x
ByteMaxVector.UMAXMasked      1024   thrpt   30        82.62x        84.77x       85.531x
ByteMaxVector.UMIN            1024   thrpt   30        79.04x        77.80x       78.471x
ByteMaxVector.UMINMasked      1024   thrpt   30        83.11x        84.86x       86.126x
IntMaxVector.SADD             1024   thrpt   30        83.11x        83.07x       83.183x
IntMaxVector.SADDMasked       1024   thrpt   30        90.67x        91.80x       93.162x
IntMaxVector.SSUB             1024   thrpt   30        83.37x        82.82x       83.317x
IntMaxVector.SSUBMasked       1024   thrpt   30        90.85x        92.87x       94.201x
IntMaxVector.SUADD            1024   thrpt   30        82.76x        81.78x       82.679x
IntMaxVector.SUADDMasked      1024   thrpt   30        90.49x        91.93x       93.155x
IntMaxVector.SUSUB            1024   thrpt   30        82.92x        82.34x       82.525x
IntMaxVector.SUSUBMasked      1024   thrpt   30        90.60x        92.12x       92.951x
IntMaxVector.UMAX             1024   thrpt   30        82.40x        81.85x       82.242x
IntMaxVector.UMAXMasked       1024   thrpt   30        90.30x        92.10x       92.587x
IntMaxVector.UMIN             1024   thrpt   30        82.84x        81.43x       82.801x
IntMaxVector.UMINMasked       1024   thrpt   30        90.43x        91.49x       92.678x
LongMaxVector.SADD            1024   thrpt   30        82.01x        81.74x       82.153x
LongMaxVector.SADDMasked      1024   thrpt   30        91.61x        92.69x       93.579x
LongMaxVector.SSUB            1024   thrpt   30        81.97x        81.42x       82.991x
LongMaxVector.SSUBMasked      1024   thrpt   30        91.34x        92.47x       93.026x
LongMaxVector.SUADD           1024   thrpt   30        82.44x        81.29x       82.506x
LongMaxVector.SUADDMasked     1024   thrpt   30        92.21x        92.35x       93.419x
LongMaxVector.SUSUB           1024   thrpt   30        82.04x        80.98x       81.761x
LongMaxVector.SUSUBMasked     1024   thrpt   30        91.74x        92.39x       93.375x
LongMaxVector.UMAX            1024   thrpt   30        81.59x        80.21x       82.162x
LongMaxVector.UMAXMasked      1024   thrpt   30        70.09x        92.89x       93.627x
LongMaxVector.UMIN            1024   thrpt   30        82.31x        81.95x       82.298x
LongMaxVector.UMINMasked      1024   thrpt   30        69.85x        92.19x       93.390x
ShortMaxVector.SADD           1024   thrpt   30        80.08x        79.15x       80.310x
ShortMaxVector.SADDMasked     1024   thrpt   30        90.74x        92.00x       93.743x
ShortMaxVector.SSUB           1024   thrpt   30        79.54x        78.67x       80.584x
ShortMaxVector.SSUBMasked     1024   thrpt   30        91.18x        92.10x       93.725x
ShortMaxVector.SUADD          1024   thrpt   30        79.86x        79.37x       80.372x
ShortMaxVector.SUADDMasked    1024   thrpt   30        90.17x        92.43x       93.759x
ShortMaxVector.SUSUB          1024   thrpt   30        79.78x        79.85x       80.744x
ShortMaxVector.SUSUBMasked    1024   thrpt   30        89.99x        91.91x       93.320x
ShortMaxVector.UMAX           1024   thrpt   30        79.87x        79.81x       80.518x
ShortMaxVector.UMAXMasked     1024   thrpt   30        89.69x        91.70x       92.826x
ShortMaxVector.UMIN           1024   thrpt   30        79.11x        77.98x       79.458x
ShortMaxVector.UMINMasked     1024   thrpt   30        90.49x        92.86x       93.323x

Tested with `hotspot::hotspot_all` and `jdk::jdk_all`, and no new regression is found.

[1] https://github.com/openjdk/jdk/pull/20507

-------------

Commit messages:
 - 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations

Changes: https://git.openjdk.org/jdk/pull/23608/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23608&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8349522
  Stats: 1137 lines in 8 files changed: 673 ins; 3 del; 461 mod
  Patch: https://git.openjdk.org/jdk/pull/23608.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23608/head:pull/23608

PR: https://git.openjdk.org/jdk/pull/23608

From cjplummer at openjdk.org Thu Feb 13 02:36:16 2025
From: cjplummer at openjdk.org (Chris Plummer)
Date: Thu, 13 Feb 2025 02:36:16 GMT
Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6]
In-Reply-To:
References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com>
Message-ID: <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com>

On Wed, 12
Feb 2025 16:28:32 GMT, Vladimir Kozlov wrote:

>> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table.
>>
>> Added C++ static asserts to make sure no virtual methods are added in the future.
>>
>> Fixed/cleaned SA code which processes CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob.
>>
>> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp
>
> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fix Zero VM build

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 118:

> 116: }
> 117:
> 118: public static Class getClassFor(Address addr) {

Did you consider using a lookup table here that is indexed using the kind value?

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 146:

> 144: }
> 145: }
> 146: return null;

Should this be an assert?

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 213:

> 211:
> 212: public boolean isUncommonTrapBlob() {
> 213: if (!VM.getVM().isServerCompiler()) return false;

Why is the check needed? Why not just return the result of `getKind() == UncommonTrapKind` below?

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 95:

> 93: }
> 94:
> 95: public CodeBlob createCodeBlobWrapper(Address cbAddr, Address start) {

I think the use of the name "start" here is a carryover from `findBlobUnsafe(Address start)`. I find it a very misleading name. cbAddr points to the "start" of the blob. "start" points somewhere in the middle of the blob. In fact callers of this API sometimes pass in findStart(addr) for cbAddr, which just adds to the confusion. Perhaps this is a good time to rename "start" to something else, although I can't come up with a good suggestion, but I think anything other than "start" would be an improvement.
Maybe "pcAddr". That aligns with the "for PC=" message below. Or maybe just "ptr" which aligns with `createCodeBlobWrapper(findStart(ptr), ptr);` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953665953 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953666268 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953667349 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953682557 From fyang at openjdk.org Thu Feb 13 02:44:11 2025 From: fyang at openjdk.org (Fei Yang) Date: Thu, 13 Feb 2025 02:44:11 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v2] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 10:02:52 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch to implement MulReductionVI/MulReductionVL/MulReductionVF/MulReductionVD? >> This optimization is mainly for the vector API. >> On riscv, there is no straightforward instructions to do it, but we can do it with a reduction tree, which could reduce the time complexity to lg(N). 
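
[Editorial illustration, not part of the quoted message.] The "reduction tree" idea in the quoted description can be sketched in scalar Java as below. This is only a model of the technique under stated assumptions: the class and method names are hypothetical, the lane count is assumed to be a non-zero power of two, and nothing here reflects the actual RISC-V vector instruction sequence used in the patch.

```java
// Hypothetical scalar model of a multiply-reduction tree; not code from the patch.
final class TreeMulReduction {
    // Multiplies all lanes together in log2(N) pairwise steps.
    // Assumes lanes.length is a non-zero power of two.
    static long mulReduce(long[] lanes) {
        long[] work = lanes.clone();
        for (int half = work.length / 2; half >= 1; half /= 2) {
            // Each step multiplies the lower half by the upper half, element-wise.
            // On a vector unit this inner loop would correspond to one vector multiply.
            for (int i = 0; i < half; i++) {
                work[i] *= work[i + half];
            }
        }
        return work[0];
    }
}
```

With 8 lanes the outer loop runs 3 times (half = 4, 2, 1), compared with 7 dependent multiplies for a lane-by-lane scalar reduction, which is where the lg(N) figure comes from.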
>>
>>
>> Thanks
>>
>> ## Test
>>
>> ### jtreg
>> test/jdk/jdk/incubator/vector/
>>
>> ### Performance
>>
>> run on bananapi
>>
>> master vs patch
>>
>> Benchmark | (size) | Mode | Cnt | Score - master | Error - master | Score - patch | Error - patch | Units | Improvement
>> -- | -- | -- | -- | -- | -- | -- | -- | -- | --
>> ByteMaxVector.MULLanes | 1024 | avgt | 10 | 11170.052 | 499.676 | 1294.424 | 8.346 | ns/op | 88.40%
>> ByteMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 12514.587 | 55.06 | 1413.028 | 0.259 | ns/op | 88.70%
>> DoubleMaxVector.MULLanes | 1024 | avgt | 10 | 57672.51 | 1750.417 | 4775.633 | 1.454 | ns/op | 91.70%
>> DoubleMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63656.523 | 1063.048 | 5899.259 | 1.692 | ns/op | 90.70%
>> FloatMaxVector.MULLanes | 1024 | avgt | 10 | 30997.218 | 728.73 | 2473.069 | 5.84 | ns/op | 92.00%
>> FloatMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 35515.329 | 227.873 | 3284.17 | 0.608 | ns/op | 90.80%
>> IntMaxVector.MULLanes | 1024 | avgt | 10 | 31130.453 | 878.261 | 3304.118 | 5.96 | ns/op | 89.40%
>> IntMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 36851.976 | 394.001 | 3969.407 | 0.511 | ns/op | 89.20%
>> LongMaxVector.MULLanes | 1024 | avgt | 10 | 58795.752 | 1030.985 | 6883.995 | 3.15 | ns/op | 88.30%
>> LongMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63642.904 | 386.521 | 7892.735 | 9.359 | ns/op | 87.60%
>> ShortMaxVector.MULLanes | 1024 | avgt | 10 | 16857.441 | 762.428 | 2287.141 | 0.186 | ns/op | 86.40%
>> ShortMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 21171.375 | 74.684 | 2532.913 | 0.274 | ns/op | 88.00%
>>
>>
>
> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision:
>
>   comments

Hi, Some comments after a cursory look. Will have a closer look later.

BTW: How should I understand the JMH data? 11170.052 ns/op before compared to 1294.424 ns/op after for ByteMaxVector.MULLanes, but the improvement says 88.40%?
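
[Editorial illustration, not part of the message.] One plausible reading of the "Improvement" column, assuming it was computed as the fraction of time saved, (master - patch) / master, is sketched below. That formula is an assumption, since the thread does not state how the column was derived, and the class and method names are hypothetical.

```java
// Hypothetical helper; the formula behind the "Improvement" column is an assumption.
final class JmhImprovement {
    // Percentage of per-operation time saved (for avgt mode, lower ns/op is better).
    static double timeReductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    // How many times faster the patched run is than the baseline.
    static double speedup(double before, double after) {
        return before / after;
    }
}
```

For ByteMaxVector.MULLanes, `timeReductionPercent(11170.052, 1294.424)` is about 88.41, which agrees with the 88.40% shown in the table, while `speedup(11170.052, 1294.424)` is about 8.63. The same data can therefore be read as "88% less time per operation" or "roughly 8.6x faster".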
src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 2961: > 2959: BasicType bt, uint vector_length, VectorMask vm) { > 2960: assert(bt == T_BYTE || bt == T_SHORT || bt == T_INT || bt == T_LONG, "unsupported element type"); > 2961: uint len = vector_length/type2aelembytes(bt); Please put a space before and after the divide operator. src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 3012: > 3010: lui(t0, 0x3f800000); // 1.0f > 3011: } else { > 3012: lui(t0, 0x3ff00000); // 1.0d Can you use `mv` instead of `lui`, like other places? src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp line 242: > 240: VectorMask vm = Assembler::unmasked); > 241: > 242: void reduce_mul_integer_v(Register dst, Register src1, VectorRegister src2, Or maybe `reduce_mul_integral_v` which will be consistent in naming with friends like `reduce_integral_v`? ------------- PR Review: https://git.openjdk.org/jdk/pull/23580#pullrequestreview-2613670192 PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1953687496 PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1953688274 PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1953681150 From kvn at openjdk.org Thu Feb 13 03:09:23 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 03:09:23 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v10] In-Reply-To: <9sHQ2GZxt0TERM5ghWCA2hArWxsdIErWZIAEJ9e1N3I=.4928b81a-be09-43a8-94c6-75e7bd645ed9@github.com> References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> <9sHQ2GZxt0TERM5ghWCA2hArWxsdIErWZIAEJ9e1N3I=.4928b81a-be09-43a8-94c6-75e7bd645ed9@github.com> Message-ID: On Thu, 13 Feb 2025 01:25:11 GMT, Boris Ulasevich wrote: >> This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache. 
>>
>> The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density.
>>
>> Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1-2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark.
>>
>> The numbers. Immutable data constitutes **~30%** of the nmethod. Mutable data constitutes **~8%** of the nmethod. Example (statistics collected on the CodeCacheStress benchmark):
>> - nmethod_count:134000, total_compilation_time: 510460ms
>> - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms,
>> - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB
>>
>> Functional testing: jtreg on arm/aarch/x86.
>> Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks.
>>
>> Alternative solution (see comments): In the future, relocations can be moved to _immutable_data.
>
> Boris Ulasevich has updated the pull request incrementally with two additional commits since the last revision:
>
>  - Address review comments: cleanup, move fields to avoid padding, fix CodeBlob purge to call os::free, fix nmethod::print, update Layout description
>  - add a separate adrp_movk function to support targets located more than 4GB away

Looks good. I will submit testing.
-------------

PR Review: https://git.openjdk.org/jdk/pull/21276#pullrequestreview-2613717350

From fyang at openjdk.org Thu Feb 13 03:14:10 2025
From: fyang at openjdk.org (Fei Yang)
Date: Thu, 13 Feb 2025 03:14:10 GMT
Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash
In-Reply-To:
References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com>
Message-ID:

On Wed, 12 Feb 2025 20:24:05 GMT, Dean Long wrote:

> > FYI: `test/hotspot/jtreg/compiler/jsr292/MHDeoptTest.java` and hs-tier1 test good on linux-riscv64 with fastdebug build.
>
> A good sanity check is to remove the fix in deoptimization.cpp and see if the new test triggers the new asserts.

Yeah! The new test triggers if I revert the fix.

#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/home/ubuntu/jdk/src/hotspot/cpu/riscv/abstractInterpreter_riscv.cpp:145), pid=95195, tid=95217
# assert(locals >= interpreter_frame->sender_sp() + max_locals - 1) failed: bad placement
#
# JRE version: OpenJDK Runtime Environment (25.0) (fastdebug build 25-internal-adhoc.ubuntu.jdk)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 25-internal-adhoc.ubuntu.jdk, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-riscv64)
# Problematic frame:
# V [libjvm.so+0x2e1204] AbstractInterpreter::layout_activation(Method*, int, int, int, int, int, int, frame*, frame*, bool, bool)+0x3fa
#
# Core dump will be written.
Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/ubuntu/jdk/build/linux-riscv64- server-fastdebug/test-support/jtreg_test_hotspot_jtreg_compiler_jsr292_MHDeoptTest_java/scratch/0/core.95195) # # If you would like to submit a bug report, please visit: # https://bugreport.java.com/bugreport/crash.jsp # ------------- PR Comment: https://git.openjdk.org/jdk/pull/23557#issuecomment-2655355838 From kvn at openjdk.org Thu Feb 13 03:43:17 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 03:43:17 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> Message-ID: <8VFudK82JuBbjj_s74lDlHd1TWurW8uiBbw2DutA-PU=.ec26075e-89ad-4caf-ae3f-f50e5407a5f6@github.com> On Thu, 13 Feb 2025 02:06:57 GMT, Chris Plummer wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero VM build > > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 118: > >> 116: } >> 117: >> 118: public static Class getClassFor(Address addr) { > > Did you consider using a lookup table here that is indexed using the kind value? Example please. > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 146: > >> 144: } >> 145: } >> 146: return null; > > Should this be an assert? I don't think we need it - the caller `CodeCache.createCodeBlobWrapper()` will throw `RuntimeException` when `null` is returned. 
> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 213: > >> 211: >> 212: public boolean isUncommonTrapBlob() { >> 213: if (!VM.getVM().isServerCompiler()) return false; > > Why is the check needed? Why not just return the value `getKind() == UncommonTrapKind` result below? `UncommonTrapKind` and `ExceptionKind` are not initialized for Client VM because corresponding `CodeBlobKind` values are not defined. See `CodeBlob.initialize()`. Their not initialized value will be 0 which matches `CodeBlobKind::None` value. Returning true in such case will be incorrect. > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 95: > >> 93: } >> 94: >> 95: public CodeBlob createCodeBlobWrapper(Address cbAddr, Address start) { > > I think the use of the name "start" here is a carryover from `findBlobUnsafe(Address start)`. I find it a very misleading name. cbAddr points to the "start" of the blob. "start" points somewhere in the middle of the blob. In fact callers of this API somethimes pass in findStart(addr) for cbAddr, which just adds to the confusion. Perhaps this is a good time to rename "start" to something else, although I can't come up with a good suggestion, but I think anything other than "start" would be an improvement. Maybe "pcAddr". That aligns with the "for PC=" message below. Or maybe just "ptr" which aligns with `createCodeBlobWrapper(findStart(ptr), ptr);` `cbPc` with comment explaining that it could be inside code blob. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953732919 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953733212 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953738572 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953745389 From gcao at openjdk.org Thu Feb 13 04:23:17 2025 From: gcao at openjdk.org (Gui Cao) Date: Thu, 13 Feb 2025 04:23:17 GMT Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic [v2] In-Reply-To: References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> Message-ID: On Thu, 13 Feb 2025 01:14:37 GMT, Fei Yang wrote: >> Gui Cao has updated the pull request incrementally with one additional commit since the last revision: >> >> Update for Hamlin's comment > > src/hotspot/cpu/riscv/c1_Runtime1_riscv.cpp line 953: > >> 951: __ ld(x17, Address(x17)); >> 952: __ beq(klass, x17, success); >> 953: __ j(fail); > > Hmm, I think this jump could be saved if we put a direct return here instead. Like: > > __ beq(klass, x17, success); > __ mv(result, 0); > __ ret(); > > What do you think? @Hamlin-Li @zifeihan Yes, I think we can put a direct return here. 
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23551#discussion_r1953780219

From cjplummer at openjdk.org Thu Feb 13 05:22:14 2025
From: cjplummer at openjdk.org (Chris Plummer)
Date: Thu, 13 Feb 2025 05:22:14 GMT
Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6]
In-Reply-To: <8VFudK82JuBbjj_s74lDlHd1TWurW8uiBbw2DutA-PU=.ec26075e-89ad-4caf-ae3f-f50e5407a5f6@github.com>
References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> <8VFudK82JuBbjj_s74lDlHd1TWurW8uiBbw2DutA-PU=.ec26075e-89ad-4caf-ae3f-f50e5407a5f6@github.com>
Message-ID:

On Thu, 13 Feb 2025 03:26:19 GMT, Vladimir Kozlov wrote:

>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 118:
>>
>>> 116: }
>>> 117:
>>> 118: public static Class getClassFor(Address addr) {
>>
>> Did you consider using a lookup table here that is indexed using the kind value?
>
> Example please.

    // Table of wrapper classes, indexed by the blob's kind value.
    static Class[] wrapperClasses = new Class[Number_Of_Kinds];
    wrapperClasses[NMethodKind] = NMethodBlob.class;
    wrapperClasses[BufferKind] = BufferBlob.class;
    ...;
    wrapperClasses[SafepointKind] = SafepointBlob.class;

    CodeBlob cb = new CodeBlob(addr);
    return wrapperClasses[cb.getKind()];

>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 146:
>>
>>> 144: }
>>> 145: }
>>> 146: return null;
>>
>> Should this be an assert?
>
> I don't think we need it - the caller `CodeCache.createCodeBlobWrapper()` will throw `RuntimeException` when `null` is returned.

I guess my real question is whether or not it can be considered normal behavior to return null. It seems it should never happen, which is why I was suggesting an assert.
>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 213: >> >>> 211: >>> 212: public boolean isUncommonTrapBlob() { >>> 213: if (!VM.getVM().isServerCompiler()) return false; >> >> Why is the check needed? Why not just return the result of `getKind() == UncommonTrapKind` below? > > `UncommonTrapKind` and `ExceptionKind` are not initialized for Client VM because corresponding `CodeBlobKind` values are not defined. See `CodeBlob.initialize()`. > Their uninitialized value will be 0, which matches the `CodeBlobKind::None` value. Returning true in such a case will be incorrect. Ok. Leaving UncommonTrapKind and ExceptionKind uninitialized seems a bit error-prone. Perhaps they can be given some sort of INVALID value. >> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 95: >> >>> 93: } >>> 94: >>> 95: public CodeBlob createCodeBlobWrapper(Address cbAddr, Address start) { >> >> I think the use of the name "start" here is a carryover from `findBlobUnsafe(Address start)`. I find it a very misleading name. cbAddr points to the "start" of the blob. "start" points somewhere in the middle of the blob. In fact callers of this API sometimes pass in findStart(addr) for cbAddr, which just adds to the confusion. Perhaps this is a good time to rename "start" to something else, although I can't come up with a good suggestion, but I think anything other than "start" would be an improvement. Maybe "pcAddr". That aligns with the "for PC=" message below. Or maybe just "ptr" which aligns with `createCodeBlobWrapper(findStart(ptr), ptr);` > > `cbPc` with comment explaining that it could be inside code blob. That sounds fine.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953818292 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953819796 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953821968 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953822595 From rrich at openjdk.org Thu Feb 13 06:52:10 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 13 Feb 2025 06:52:10 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v2] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Wed, 12 Feb 2025 21:11:36 GMT, Dean Long wrote: > I just pushed a fix for the s390 and ppc bounds check logic, but I'm still not sure if I am using the correct values for the end of the frame. Testing on ppc64 looks good so far. Will put the change through our CI testing. > The asserts should pass with the deoptimization.cpp fix. The 2nd assert should fail w/o the deoptimization.cpp fix when running the new test. The 2nd assert does not fail w/o the deoptimization.cpp fix. Might be due to alignement of caller->sp() in the interpreter. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23557#issuecomment-2655682065 From jrose at openjdk.org Thu Feb 13 07:44:19 2025 From: jrose at openjdk.org (John R Rose) Date: Thu, 13 Feb 2025 07:44:19 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Wed, 12 Feb 2025 16:28:32 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. 
>> >> Fixed/cleaned SA code which processes CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero VM build I've read the code and it looks good. I find myself wishing for a few more comments to guide me, especially in knowing which methods to pay attention to, and which to ignore as "pure plumbing". The array of vptr-ptrs is the key element. It seems to work nicely. There are lots of regularizations here, which I enjoy. But the new code has (to me) distracting irregularities. Why define one Vptr as a struct and others as classes? Did we really regularize the names of all the print functions (they were irregular before)? I was glad to see lots of magic code deleted from SA. Although, having to look at SA at all is annoying! I noticed a lot of churn in "innocent bystander" client code that looks like this: p2i(_frame.pc()), decode_offset); - nm()->print_on(&ss); + nm()->print_on_v(&ss); nm()->method()->print_codes_on(&ss); What is the client maintainer (or any casual reader) supposed to get from the "_v" suffix? I know we have made the "v/nv" distinction before, but it is rather obscure, not documented here. Is it described elsewhere in our code base? Our use of it here should be documented in codeBlob.hpp. Normally, we try to keep client APIs invariant while doing refactorings like this, so as to avoid touching all the client code. In this case, we have to use a new naming convention to distinguish all versions of (say) print_on: M. The implementation in each CB class K, which can be private if K::Vptr is a friend. P. The public API point, used outside of the CB classes, as well as inside. V. The name of the virtual function defined by each K::Vptr.
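The three roles listed above (M, P, V) can be sketched in a few lines. This is only an illustration under stated assumptions, not the code from the PR: `Blob`, `Nmethod`, `Kind` and `describe` are hypothetical stand-ins, the table lookup is one possible way to map a kind to its helper, and `std::is_polymorphic` is just one way to write the "no virtual methods" static assert that the PR description mentions.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <type_traits>

// Hypothetical stand-ins for the blob kinds; not the HotSpot definitions.
enum class Kind : int { Buffer = 0, Nmethod = 1, Count = 2 };

class Blob {
 public:
  // V: the virtual function lives in a stateless helper, not in Blob itself.
  struct Vptr {
    virtual void print_on(const Blob* instance, std::ostream& st) const {
      instance->print_on_impl(st);
    }
  };

  explicit Blob(Kind kind) : _kind(kind) {}

  // P: public entry point with the plain name; it dispatches on _kind.
  void print_on(std::ostream& st) const { vptr()->print_on(this, st); }

 protected:
  // M: the per-class implementation, reached only through a Vptr.
  void print_on_impl(std::ostream& st) const { st << "Blob"; }

 private:
  const Vptr* vptr() const;  // looks up the helper object for _kind
  Kind _kind;
};

class Nmethod : public Blob {
 public:
  Nmethod() : Blob(Kind::Nmethod) {}

  struct Vptr : public Blob::Vptr {
    void print_on(const Blob* instance, std::ostream& st) const override {
      static_cast<const Nmethod*>(instance)->print_on_impl(st);
    }
  };

 private:
  // M for this class; the nested Vptr may call it without a friend declaration.
  void print_on_impl(std::ostream& st) const { st << "Nmethod"; }
};

// One helper object per kind, indexed by the kind value.
const Blob::Vptr* Blob::vptr() const {
  static const Blob::Vptr blob_vptr{};
  static const Nmethod::Vptr nmethod_vptr{};
  static const Blob::Vptr* const table[] = { &blob_vptr, &nmethod_vptr };
  const int index = static_cast<int>(_kind);
  assert(index >= 0 && index < static_cast<int>(Kind::Count));
  return table[index];
}

// The blob classes themselves carry no hidden vtable pointer.
static_assert(!std::is_polymorphic<Blob>::value, "Blob must not have virtual methods");
static_assert(!std::is_polymorphic<Nmethod>::value, "Nmethod must not have virtual methods");

// Convenience wrapper so a caller can get the printed form as a string.
inline std::string describe(const Blob& blob) {
  std::ostringstream out;
  blob.print_on(out);
  return out.str();
}
```

With this layout client code keeps calling `print_on` and never sees a suffix; only the helper objects know about `print_on_impl`.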
I would expect P to have the "nice name" like print_on, not print_on_v, while the private method M would be print_on_impl or print_on_nv, and never called except from Vptr or other methods of the same name. But any convention will work, as long as it is documented and held to consistently. I'm sympathetic to both Andrew's call for macro-enforced regularity, and Vladimir's objection that macros make things hard to follow. If macros won't work for us here, let's define a documented pattern and stick to it closely, documenting our decisions as we go. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2655760868 From duke at openjdk.org Thu Feb 13 07:55:12 2025 From: duke at openjdk.org (Nicole Xu) Date: Thu, 13 Feb 2025 07:55:12 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 18:55:25 GMT, Emanuel Peter wrote: >> Suite MaskedLogicOpts.maskedLogicOperationsLong512() failed on both x86 and AArch64 with the following error: >> >> >> java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249 >> >> >> The variable `long256_arr_idx` is misused when indexing 'LongVector l2, l3, l4, l5' in function `maskedLogicOperationsLongKernel()`. 'long256_arr_idx' increases by 4 every time the benchmark runs and ensures the incremented value remains within the bounds of the array. However, for `LongVector.SPECIES_512`, it loads 8 numbers from the array each time the benchmark runs, resulting in an out-of-range indexing issue. >> >> Hence, we revised the index variables from `long256_arr_idx` to `long512_arr_idx`, which has a stride of 8, to ensure that the loaded vector is inside of the array boundary for all vector species. This is also consistent with other kernel functions. >> >> Additionally, some defined but unused variables have been removed. > > Oh, the OCA-verify is still stuck. I'm sorry about that ?
> I pinged my manager @TobiHartmann , he will reach out to see what's the issue. Hi @eme64, do you see any risks here? Would you please help to review the patch? Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2655780275 From aboldtch at openjdk.org Thu Feb 13 08:32:21 2025 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Thu, 13 Feb 2025 08:32:21 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Wed, 12 Feb 2025 16:28:32 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero VM build Similar to what @rose00 noted I think the `_v` and `_nv` suffixes are unfortunate in the public API. Maybe it we could add a protected `x_impl` containing the implementation, then dispatch to the correct one based on _kind, using the Vptr abstraction. And have the normal print_on method use this. We could let our leaf types to directly call the specific implementation, not that I think that our print functions require compile time devirtualisation. There are many solutions here with their pros and cons. src/hotspot/share/code/codeBlob.hpp line 140: > 138: instance->print_value_on_nv(st); > 139: } > 140: }; I wonder why the base class is not abstract. 
AFAICT `print_value_on` is unreachable and `print_on` is only used by `DeoptimizationBlob::Vptr` which also seems like a behavioural change, as before this patch calling `print_on` on a `DeoptimizationBlob` object would dispatch to `SingletonBlob::print_on` not `CodeBlob::print_on`. Suggestion: struct Vptr { virtual void print_on(const CodeBlob* instance, outputStream* st) const = 0; virtual void print_value_on(const CodeBlob* instance, outputStream* st) const = 0; }; src/hotspot/share/code/codeBlob.hpp line 339: > 337: void print_value_on(outputStream* st) const; > 338: > 339: class Vptr : public CodeBlob::Vptr { I wonder if these should share the same type hierarchy as their container class. This would also solve the issue I noted in my other comment about not calling the correct `print_on`. Suggestion: class Vptr : public RuntimeBlob::Vptr { src/hotspot/share/code/codeBlob.hpp line 427: > 425: void print_value_on(outputStream* st) const; > 426: > 427: class Vptr : public CodeBlob::Vptr { Suggestion: class Vptr : public RuntimeBlob::Vptr { src/hotspot/share/code/codeBlob.hpp line 467: > 465: void print_value_on(outputStream* st) const; > 466: > 467: class Vptr : public CodeBlob::Vptr { Suggestion: class Vptr : public RuntimeBlob::Vptr { src/hotspot/share/code/codeBlob.hpp line 553: > 551: void print_value_on(outputStream* st) const; > 552: > 553: class Vptr : public CodeBlob::Vptr { This one specifically Suggestion: class Vptr : public SingletonBlob::Vptr { src/hotspot/share/code/codeBlob.hpp line 679: > 677: void print_value_on(outputStream* st) const; > 678: > 679: class Vptr : public CodeBlob::Vptr { Suggestion: class Vptr : public RuntimeBlob::Vptr { ------------- PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2614177723 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954019308 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954024528 PR Review Comment:
https://git.openjdk.org/jdk/pull/23533#discussion_r1954028620 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954028940 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954027733 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954029504 From epeter at openjdk.org Thu Feb 13 08:34:14 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 08:34:14 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 07:52:30 GMT, Nicole Xu wrote: >> Oh, the OCA-verify is still stuck. I'm sorry about that ? >> I pinged my manager @TobiHartmann , he will reach out to see what's the issue. > > Hi @eme64, do you see any risks here? Would you please help to review the patch? Thanks. @xyyNicole This looks reasonable to me. But I do want the original author @jatin-bhateja to look at it too. I'll send him an email as he has not reacted to pings via GitHub yet ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2655866059 From epeter at openjdk.org Thu Feb 13 08:40:14 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 08:40:14 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v26] In-Reply-To: References: <6sWBRolcCZXOe1pXDSfyBUvtfEuzV1MdMXUVpji42_4=.6c5f7da0-ee1d-485f-a99d-d8c520002dbd@github.com> Message-ID: On Sat, 1 Feb 2025 19:31:00 GMT, Johannes Graham wrote: >> Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: >> >> add IR tests for long, simplify tests for int > > Thanks. I've done another round of format fixing. I've also simplified the IR tests so they don't try to cover as much as gtest does, and added equivalent tests for long. > > I have temporarily left the more elaborate tests commented out in XorINodeIdealizationTests. 
I will remove them if nobody thinks they are worth keeping. @j3graham Can you please update the PR description at the top? The current version does not reflect the most up-to-date explanations, right? I would like to see a nice summary, what cases were covered before and what cases you are now covering additionally. Give a quick explanation how you changed the code. I'll increase the number of reviewers as this looks like a substantial change. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23089#issuecomment-2655880238 From duke at openjdk.org Thu Feb 13 08:42:24 2025 From: duke at openjdk.org (Nicole Xu) Date: Thu, 13 Feb 2025 08:42:24 GMT Subject: RFR: 8349943: [JMH] Use jvmArgs consistently Message-ID: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> As is suggested in [JDK-8342958](https://bugs.openjdk.org/browse/JDK-8342958), `jvmArgs` should be used consistently in microbenchmarks to 'align with the intuition that when you use jvmArgsAppend/-Prepend intent is to add to a set of existing flags, while if you supply jvmArgs intent is "run with these and nothing else"'. All the previous flags were aligned in https://github.com/openjdk/jdk/pull/21683, while some recent tests use inconsistent `jvmArgs` again. We update them to keep the consistency. 
------------- Commit messages: - 8349943: [JMH] Use jvmArgs consistently Changes: https://git.openjdk.org/jdk/pull/23609/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23609&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349943 Stats: 20 lines in 9 files changed: 2 ins; 0 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/23609.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23609/head:pull/23609 PR: https://git.openjdk.org/jdk/pull/23609 From epeter at openjdk.org Thu Feb 13 08:46:13 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 08:46:13 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v27] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:48:29 GMT, Johannes Graham wrote: >> C2 does not eliminate XOR nodes with constant arguments. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. >> >> This patch demonstrates a potential fix to the problem, but there might well be better ways to do it. > > Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: > > formatting, remove commented tests I also see that https://github.com/openjdk/jdk/pull/2776 and https://github.com/openjdk/jdk/pull/4136 were mentioned here. Both of those are related an have no IR tests of their own, yikes! We have to ensure that we cover those old cases, and then new ones here, so that we do not get any accidental regressions. Maybe that's all already covered in other existing tests or the tests you added. Can you please provide a summary of all tests and what cases they cover in the PR description? It would help a lot for reviewing. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23089#issuecomment-2655893261 From aph at openjdk.org Thu Feb 13 08:52:09 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 13 Feb 2025 08:52:09 GMT Subject: RFR: 8347917: AArch64: Enable upper GPR registers in C1 [v4] In-Reply-To: References: Message-ID: On Thu, 30 Jan 2025 08:32:25 GMT, Dmitry Chuyko wrote: >> This small change enables upper GPR registers in C1 so they are used, and used similar to C2. r19-r26 are declared as caller-saved and enabled, r27 (rheapbase) is declared caller-saved, r27 (rheapbase) and r29 (fp) are enabled conditionally similar to C2. r29 is already handled in MacroAssembler::build_frame()/remove_frame(). >> >> r18 is excluded on macOS and Windows as before. r27 is excluded when `UseCompressedOops` is on and `CompressedOops::base() != nullptr`; r29 is excluded when `PreserveFramePointer` is on. >> >> Registers are declared caller-saved in c1_FrameMap_aarch64.cpp, conditionally enabled ones are in the tail of enabled range which is adjusted in c1_FrameMap_aarch64.hpp, the code there was made similar to x86 (JDK-6985015). >> >> Register ranges are also updated in the linear scan itself and in OOP map generation. >> >> Having more allocatable registers helps to avoid spills in register-hungry code and thus improves performance and code density and simplifies compilation. In practice the code that operates on so many values is not too frequent and upper registers are used less frequently than first ones. To perform testing it turned out to be useful to run C1 in a special mode when registers are allocated from upper to lower in LinearScanWalker::find_free_reg(): >> >> >> - for (int i = _first_reg; i <= _last_reg; i++) { >> + for (int i = _last_reg; i >= _first_reg; i--) { >> >> >> It was also useful to run the JVM with C1 compilation only and with different GCs and small heaps like `-XX:TieredStopAtLevel=1 -Xmx256m -XX:+UseSerialGC`.
>> >> Tier1-3 jtreg tests showed no regression on linux-aarch64 (release, slowdebug, Xcomp) with either direct or reversed register allocation order. Windows and macOS were also tested to check r18 handling, +-CompressedOops and +-PreserveFramePointer combinations were tested. >> >> SHA3 Java implementation is as an example of register hungry code. Throughput results greatly depend on the actual CPU being used. On Graviton 2 the improvement in the dedicated micro-benchmark is ~**19%** for longer arrays (`-XX:TieredStopAtLevel=1 -XX:+UnlockDiagnosticVMOptions -XX:-UseSHA3Intrinsics -jar ../benchmarks.jar -f 1 -wi 2 -i 3 -p digesterName=SHA3-256 -p length=16384 -jvmArgsAppend="-XX:-UseCompressedOops -XX:-PreserveFramePointer -Xmx31g -Xlog:gc+heap+coops=debug" MessageDigests.digest$`). > > Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: > > Accurate caller-saved regs definition On 2/12/25 22:25, Dmitry Chuyko wrote: > Just a few things to keep here: > > 1. Even for aarch64 just reversing allocation order is not enough (callee preserved regs are saved in a caller). > 2. Register saving overhead for runtime calls is there, but making a call without saving is still expensive. I don't quite understand what you're saying here. In the first sentence you seem to imply that callee preserved regs are still saved in the caller, unnecessarily. In the second sentence you say "saving overhead for runtime calls is there," which seems to imply that there is some advantage to using a callee-saved register for runtime calls. Clearly this issue only applies to runtime calls, because Java has no callee preserved regs. What conclusion do you make from the benchmark you presented? That the overhead of making a call from C1-compiled code is great, especially when there are many spills? -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 ------------- PR Comment: https://git.openjdk.org/jdk/pull/23152#issuecomment-2655906265 From mli at openjdk.org Thu Feb 13 09:04:24 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 13 Feb 2025 09:04:24 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v3] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch to implement MulReductionVI/MulReductionVL/MulReductionVF/MulReductionVD? > This optimization is mainly for the vector API. > On riscv, there is no straightforward instructions to do it, but we can do it with a reduction tree, which could reduce the time complexity to lg(N). > > > Thanks > > ## Test > > ### jtreg > test/jdk/jdk/incubator/vector/ > > ### Performance > > run on bananapi > > master vs patch > > Benchmark | (size) | Mode | Cnt | Score - master | Error - patch | Score - patch | Error - patch | Units | Improvement > -- | -- | -- | -- | -- | -- | -- | -- | -- | -- > ByteMaxVector.MULLanes | 1024 | avgt | 10 | 11170.052 | 499.676 | 1294.424 | 8.346 | ns/op | 88.40% > ByteMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 12514.587 | 55.06 | 1413.028 | 0.259 | ns/op | 88.70% > DoubleMaxVector.MULLanes | 1024 | avgt | 10 | 57672.51 | 1750.417 | 4775.633 | 1.454 | ns/op | 91.70% > DoubleMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63656.523 | 1063.048 | 5899.259 | 1.692 | ns/op | 90.70% > FloatMaxVector.MULLanes | 1024 | avgt | 10 | 30997.218 | 728.73 | 2473.069 | 5.84 | ns/op | 92.00% > FloatMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 35515.329 | 227.873 | 3284.17 | 0.608 | ns/op | 90.80% > IntMaxVector.MULLanes | 1024 | avgt | 10 | 31130.453 | 878.261 | 3304.118 | 5.96 | ns/op | 89.40% > IntMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 36851.976 | 394.001 | 3969.407 | 0.511 | ns/op | 89.20% > LongMaxVector.MULLanes | 1024 | avgt | 10 | 58795.752 | 1030.985 | 6883.995 | 3.15 | ns/op | 88.30% > LongMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 
63642.904 | 386.521 | 7892.735 | 9.359 | ns/op | 87.60% > ShortMaxVector.MULLanes | 1024 | avgt | 10 | 16857.441 | 762.428 | 2287.141 | 0.186 | ns/op | 86.40% > ShortMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 21171.375 | 74.684 | 2532.913 | 0.274 | ns/op | 88.00% > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: mimor ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23580/files - new: https://git.openjdk.org/jdk/pull/23580/files/5d460b1c..0dedc1bc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23580&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23580&range=01-02 Stats: 20 lines in 3 files changed: 0 ins; 0 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/23580.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23580/head:pull/23580 PR: https://git.openjdk.org/jdk/pull/23580 From mli at openjdk.org Thu Feb 13 09:04:25 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 13 Feb 2025 09:04:25 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v2] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 02:41:47 GMT, Fei Yang wrote: > Hi, Some comments after a cursory look. Will have a more closer look later. > > BTW: How should I understand the JMH data? 11170.052 ns/op before compared to 1294.424 ns/op after for ByteMaxVector.MULLanes, but the improvement says 88.40%? Thanks for having a look. `Improvement` is calculated by (`master` - `patch`) / `master`, means how much time is saved by new implementation compared to the old one. > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 2961: > >> 2959: BasicType bt, uint vector_length, VectorMask vm) { >> 2960: assert(bt == T_BYTE || bt == T_SHORT || bt == T_INT || bt == T_LONG, "unsupported element type"); >> 2961: uint len = vector_length/type2aelembytes(bt); > > Please put a space before and after the divide operator. fixed. 
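The reduction tree mentioned in the 8321003 description can be modelled in scalar code. The following is a minimal sketch under stated assumptions, not the RISC-V implementation from the patch: `mul_reduce_tree` is a hypothetical helper, the lane count is assumed to be a power of two, and `uint32_t` is used so that overflow wraps the way the low 32 bits of a Java `int` product would.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of a multiply reduction tree over `lanes`.
// If `rounds` is non-null it receives the number of halving steps taken.
inline uint32_t mul_reduce_tree(std::vector<uint32_t> lanes, int* rounds = nullptr) {
  // The sketch only handles a non-empty, power-of-two number of lanes.
  assert(!lanes.empty() && (lanes.size() & (lanes.size() - 1)) == 0);
  int steps = 0;
  for (std::size_t half = lanes.size() / 2; half >= 1; half /= 2) {
    // Multiply the upper half into the lower half. Conceptually this inner
    // loop is a single vector-wide multiply on half as many lanes.
    for (std::size_t i = 0; i < half; i++) {
      lanes[i] *= lanes[i + half];
    }
    steps++;
  }
  if (rounds != nullptr) {
    *rounds = steps;
  }
  return lanes[0];
}
```

For N lanes the outer loop runs log2(N) times, which is where the lg(N) figure in the description comes from; a plain left-to-right product would need N - 1 dependent multiplies.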
> src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 3012: > >> 3010: lui(t0, 0x3f800000); // 1.0f >> 3011: } else { >> 3012: lui(t0, 0x3ff00000); // 1.0d > > Can you use `mv` instead of `lui`, like other places? fixed. > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp line 242: > >> 240: VectorMask vm = Assembler::unmasked); >> 241: >> 242: void reduce_mul_integer_v(Register dst, Register src1, VectorRegister src2, > > Or maybe `reduce_mul_integral_v` which will be consistent in naming with friends like `reduce_integral_v`? fixed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23580#issuecomment-2655934207 PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1954089593 PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1954089906 PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1954089415 From mli at openjdk.org Thu Feb 13 09:09:15 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 13 Feb 2025 09:09:15 GMT Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic [v2] In-Reply-To: References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> Message-ID: <70S6L3Fgb47miJyN_xWQtE639pgMLpevdyIbRftQFgs=.4b8fbc73-bb1e-41f3-bb86-804d47b9fb4d@github.com> On Thu, 13 Feb 2025 04:20:40 GMT, Gui Cao wrote: >> src/hotspot/cpu/riscv/c1_Runtime1_riscv.cpp line 953: >> >>> 951: __ ld(x17, Address(x17)); >>> 952: __ beq(klass, x17, success); >>> 953: __ j(fail); >> >> Hmm, I think this jump could be saved if we put a direct return here instead. Like: >> >> __ beq(klass, x17, success); >> __ mv(result, 0); >> __ ret(); >> >> What do you think? @Hamlin-Li @zifeihan > > Yes, I think we can put a direct return here. Yes, seems better. I was thinking to move bind(fail) here, but seems current lookup_secondary_supers_table_var only accept a success label. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23551#discussion_r1954102465 From epeter at openjdk.org Thu Feb 13 09:12:16 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 09:12:16 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v27] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:48:29 GMT, Johannes Graham wrote: >> C2 does not eliminate XOR nodes with constant arguments. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. >> >> This patch demonstrates a potential fix to the problem, but there might well be better ways to do it. > > Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: > > formatting, remove commented tests @j3graham Thanks for taking this on! It's great to see someone clean this up and make sure all the cases optimize as expected! src/hotspot/share/opto/addnode.cpp line 986: > 984: > 985: template > 986: static S calc_xor_max(const S hi_0, const S hi_1) { Can we please have a more expressive name? Having `max` in the name can be a little confusing, as it is its own operation, which is not relevant here it seems. What we are really finding is the `hi` of the type after the `xor`. So why not name it `calculate_hi_after_xor`? src/hotspot/share/opto/addnode.cpp line 995: > 993: > 994: // We want to find a value that has all 1 bits everywhere up to and including > 995: // the highest bits set in r0->_hi as well as r1->_hi. For this,we can take the next Suggestion: // the highest bits set in r0->_hi as well as r1->_hi. For this, we can take the next src/hotspot/share/opto/addnode.cpp line 1010: > 1008: if( r0 == TypeInt::BOOL && ( r1 == TypeInt::ONE > 1009: || r1 == TypeInt::BOOL)) > 1010: return TypeInt::BOOL; It looks to me like this case should be covered by `calc_xor_max` below. 
Do we have any IR tests that verify that this still gets optimized as before? src/hotspot/share/opto/addnode.cpp line 1046: > 1044: jint XorINode::calc_max(const jint hi_0, const jint hi_1) { > 1045: return calc_xor_max(hi_0, hi_1); > 1046: } What is this method used for? Only by the tests? Why not use `calc_xor_max` in the tests directly? src/hotspot/share/opto/addnode.cpp line 1063: > 1061: // Result of xor can only have bits sets where any of the > 1062: // inputs have bits set. lo can always become 0. > 1063: Hmm, I'm not super happy with this comment. It feels like somehow the naming of `calc_xor_max` and the checks around its use are not very "clear" or apparent on a first read. The comments help a little... The comment about `lo can always become 0.` is not generally true, I mean if it is already below 0 which we have not YET checked at the time of the comment, then we cannot just change it to zero. Maybe we want to refactor this logic here. Really you are checking if we have an unsigned case here. And then you are checking if you can constrain the `hi` to the possibly active bits. I mean currently your `calc_xor_max` would even assert if the inputs are negative, so I feel the method name should reflect that expectation as well. ------------- Changes requested by epeter (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23089#pullrequestreview-2614273871 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1954079856 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1954080333 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1954085505 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1954073522 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1954097305 From epeter at openjdk.org Thu Feb 13 09:20:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 09:20:11 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets [v2] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 12:07:16 GMT, Jatin Bhateja wrote: >>> @jatin-bhateja Doing the transformation to `AndF` would be a more general solution and thus better. >>> >>> > Introducing another new IR "AndF" will again need changes in auto-vectorizer. >>> >>> But currently, `CopySign` and `MoveF2I` are not vectorized anyway so we can do the vectorization of `AndF` in a separate patch without much hassle. `AndF` is vectorized into existing `AndV` nicely so it is not a too complicated work. >> >> Yes, I have a follow-up patch to auto-vectorized CopySign. >> >>> > this patch does not break existing IR invariants >>> >>> Also, what invariant can be broken by transforming `AndI(MoveF2I(x), MoveF2I(y)` into `MoveF2I(AndF(x, y))`? >> >> Hi @merykitty , I meant that in the context of CopySign, targets emit efficient instruction sequences for existing IR (CopySignF/D), this patch simply tuned x86 backend implementation to improve performance. > >> > @jatin-bhateja Doing the transformation to `AndF` would be a more general solution and thus better. >> > > Introducing another new IR "AndF" will again need changes in auto-vectorizer. 
>> > >> > >> > But currently, `CopySign` and `MoveF2I` are not vectorized anyway so we can do the vectorization of `AndF` in a separate patch without much hassle. `AndF` is vectorized into existing `AndV` nicely so it is not a too complicated work. >> >> Yes, I have a follow-up patch to auto-vectorized CopySign. >> >> > > this patch does not break existing IR invariants >> > >> > >> > Also, what invariant can be broken by transforming `AndI(MoveF2I(x), MoveF2I(y)` into `MoveF2I(AndF(x, y))`? >> >> Hi @merykitty , I meant that in the context of CopySign, targets emit efficient instruction sequences for existing IR (CopySignF/D), this patch simply tuned x86 backend implementation to improve performance. > > > Also currently, logical And mask is a long value, in case we opt-in for new AndF/D node creation, to preserve the IR semantics we would also need to perform an integral to floating point constant conversion, this will incur additional memory load penalty since floating-point constants are emitted into the constant table before native method body. > > For the time being, taking CopySign intrinsic route looks reasonable. @jatin-bhateja let me know when this is ready for more testing / review. Quick comment: it seems you are not just optimizing Math.copySign as the PR title says, but also adding vector nodes. Maybe you should update the PR title? 
Have not looked at the code in detail to suggest a better one yet ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23386#issuecomment-2655983534 From epeter at openjdk.org Thu Feb 13 09:26:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 09:26:17 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v18] In-Reply-To: References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> <90MwDac7Q83dK8KDagHOst15xV-quGZKVE8n2tP9dsk=.351ed042-9a69-4186-b134-8c3cb6fef6cd@github.com> Message-ID: On Wed, 12 Feb 2025 12:17:29 GMT, Jatin Bhateja wrote: >> @jatin-bhateja Tests look all good on my side. I'll make another pass in the next few days, and hopefully approve. > > Hi @eme64 , All comments addressed, looking forward to your approval @jatin-bhateja Perfect, it looks good now. Let me run testing one more time just to be sure. Please ping me in a day or so for the results! ------------- PR Comment: https://git.openjdk.org/jdk/pull/22863#issuecomment-2655997017 From qamai at openjdk.org Thu Feb 13 09:34:17 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 13 Feb 2025 09:34:17 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v27] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 08:54:51 GMT, Emanuel Peter wrote: >> Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: >> >> formatting, remove commented tests > > src/hotspot/share/opto/addnode.cpp line 986: > >> 984: >> 985: template >> 986: static S calc_xor_max(const S hi_0, const S hi_1) { > > Can we please have a more expressive name? Having `max` in the name can be a little confusing, as it is its own operation, which is not relevant here it seems. What we are really finding is the `hi` of the type after the `xor`. So why not name it `calculate_hi_after_xor`? 
I think the name is good enough, calculate the max of a xor is a pretty self-explanatory name. You just need a better description for the method. Suggestion: Given 2 non-negative values in the ranges [0, hi_0] and [0, hi_1], respectively. The bitwise xor of these values should also be non-negative. This method calculates its maximum. > src/hotspot/share/opto/addnode.cpp line 1046: > >> 1044: jint XorINode::calc_max(const jint hi_0, const jint hi_1) { >> 1045: return calc_xor_max(hi_0, hi_1); >> 1046: } > > What is this method used for? Only by the tests? Why not use `calc_xor_max` in the tests directly? `calc_xor_max` is a static method in the cpp file and thus it is not preferable to import it directly into the test file. > src/hotspot/share/opto/addnode.cpp line 1063: > >> 1061: // Result of xor can only have bits sets where any of the >> 1062: // inputs have bits set. lo can always become 0. >> 1063: > > Hmm, I'm not super happy with this comment. It feels like somehow the naming of `calc_xor_max` and the checks around its use are not very "clear" or apparent on a first read. The comments help a little... > > The comment about `lo can always become 0.` is not generally true, I mean if it is already below 0 which we have not YET checked at the time of the comment, then we cannot just change it to zero. > > Maybe we want to refactor this logic here. > Really you are checking if we have an unsigned case here. And then you are checking if you can constrain the `hi` to the possibly active bits. > > I mean currently your `calc_xor_max` would even assert if the inputs are negative, so I feel the method name should reflect that expectation as well. I think the comment should be removed, the specs and implementation of `calc_xor_max` should be referred to when trying to understand this piece of code.
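For readers following the naming discussion, here is a minimal Java sketch of the kind of bound being talked about: given non-negative inputs in [0, hi_0] and [0, hi_1], the xor cannot set any bit above the highest bit set in `hi_0 | hi_1`. The class and method names are hypothetical, and this is not the code from the patch; it computes a safe upper bound that is not necessarily the exact maximum (for example, hi_0 = 4 and hi_1 = 0 gives 7 although the true maximum is 4).

```java
public class XorBoundSketch {
    // Hypothetical helper, not the HotSpot implementation.
    // Returns a value B such that (x ^ y) <= B for all 0 <= x <= hi0, 0 <= y <= hi1.
    static int xorUpperBound(int hi0, int hi1) {
        if (hi0 < 0 || hi1 < 0) {
            throw new IllegalArgumentException("inputs must be non-negative");
        }
        int combined = hi0 | hi1;
        if (combined == 0) {
            return 0; // both ranges are exactly {0}
        }
        // Set the highest possible bit and every bit below it.
        int top = Integer.highestOneBit(combined);
        return top | (top - 1);
    }

    public static void main(String[] args) {
        // Brute-force check of the bound on small ranges.
        for (int hi0 = 0; hi0 < 64; hi0++) {
            for (int hi1 = 0; hi1 < 64; hi1++) {
                int bound = xorUpperBound(hi0, hi1);
                for (int x = 0; x <= hi0; x++) {
                    for (int y = 0; y <= hi1; y++) {
                        if ((x ^ y) > bound) {
                            throw new AssertionError("bound violated for " + x + " ^ " + y);
                        }
                    }
                }
            }
        }
        System.out.println("bound holds on all tested ranges");
    }
}
```

The sketch also shows why the non-negativity precondition matters for the name: with a negative input the sign bit can be set in the result, and the reasoning about the highest set bit no longer gives a signed upper bound.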
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1954132494 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1954135679 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1954143818 From dchuyko at openjdk.org Thu Feb 13 09:35:13 2025 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Thu, 13 Feb 2025 09:35:13 GMT Subject: RFR: 8347917: AArch64: Enable upper GPR registers in C1 [v4] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 08:49:40 GMT, Andrew Haley wrote: > On 2/12/25 22:25, Dmitry Chuyko wrote: Just a few things to keep here: 1. Even for aarch64 just reversing allocation order is not enough (callee preserved regs are saved in a caller). 2. Register saving overhead for runtime calls is there, but making a call without saving is still expensive. > I don't quite understand what you're saying here. In the first sentence you seem to imply that callee preserved regs are still saved in the caller, unnecessarily. Yes. There is some other place to be changed. > In the second sentence you say "saving overhead for runtime calls is there," which seems to imply that there is some advantage to using a callee-saved register for runtime calls. Clearly this issue only applies to runtime calls, because Java has no callee preserved regs. What conclusion do you make from the benchmark you presented? That the overhead of making a call from C1-compiled code is great, especially when there are many spills? Speculatively it's like having that call costs ~4ns/op, and preserving unnecessary values costs extra ~1ns/op. Preserving unnecessary values also costs a lot of instructions. This is definitely a subject for a separate further study, I just checked that if we can observe any difference in benchmarks (yes), and is reversing allocation order currently enough to help (no). 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23152#issuecomment-2656018287 From epeter at openjdk.org Thu Feb 13 09:50:22 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 09:50:22 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v22] In-Reply-To: References: Message-ID: On Sat, 8 Feb 2025 18:30:56 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. >> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: >> >> >> ((base + index << 8) + 256) & 255 >> => MulNode (this PR) >> (base + index << 8) & 255 >> => MulNode (PR #6697) >> base & 255 (loop invariant) >> >> >> Implementation notes: >> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. >> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ >> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > Reword correctness (fixes). I was thinking of how we could get this as short and simple as possible. Let me know what you think about this: // Check if expr is a neutral additive element under mask. 
We have
//   (expr + addend) & mask
// and we would like to know that for any addend, this is equivalent to
//   addend & mask
//
// Let M be the smallest power of 2 strictly greater than mask.
// Let m = M-1 be the bitmask for modular arithmetic modulo M.
// We assume that M is positive, and so we can apply the rules from unsigned
// modular arithmetic:
//   (x + y) % M = ((x % M) + y) % M
// or using the mask:
//   (x + y) & m = ((x & m) + y) & m
//
// Note that mask can only have 1-bits where m has 1-bits. Hence:
//   (expr + addend) & mask
//   = (expr + addend) & mask & m
//   = ((expr & m) + addend) & mask & m
//   = ((expr & m) + addend) & mask
// And if we can prove that
//   (expr & m) = 0
// Then
//   (expr + addend) & mask = addend & mask
Then you could just prove `(expr & m) = 0` which is a little simpler on its own ;) Anyway, I'll leave it here. @merykitty has also given a suggestion, so it's now up to you. If you feel frustrated with the math, then maybe we can help out more, let us know. I know it can be hard to get nice definitions, and proofs. But I think it is helpful, especially if something is wrong later, then it is easier to see what the author intended to do, and what they had assumed. ------------- PR Review: https://git.openjdk.org/jdk/pull/22856#pullrequestreview-2614440881 From epeter at openjdk.org Thu Feb 13 09:53:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 09:53:15 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v27] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:24:59 GMT, Quan Anh Mai wrote: >> src/hotspot/share/opto/addnode.cpp line 986: >> >>> 984: >>> 985: template >>> 986: static S calc_xor_max(const S hi_0, const S hi_1) { >> >> Can we please have a more expressive name? Having `max` in the name can be a little confusing, as it is its own operation, which is not relevant here it seems. What we are really finding is the `hi` of the type after the `xor`.
So why not name it `calculate_hi_after_xor`? > > I think the name is good enough, calculate the max of a xor is a pretty self-explanatory name. You just need a better description for the method. Suggestion: > > Given 2 non-negative values in the ranges [0, hi_0] and [0, hi_1], respectively. The bitwise xor of these values should also be non-negative. This method calculates its maximum. What about calling it `calculate_upper_bound_of_xor_with_non_negative_inputs`? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1954177610 From epeter at openjdk.org Thu Feb 13 09:57:16 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 09:57:16 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v27] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:26:42 GMT, Quan Anh Mai wrote: >> src/hotspot/share/opto/addnode.cpp line 1046: >> >>> 1044: jint XorINode::calc_max(const jint hi_0, const jint hi_1) { >>> 1045: return calc_xor_max(hi_0, hi_1); >>> 1046: } >> >> What is this method used for? Only by the tests? Why not use `calc_xor_max` in the tests directly? > > `calc_xor_max` is a static method in the cpp file and thus it is not preferable to import it directly into the test file. I suppose this is more of a cosmetic concern, not as important. Up to you what you want to do @j3graham. >> src/hotspot/share/opto/addnode.cpp line 1063: >> >>> 1061: // Result of xor can only have bits sets where any of the >>> 1062: // inputs have bits set. lo can always become 0. >>> 1063: >> >> Hmm, I'm not super happy with this comment. It feels like somehow the naming of `calc_xor_max` and the checks around its use are not very "clear" or apparent on a first read. The comments help a little... >> >> The comment about `lo can always become 0.` is not generally true, I mean if it is already below 0 which we have not YET checked at the time of the comment, then we cannot just change it to zero.
>> >> Maybe we want to refactor this logic here. >> Really you are checking if we have an unsigned case here. And then you are checking if you can constrain the `hi` to the possibly active bits. >> >> I mean currently your `calc_xor_max` would even assert if the inputs are negative, so I feel the method name should reflect that expectation as well. > > I think the comment should be removed, the specs and implementation of `calc_xor_max` should be referred to when trying to understand this piece of code. I think a better name like `calculate_upper_bound_of_xor_with_non_negative_inputs` would help a lot, and make some comments redundant. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1954181880 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1954183571 From dnsimon at openjdk.org Thu Feb 13 10:04:20 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 13 Feb 2025 10:04:20 GMT Subject: RFR: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong Message-ID: The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so its C++ type in HotSpot should match.
------------- Commit messages: - converted JVMCIRuntime::_shared_library_javavm_id to jlong Changes: https://git.openjdk.org/jdk/pull/23610/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23610&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349977 Stats: 7 lines in 3 files changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/23610.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23610/head:pull/23610 PR: https://git.openjdk.org/jdk/pull/23610 From aph at openjdk.org Thu Feb 13 10:08:14 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 13 Feb 2025 10:08:14 GMT Subject: RFR: 8347917: AArch64: Enable upper GPR registers in C1 [v4] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:32:49 GMT, Dmitry Chuyko wrote: > Speculatively it's like having that call costs ~4ns/op, and preserving unnecessary values costs extra ~1ns/op. Preserving unnecessary values also costs a lot of instructions. > > This is definitely a subject for a separate further study, I just checked that if we can observe any difference in benchmarks (yes), and is reversing allocation order currently enough to help (no). OK, I get it. So it's not clear whether this is worth addressing in C1, and maybe someone will get around to it when higher-priority tasks are dealt with. Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23152#issuecomment-2656083808 From aph at openjdk.org Thu Feb 13 10:08:16 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 13 Feb 2025 10:08:16 GMT Subject: RFR: 8347917: AArch64: Enable upper GPR registers in C1 [v4] In-Reply-To: References: Message-ID: On Thu, 30 Jan 2025 08:32:25 GMT, Dmitry Chuyko wrote: >> This small change enables upper GPR registers in C1 so they are used, and used similar to C2. r19-r26 are declared as caller-saved and enabled, r27 (rheapbase) is declared caller-saved, r27 (rheapbase) and r29 (fp) are enabled conditionally similar to C2. 
r29 is already handled in MacroAssembler::build_frame()/remove_frame(). >> >> r18 is excluded on macOS and Windows as before. r27 is excluded when `UseCompressedOops` is on and `CompressedOops::base() != nullptr`; r29 is excluded when `PreserveFramePointer` is on. >> >> Registers are declared caller-saved in c1_FrameMap_aarch64.cpp, conditionally enabled ones are in the tail of enabled range which is adjusted in c1_FrameMap_aarch64.hpp, the code there was made similar to x86 (JDK-6985015). >> >> Register ranges are also updated in the linear scan itself and in OOP map generation. >> >> Having more allocatable registers helps to avoid spills in register hungry code and thus improve performance and code density and simplify compilation. In practice the code that operates on so many values is not too frequent and upper registers are used less frequently than first ones. To perform testing it turned out to be useful to run C1 in a special mode when registers are allocated from upper to lower in LinearScanWalker::find_free_reg():
>>
>> - for (int i = _first_reg; i <= _last_reg; i++) {
>> + for (int i = _last_reg; i >= _first_reg; i--) {
>>
>> It was also useful to run the JVM with C1 compilation only and with different GCs and small heaps like `-XX:TieredStopAtLevel=1 -Xmx256m -XX:+UseSerialGC`. >> >> Tier1-3 jtreg tests showed no regression on linux-aarch64 (release, slowdebug, Xcomp) with either direct or reversed register allocation order. Windows and macOS were also tested to check r18 handling, +-CompressedOops and +-PreserveFramePointer combinations were tested. >> >> SHA3 Java implementation is an example of register hungry code. Throughput results greatly depend on the actual CPU being used.
On Graviton 2 the improvement in the dedicated micro-benchmark is ~**19%** for longer arrays (`-XX:TieredStopAtLevel=1 -XX:+UnlockDiagnosticVMOptions -XX:-UseSHA3Intrinsics -jar ../benchmarks.jar -f 1 -wi 2 -i 3 -p digesterName=SHA3-256 -p length=16384 -jvmArgsAppend="-XX:-UseCompressedOops -XX:-PreserveFramePointer -Xmx31g -Xlog:gc+heap+coops=debug" MessageDigests.digest$`). > > Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: > > Accurate caller-saved regs definition src/hotspot/share/c1/c1_Compiler.cpp line 54: > 52: BufferBlob* buffer_blob = CompilerThread::current()->get_buffer_blob(); > 53: FrameMap::initialize(); > 54: Runtime1::initialize(buffer_blob); Why this change? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23152#discussion_r1954195237 From aph at openjdk.org Thu Feb 13 10:08:16 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 13 Feb 2025 10:08:16 GMT Subject: RFR: 8347917: AArch64: Enable upper GPR registers in C1 [v4] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 10:01:16 GMT, Andrew Haley wrote: >> Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: >> >> Accurate caller-saved regs definition > > src/hotspot/share/c1/c1_Compiler.cpp line 54: > >> 52: BufferBlob* buffer_blob = CompilerThread::current()->get_buffer_blob(); >> 53: FrameMap::initialize(); >> 54: Runtime1::initialize(buffer_blob); > > Why this change? Ah, I guess it's because you refer to `FrameMap::caller_save_cpu_reg_at(i)` in your code.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23152#discussion_r1954200390 From dchuyko at openjdk.org Thu Feb 13 10:08:16 2025 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Thu, 13 Feb 2025 10:08:16 GMT Subject: RFR: 8347917: AArch64: Enable upper GPR registers in C1 [v4] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 10:04:10 GMT, Andrew Haley wrote: >> src/hotspot/share/c1/c1_Compiler.cpp line 54: >> >>> 52: BufferBlob* buffer_blob = CompilerThread::current()->get_buffer_blob(); >>> 53: FrameMap::initialize(); >>> 54: Runtime1::initialize(buffer_blob); >> >> Why this change? > > Ah, I guess it's because you refer to `FrameMap::caller_save_cpu_reg_at(i)` in your code. Yes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23152#discussion_r1954202897 From epeter at openjdk.org Thu Feb 13 10:16:14 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 10:16:14 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v4] In-Reply-To: <0iE7uPGSBpBdlgayY_gqBpuPay-XSpjMdaOuqdo-nhs=.1c7fa2cb-f1ea-4810-8fe6-2e0e6af7b8ac@github.com> References: <0iE7uPGSBpBdlgayY_gqBpuPay-XSpjMdaOuqdo-nhs=.1c7fa2cb-f1ea-4810-8fe6-2e0e6af7b8ac@github.com> Message-ID: On Sun, 9 Feb 2025 06:03:03 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected.
Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine:
>>
>>                                                  Baseline                    Patch
>> Benchmark                  (SIZE)  Mode  Cnt    Score    Error  Units    Score   Error  Units  Improvement
>> VectorSubword.intToByte      1024  avgt   12  200.049 ± 19.787  ns/op   56.228 ± 3.535  ns/op  (3.56x)
>> VectorSubword.intToShort     1024  avgt   12  179.826 ±  1.539  ns/op   43.332 ± 1.166  ns/op  (4.15x)
>> VectorSubword.shortToByte    1024  avgt   12  245.580 ±  6.150  ns/op   29.757 ± 1.055  ns/op  (8.25x)
>>
>> I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Add new conversions to benchmark @jaskarth That looks great! Thanks for the updates! A little control question: We now add additional vector operations into the loop body. How do we know this will not lead to a regression in some cases? I think it should not... right? Are the casts no-ops, or could they have some cost to them? Could there be any case where the wins from vectorization would not make up for the extra vector cast node? src/hotspot/cpu/x86/matcher_x86.hpp line 273: > 271: if (to_bt == from_bt) { > 272: return false; > 273: } Hmm, do we expect that this ever gets triggered? Or would that be a bug? Maybe not, but could be worth adding a defensive assert here, what do you think? src/hotspot/cpu/x86/matcher_x86.hpp line 284: > 282: return false; > 283: } > 284: } You could leave a quick comment about why `CHAR` is not yet covered here.
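As a point of reference for the narrowing casts discussed above, the following Java sketch shows the loop shapes that the benchmark names (intToByte, intToShort, shortToByte) suggest. This is an assumption about the general pattern, not the benchmark source; the class and method names are hypothetical.

```java
import java.util.Arrays;

public class SubwordCastSketch {
    // Each loop loads one element type and stores a narrower one,
    // so the load pack and the store pack have different basic types.
    static void intToByte(int[] src, byte[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
    }

    static void intToShort(int[] src, short[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (short) src[i];
        }
    }

    static void shortToByte(short[] src, byte[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
    }

    public static void main(String[] args) {
        int[] src = {0, 1, 127, 128, 255, 256, -1, 0x12345678};
        byte[] bytes = new byte[src.length];
        short[] shorts = new short[src.length];
        intToByte(src, bytes);
        intToShort(src, shorts);
        System.out.println(Arrays.toString(bytes));
        System.out.println(Arrays.toString(shorts));
    }
}
```

Whether a given JVM build vectorizes these loops depends on the platform and flags, so the sketch only fixes the scalar semantics that any vectorized version has to preserve.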
src/hotspot/share/opto/vtransform.hpp line 537: > 535: virtual VTransformApplyResult apply(const VLoopAnalyzer& vloop_analyzer, > 536: const GrowableArray& vnode_idx_to_transformed_node) const override; > 537: NOT_PRODUCT(virtual const char* name() const override { return "Cast"; };) Suggestion: NOT_PRODUCT(virtual const char* name() const override { return "CastVector"; };) test/hotspot/jtreg/compiler/loopopts/superword/TestCompatibleUseDefTypeSize.java line 333: > 331: applyIfPlatform = {"64-bit", "true"}, > 332: applyIf = {"AlignVector", "false"}, > 333: applyIfCPUFeature = {"avx", "true"}) This may be a little nit-picky. But why have a new test-file when this test here was already trying to cover the conversion cases? I think I wrote it back then, and was just too lazy to write all conversion cases. I'd suggest you move your cases up here ;) ------------- PR Review: https://git.openjdk.org/jdk/pull/23413#pullrequestreview-2614468035 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1954188505 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1954195558 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1954203200 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1954207931 From epeter at openjdk.org Thu Feb 13 10:16:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 10:16:15 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v4] In-Reply-To: References: <0iE7uPGSBpBdlgayY_gqBpuPay-XSpjMdaOuqdo-nhs=.1c7fa2cb-f1ea-4810-8fe6-2e0e6af7b8ac@github.com> Message-ID: On Thu, 13 Feb 2025 10:06:04 GMT, Emanuel Peter wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Add new conversions to benchmark > > src/hotspot/share/opto/vtransform.hpp line 537: > >> 535: virtual VTransformApplyResult apply(const VLoopAnalyzer& vloop_analyzer, >> 536: const GrowableArray& 
vnode_idx_to_transformed_node) const override; >> 537: NOT_PRODUCT(virtual const char* name() const override { return "Cast"; };) > > Suggestion: > > NOT_PRODUCT(virtual const char* name() const override { return "CastVector"; };) I know the node is not inheriting from `VTransformVectorNode`, but we can make that happen with a future refactoring I'm already working on. > test/hotspot/jtreg/compiler/loopopts/superword/TestCompatibleUseDefTypeSize.java line 333: > >> 331: applyIfPlatform = {"64-bit", "true"}, >> 332: applyIf = {"AlignVector", "false"}, >> 333: applyIfCPUFeature = {"avx", "true"}) > > This may be a little nit-picky. But why have a new test-file when this test here was already trying to cover the conversion cases? I think I wrote it back then, and was just too lazy to write all conversion cases. I'd suggest you move your cases up here ;) I think I added these tests when I was reworking `SuperWord::is_velt_basic_type_compatible_use_def`, which you are now touching as well ? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1954204109 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1954209856 From epeter at openjdk.org Thu Feb 13 10:49:14 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 10:49:14 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v7] In-Reply-To: <3cT_HJ9dj5J4NFrLzmvYUdUy4uee6Ltcm6d20YP3jm0=.aa20c25e-c097-4e59-9d82-12aa2c3b4422@github.com> References: <3cT_HJ9dj5J4NFrLzmvYUdUy4uee6Ltcm6d20YP3jm0=.aa20c25e-c097-4e59-9d82-12aa2c3b4422@github.com> Message-ID: On Fri, 7 Feb 2025 09:52:51 GMT, Roland Westrelin wrote: >> This change refactors `RShiftI`/`RshiftL` `Ideal`, `Identity` and >> `Value` because the `int` and `long` versions are very similar and so >> there's no logic duplication. In the process, support for some extra >> transformations is added to `RShiftL`. I also added some new test >> cases. 
> > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review @rwestrel nice work, looks like a good step to unify the code a little! I left some comments / suggestions. I'm also wondering about testing. How good do you think test coverage is? Are all cases covered? How about the edge-cases? Could we improve the coverage with randomization somehow? src/hotspot/share/opto/mulnode.cpp line 1311: > 1309: } > 1310: > 1311: Node *RShiftNode::IdealIL(PhaseGVN* phase, bool can_reshape, BasicType bt) { Suggestion: Node* RShiftNode::IdealIL(PhaseGVN* phase, bool can_reshape, BasicType bt) { src/hotspot/share/opto/mulnode.cpp line 1317: > 1315: return NodeSentinel; // Left input is an integer > 1316: } > 1317: const TypeInteger* t3; // type of in(1).in(2) I know that you only moved this code, but it looks bad ? For one, why is it defined up here already when it is only used 10 lines later? And why not give it a better name so we don't need the comment? Suggestion: src/hotspot/share/opto/mulnode.cpp line 1329: > 1327: (t3 = phase->type(mask->in(2))->isa_integer(bt)) && > 1328: t3->is_con()) { > 1329: jlong maskbits = t3->get_con_as_long(bt); This is also quite bad. It seems `mask` here is `in(1)`, which is not even the mask at all, but `x & `. I'd suggest to clean it up a little and use better names. src/hotspot/share/opto/mulnode.cpp line 1330: > 1328: t3->is_con()) { > 1329: jlong maskbits = t3->get_con_as_long(bt); > 1330: // Convert to "(x >> shift) & (mask >> shift)" This is a nice comment. It could come as a motivation above. Because it suggests that we can then constant fold the `mask >> shift`, right? src/hotspot/share/opto/mulnode.cpp line 1383: > 1381: > 1382: const TypeInteger* r1 = t1->isa_integer(bt); // Handy access > 1383: const TypeInt* r2 = t2->isa_int(); // Handy access Suggestion: const TypeInteger* r1 = t1->isa_integer(bt); const TypeInt* r2 = t2->isa_int(); Let's reduce the noise a little. 
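The rewrite named in the comment on line 1330 above, from `(x & mask) >> shift` to `(x >> shift) & (mask >> shift)`, can be checked with a small Java sketch. The class and method names are hypothetical and this is only a sanity check of the arithmetic identity for `int` and `long`, not the C2 code; it relies on Java's `>>` being an arithmetic (sign-extending) shift.

```java
import java.util.Random;

public class ShiftMaskSketch {
    static long originalL(long x, long mask, int shift) {
        return (x & mask) >> shift;
    }

    // With a constant mask, (mask >> shift) can be folded at compile time.
    static long rewrittenL(long x, long mask, int shift) {
        return (x >> shift) & (mask >> shift);
    }

    static int originalI(int x, int mask, int shift) {
        return (x & mask) >> shift;
    }

    static int rewrittenI(int x, int mask, int shift) {
        return (x >> shift) & (mask >> shift);
    }

    public static void main(String[] args) {
        Random random = new Random(42); // fixed seed for reproducibility
        for (int i = 0; i < 1_000_000; i++) {
            long x = random.nextLong();
            long mask = random.nextLong();
            int shift = random.nextInt(64);
            if (originalL(x, mask, shift) != rewrittenL(x, mask, shift)) {
                throw new AssertionError("long mismatch: " + x + ", " + mask + ", " + shift);
            }
            int xi = (int) x;
            int maski = (int) mask;
            int shifti = shift & 31;
            if (originalI(xi, maski, shifti) != rewrittenI(xi, maski, shifti)) {
                throw new AssertionError("int mismatch: " + xi + ", " + maski + ", " + shifti);
            }
        }
        System.out.println("identity holds on all sampled inputs");
    }
}
```

The identity holds because the sign bit of `x & mask` is the AND of the two sign bits, so sign extension and AND commute. A randomized check of this kind is also one possible shape for the randomized tests asked for in the review.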
src/hotspot/share/opto/mulnode.cpp line 1462: > 1460: return progress; > 1461: } > 1462: const TypeInt* t3; // type of in(1).in(2) Also refactor the use of `t3` here, please. test/hotspot/jtreg/compiler/c2/irTests/RShiftLNodeIdealizationTests.java line 40: > 38: } > 39: > 40: @Run(test = { "test1", "test2", "test3", "test4", "test5", "test6", "test7", "test8", "test9" }) You should add the bug id above. test/hotspot/jtreg/compiler/c2/irTests/RShiftLNodeIdealizationTests.java line 119: > 117: final int test7Shift = 42; > 118: final long test7Min = -1L << (64 - test7Shift -1); > 119: final long test7Max = ~test7Min; Could we randomize these tests, so that we would get better coverage? ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23438#pullrequestreview-2614514492 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1954216080 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1954225487 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1954236168 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1954235878 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1954242414 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1954253473 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1954262666 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1954264026 From qamai at openjdk.org Thu Feb 13 11:04:10 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 13 Feb 2025 11:04:10 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets [v2] In-Reply-To: References: Message-ID: <6I-Otx3thFLIcBATF5ggyk5fHlEQyx-NXJ2sNW_pVsE=.c7d8b1ba-1021-4593-93c7-b61636b98a7e@github.com> On Wed, 12 Feb 2025 12:07:16 GMT, Jatin Bhateja wrote: > Also currently, logical And mask is a long value, in case we opt-in for new AndF/D node creation, to preserve the IR semantics 
we would also need to perform an integral to floating point constant conversion, this will incur additional memory load penalty since floating-point constants are emitted into the constant table before native method body. That means we can improve the generation of floating-point constants. The reason I object this approach is that it is short-sighted. It's not like we cannot generate similar machine code with the more general approach. Furthermore, after we do `AndF` transformations, this patch is redundant and can be removed entirely. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23386#issuecomment-2656240490 From gcao at openjdk.org Thu Feb 13 11:12:56 2025 From: gcao at openjdk.org (Gui Cao) Date: Thu, 13 Feb 2025 11:12:56 GMT Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic [v2] In-Reply-To: <70S6L3Fgb47miJyN_xWQtE639pgMLpevdyIbRftQFgs=.4b8fbc73-bb1e-41f3-bb86-804d47b9fb4d@github.com> References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> <70S6L3Fgb47miJyN_xWQtE639pgMLpevdyIbRftQFgs=.4b8fbc73-bb1e-41f3-bb86-804d47b9fb4d@github.com> Message-ID: On Thu, 13 Feb 2025 09:06:36 GMT, Hamlin Li wrote: >> Yes, I think we can put a direct return here. > > Yes, seems better. > I was thinking to move bind(fail) here, but seems current lookup_secondary_supers_table_var only accept a success label. > Hmm, I think this jump could be saved if we put a direct return here instead. Like: > > ``` > __ beq(klass, x17, success); > __ mv(result, 0); > __ ret(); > ``` > > What do you think? @Hamlin-Li @zifeihan Fixed. 
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23551#discussion_r1954301888

From gcao at openjdk.org Thu Feb 13 11:12:56 2025
From: gcao at openjdk.org (Gui Cao)
Date: Thu, 13 Feb 2025 11:12:56 GMT
Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic [v3]
In-Reply-To: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com>
References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com>
Message-ID:

> Follow this patch https://github.com/openjdk/jdk/pull/22491, RISC-V implementation for Class.isInstance intrinsic.
>
>
> ### JMH numbers (tested on milkv megrez with hotspot client build):
>
> #### before this patch:
>
> Benchmark                             Mode  Cnt    Score   Error  Units
> SecondarySupersLookup.testNegative00  avgt   15   48.589 ± 0.981  ns/op
> SecondarySupersLookup.testNegative01  avgt   15   48.577 ± 0.297  ns/op
> SecondarySupersLookup.testNegative02  avgt   15   48.760 ± 0.740  ns/op
> SecondarySupersLookup.testNegative03  avgt   15   48.442 ± 0.029  ns/op
> SecondarySupersLookup.testNegative04  avgt   15   48.453 ± 0.095  ns/op
> SecondarySupersLookup.testNegative05  avgt   15   48.435 ± 0.025  ns/op
> SecondarySupersLookup.testNegative06  avgt   15   48.540 ± 0.476  ns/op
> SecondarySupersLookup.testNegative07  avgt   15   48.452 ± 0.032  ns/op
> SecondarySupersLookup.testNegative08  avgt   15   48.466 ± 0.034  ns/op
> SecondarySupersLookup.testNegative09  avgt   15   48.478 ± 0.132  ns/op
> SecondarySupersLookup.testNegative10  avgt   15   48.435 ± 0.032  ns/op
> SecondarySupersLookup.testNegative16  avgt   15   48.440 ± 0.027  ns/op
> SecondarySupersLookup.testNegative20  avgt   15   47.977 ± 0.989  ns/op
> SecondarySupersLookup.testNegative30  avgt   15   48.655 ± 0.487  ns/op
> SecondarySupersLookup.testNegative32  avgt   15   48.566 ± 0.251  ns/op
> SecondarySupersLookup.testNegative40  avgt   15   48.513 ± 0.196  ns/op
> SecondarySupersLookup.testNegative50  avgt   15   48.454 ± 0.075  ns/op
> SecondarySupersLookup.testNegative55  avgt   15   71.670 ± 1.632  ns/op
> SecondarySupersLookup.testNegative56  avgt   15   70.923 ± 1.679  ns/op
> SecondarySupersLookup.testNegative57  avgt   15   70.140 ± 0.048  ns/op
> SecondarySupersLookup.testNegative58  avgt   15   70.473 ± 0.726  ns/op
> SecondarySupersLookup.testNegative59  avgt   15   70.127 ± 0.022  ns/op
> SecondarySupersLookup.testNegative60  avgt   15   82.525 ± 1.178  ns/op
> SecondarySupersLookup.testNegative61  avgt   15   81.647 ± 0.758  ns/op
> SecondarySupersLookup.testNegative62  avgt   15   82.347 ± 1.943  ns/op
> SecondarySupersLookup.testNegative63  avgt   15  129.188 ± 1.550  ns/op
> SecondarySupersLookup.testNegative64  avgt   15  130.274 ± 1.668  ns/op
> SecondarySupersLookup.testPositive01  avgt   15   63.390 ± 0.222  ns/op
> SecondarySupersLookup.testPositive02  avgt   15   63.435 ± 0.027  ns/op
> SecondarySupersLookup.testPositive03  avgt   15   63.469 ± 0.080  ns/op
> SecondarySupersLookup.testPositive04  avgt   15   63.896 ...

Gui Cao has updated the pull request incrementally with one additional commit since the last revision:

  Update for RealFYang's comment

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/23551/files
  - new: https://git.openjdk.org/jdk/pull/23551/files/2ec9dac1..dd664116

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=23551&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23551&range=01-02

  Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/23551.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23551/head:pull/23551

PR: https://git.openjdk.org/jdk/pull/23551

From fyang at openjdk.org Thu Feb 13 11:15:10 2025
From: fyang at openjdk.org (Fei Yang)
Date: Thu, 13 Feb 2025 11:15:10 GMT
Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic [v3]
In-Reply-To:
References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com>
Message-ID:

On Thu, 13 Feb 2025 11:12:56 GMT, Gui Cao wrote:

>> Follow this patch https://github.com/openjdk/jdk/pull/22491, RISC-V implementation
for Class.isInstance intrinsic.
>>
>>
>> ### JMH numbers (tested on milkv megrez with hotspot client build):
>>
>> #### before this patch:
>>
>> Benchmark                             Mode  Cnt    Score   Error  Units
>> SecondarySupersLookup.testNegative00  avgt   15   48.589 ± 0.981  ns/op
>> SecondarySupersLookup.testNegative01  avgt   15   48.577 ± 0.297  ns/op
>> SecondarySupersLookup.testNegative02  avgt   15   48.760 ± 0.740  ns/op
>> SecondarySupersLookup.testNegative03  avgt   15   48.442 ± 0.029  ns/op
>> SecondarySupersLookup.testNegative04  avgt   15   48.453 ± 0.095  ns/op
>> SecondarySupersLookup.testNegative05  avgt   15   48.435 ± 0.025  ns/op
>> SecondarySupersLookup.testNegative06  avgt   15   48.540 ± 0.476  ns/op
>> SecondarySupersLookup.testNegative07  avgt   15   48.452 ± 0.032  ns/op
>> SecondarySupersLookup.testNegative08  avgt   15   48.466 ± 0.034  ns/op
>> SecondarySupersLookup.testNegative09  avgt   15   48.478 ± 0.132  ns/op
>> SecondarySupersLookup.testNegative10  avgt   15   48.435 ± 0.032  ns/op
>> SecondarySupersLookup.testNegative16  avgt   15   48.440 ± 0.027  ns/op
>> SecondarySupersLookup.testNegative20  avgt   15   47.977 ± 0.989  ns/op
>> SecondarySupersLookup.testNegative30  avgt   15   48.655 ± 0.487  ns/op
>> SecondarySupersLookup.testNegative32  avgt   15   48.566 ± 0.251  ns/op
>> SecondarySupersLookup.testNegative40  avgt   15   48.513 ± 0.196  ns/op
>> SecondarySupersLookup.testNegative50  avgt   15   48.454 ± 0.075  ns/op
>> SecondarySupersLookup.testNegative55  avgt   15   71.670 ± 1.632  ns/op
>> SecondarySupersLookup.testNegative56  avgt   15   70.923 ± 1.679  ns/op
>> SecondarySupersLookup.testNegative57  avgt   15   70.140 ± 0.048  ns/op
>> SecondarySupersLookup.testNegative58  avgt   15   70.473 ± 0.726  ns/op
>> SecondarySupersLookup.testNegative59  avgt   15   70.127 ± 0.022  ns/op
>> SecondarySupersLookup.testNegative60  avgt   15   82.525 ± 1.178  ns/op
>> SecondarySupersLookup.testNegative61  avgt   15   81.647 ± 0.758  ns/op
>> SecondarySupersLookup.testNegative62  avgt   15   82.347 ± 1.943  ns/op
>> SecondarySupersLookup.testNegative63  avgt   15  129.188 ± 1.550  ns/op
>> SecondarySupersLookup.testNegative64  avgt   15  130.274 ± 1.668  ns/op
>> SecondarySupersLookup.testPositive01  avgt   15   63.390 ± 0.222  ns/op
>> SecondarySupersLookup.testPositive02  avgt   15   63.435 ± 0.027  ns/op
>> SecondarySupersLookup.testPositive03  avgt   15   63....
>
> Gui Cao has updated the pull request incrementally with one additional commit since the last revision:
>
>   Update for RealFYang's comment

Marked as reviewed by fyang (Reviewer).

-------------

PR Review: https://git.openjdk.org/jdk/pull/23551#pullrequestreview-2614661688

From mli at openjdk.org Thu Feb 13 11:35:12 2025
From: mli at openjdk.org (Hamlin Li)
Date: Thu, 13 Feb 2025 11:35:12 GMT
Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic [v3]
In-Reply-To:
References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com>
Message-ID:

On Thu, 13 Feb 2025 11:12:56 GMT, Gui Cao wrote:

>> Follow this patch https://github.com/openjdk/jdk/pull/22491, RISC-V implementation for Class.isInstance intrinsic.
>>
>>
>> ### JMH numbers (tested on milkv megrez with hotspot client build):
>>
>> #### before this patch:
>>
>> Benchmark                             Mode  Cnt    Score   Error  Units
>> SecondarySupersLookup.testNegative00  avgt   15   48.589 ± 0.981  ns/op
>> SecondarySupersLookup.testNegative01  avgt   15   48.577 ± 0.297  ns/op
>> SecondarySupersLookup.testNegative02  avgt   15   48.760 ± 0.740  ns/op
>> SecondarySupersLookup.testNegative03  avgt   15   48.442 ± 0.029  ns/op
>> SecondarySupersLookup.testNegative04  avgt   15   48.453 ± 0.095  ns/op
>> SecondarySupersLookup.testNegative05  avgt   15   48.435 ± 0.025  ns/op
>> SecondarySupersLookup.testNegative06  avgt   15   48.540 ± 0.476  ns/op
>> SecondarySupersLookup.testNegative07  avgt   15   48.452 ± 0.032  ns/op
>> SecondarySupersLookup.testNegative08  avgt   15   48.466 ± 0.034  ns/op
>> SecondarySupersLookup.testNegative09  avgt   15   48.478 ± 0.132  ns/op
>> SecondarySupersLookup.testNegative10  avgt   15   48.435 ± 0.032  ns/op
>> SecondarySupersLookup.testNegative16  avgt   15   48.440 ± 0.027  ns/op
>> SecondarySupersLookup.testNegative20  avgt   15   47.977 ± 0.989  ns/op
>> SecondarySupersLookup.testNegative30  avgt   15   48.655 ± 0.487  ns/op
>> SecondarySupersLookup.testNegative32  avgt   15   48.566 ± 0.251  ns/op
>> SecondarySupersLookup.testNegative40  avgt   15   48.513 ± 0.196  ns/op
>> SecondarySupersLookup.testNegative50  avgt   15   48.454 ± 0.075  ns/op
>> SecondarySupersLookup.testNegative55  avgt   15   71.670 ± 1.632  ns/op
>> SecondarySupersLookup.testNegative56  avgt   15   70.923 ± 1.679  ns/op
>> SecondarySupersLookup.testNegative57  avgt   15   70.140 ± 0.048  ns/op
>> SecondarySupersLookup.testNegative58  avgt   15   70.473 ± 0.726  ns/op
>> SecondarySupersLookup.testNegative59  avgt   15   70.127 ± 0.022  ns/op
>> SecondarySupersLookup.testNegative60  avgt   15   82.525 ± 1.178  ns/op
>> SecondarySupersLookup.testNegative61  avgt   15   81.647 ± 0.758  ns/op
>> SecondarySupersLookup.testNegative62  avgt   15   82.347 ± 1.943  ns/op
>> SecondarySupersLookup.testNegative63  avgt   15  129.188 ± 1.550  ns/op
>> SecondarySupersLookup.testNegative64  avgt   15  130.274 ± 1.668  ns/op
>> SecondarySupersLookup.testPositive01  avgt   15   63.390 ± 0.222  ns/op
>> SecondarySupersLookup.testPositive02  avgt   15   63.435 ± 0.027  ns/op
>> SecondarySupersLookup.testPositive03  avgt   15   63....
>
> Gui Cao has updated the pull request incrementally with one additional commit since the last revision:
>
>   Update for RealFYang's comment

Marked as reviewed by mli (Reviewer).
-------------

PR Review: https://git.openjdk.org/jdk/pull/23551#pullrequestreview-2614706927

From bkilambi at openjdk.org Thu Feb 13 11:38:09 2025
From: bkilambi at openjdk.org (Bhavana Kilambi)
Date: Thu, 13 Feb 2025 11:38:09 GMT
Subject: RFR: 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations
In-Reply-To:
References:
Message-ID:

On Thu, 13 Feb 2025 01:47:10 GMT, Xiaohong Gong wrote:

> Since PR [1] has added several new vector operations in VectorAPI and the X86 backend implementation for them, this patch adds the AArch64 backend part for NEON/SVE architectures.
>
> The performance of Vector API related JMH microbenchmarks can improve by about 70x ~ 95x on an NVIDIA Grace CPU, which is a 128-bit vector length SVE2 architecture, with different UseSVE options. Here are the gain details:
>
>
> Benchmark                  (size)  Mode   Cnt  -XX:UseSVE=0  -XX:UseSVE=1  -XX:UseSVE=2
> ByteMaxVector.SADD           1024  thrpt   30  80.69x        79.70x        80.534x
> ByteMaxVector.SADDMasked     1024  thrpt   30  84.08x        85.72x        85.901x
> ByteMaxVector.SSUB           1024  thrpt   30  80.46x        80.27x        81.063x
> ByteMaxVector.SSUBMasked     1024  thrpt   30  83.96x        85.26x        85.887x
> ByteMaxVector.SUADD          1024  thrpt   30  80.43x        80.36x        81.761x
> ByteMaxVector.SUADDMasked    1024  thrpt   30  83.40x        84.62x        85.199x
> ByteMaxVector.SUSUB          1024  thrpt   30  79.93x        79.22x        79.714x
> ByteMaxVector.SUSUBMasked    1024  thrpt   30  82.93x        85.02x        84.726x
> ByteMaxVector.UMAX           1024  thrpt   30  78.73x        77.39x        78.220x
> ByteMaxVector.UMAXMasked     1024  thrpt   30  82.62x        84.77x        85.531x
> ByteMaxVector.UMIN           1024  thrpt   30  79.04x        77.80x        78.471x
> ByteMaxVector.UMINMasked     1024  thrpt   30  83.11x        84.86x        86.126x
> IntMaxVector.SADD            1024  thrpt   30  83.11x        83.07x        83.183x
> IntMaxVector.SADDMasked      1024  thrpt   30  90.67x        91.80x        93.162x
> IntMaxVector.SSUB            1024  thrpt   30  83.37x        82.82x        83.317x
> IntMaxVector.SSUBMasked      1024  thrpt   30  90.85x        92.87x        94.201x
> IntMaxVector.SUADD           1024  thrpt   30  82.76x        81.78x        82.679x
> IntMaxVector.SUADDMasked     1024  thrpt   30  90.49x        91.93x        93.155x
> IntMaxVector.SUSUB           1024  thrpt   30  82.92x        82.34x        82.525x
> IntMaxVector.SUSUBMasked     1024  thrpt   30  90.60x        92.12x        92.951x
> IntMaxVector.UMAX            1024  thrpt   30  82.40x        81.85x        82.242x
> IntMaxVector.UMAXMasked      1024  thrpt   30  90.30x        92.10x        92.587x
> IntMaxVector.UMIN            1024  thrpt   30  82.84x        81.43x        82.801x
> IntMaxVector.UMINMasked      1024  thrpt   30  90.43x        91.49x        92.678x
> LongMaxVector.SADD           1024  thrpt   30  82.01x        81.74x        82.153x
> LongMaxVector...

src/hotspot/cpu/aarch64/aarch64_vector.ad line 257:

> 255:     case Op_ExpandBitsV:
> 256:       return false;
> 257:     case Op_SaturatingAddV:

Unsigned saturating addition and subtraction also require SVE2 to be enabled. Can they also be added here?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23608#discussion_r1954337746

From bkilambi at openjdk.org Thu Feb 13 11:53:13 2025
From: bkilambi at openjdk.org (Bhavana Kilambi)
Date: Thu, 13 Feb 2025 11:53:13 GMT
Subject: RFR: 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations
In-Reply-To:
References:
Message-ID:

On Thu, 13 Feb 2025 01:47:10 GMT, Xiaohong Gong wrote:

> Since PR [1] has added several new vector operations in VectorAPI and the X86 backend implementation for them, this patch adds the AArch64 backend part for NEON/SVE architectures.
>
> The performance of Vector API related JMH microbenchmarks can improve by about 70x ~ 95x on an NVIDIA Grace CPU, which is a 128-bit vector length SVE2 architecture, with different UseSVE options.
Here are the gain details:
>
>
> Benchmark                  (size)  Mode   Cnt  -XX:UseSVE=0  -XX:UseSVE=1  -XX:UseSVE=2
> ByteMaxVector.SADD           1024  thrpt   30  80.69x        79.70x        80.534x
> ByteMaxVector.SADDMasked     1024  thrpt   30  84.08x        85.72x        85.901x
> ByteMaxVector.SSUB           1024  thrpt   30  80.46x        80.27x        81.063x
> ByteMaxVector.SSUBMasked     1024  thrpt   30  83.96x        85.26x        85.887x
> ByteMaxVector.SUADD          1024  thrpt   30  80.43x        80.36x        81.761x
> ByteMaxVector.SUADDMasked    1024  thrpt   30  83.40x        84.62x        85.199x
> ByteMaxVector.SUSUB          1024  thrpt   30  79.93x        79.22x        79.714x
> ByteMaxVector.SUSUBMasked    1024  thrpt   30  82.93x        85.02x        84.726x
> ByteMaxVector.UMAX           1024  thrpt   30  78.73x        77.39x        78.220x
> ByteMaxVector.UMAXMasked     1024  thrpt   30  82.62x        84.77x        85.531x
> ByteMaxVector.UMIN           1024  thrpt   30  79.04x        77.80x        78.471x
> ByteMaxVector.UMINMasked     1024  thrpt   30  83.11x        84.86x        86.126x
> IntMaxVector.SADD            1024  thrpt   30  83.11x        83.07x        83.183x
> IntMaxVector.SADDMasked      1024  thrpt   30  90.67x        91.80x        93.162x
> IntMaxVector.SSUB            1024  thrpt   30  83.37x        82.82x        83.317x
> IntMaxVector.SSUBMasked      1024  thrpt   30  90.85x        92.87x        94.201x
> IntMaxVector.SUADD           1024  thrpt   30  82.76x        81.78x        82.679x
> IntMaxVector.SUADDMasked     1024  thrpt   30  90.49x        91.93x        93.155x
> IntMaxVector.SUSUB           1024  thrpt   30  82.92x        82.34x        82.525x
> IntMaxVector.SUSUBMasked     1024  thrpt   30  90.60x        92.12x        92.951x
> IntMaxVector.UMAX            1024  thrpt   30  82.40x        81.85x        82.242x
> IntMaxVector.UMAXMasked      1024  thrpt   30  90.30x        92.10x        92.587x
> IntMaxVector.UMIN            1024  thrpt   30  82.84x        81.43x        82.801x
> IntMaxVector.UMINMasked      1024  thrpt   30  90.43x        91.49x        92.678x
> LongMaxVector.SADD           1024  thrpt   30  82.01x        81.74x        82.153x
> LongMaxVector...

src/hotspot/cpu/aarch64/aarch64_vector.ad line 1574:

> 1572: instruct vsqadd_masked(vReg dst_src1, vReg src2, pRegGov pg) %{
> 1573:   predicate(UseSVE == 2 && !n->as_SaturatingVector()->is_unsigned());
> 1574:   match(Set dst_src1 (SaturatingAddV (Binary dst_src1 src2) pg));

For the masked match rules, should we also add a `USE_DEF` effect for `dst_src1` to indicate that this register is both read and destructively written to? I see that other similarly defined match rules in the ad file do not have this effect defined, but I am wondering if this should be done?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23608#discussion_r1954356677

From bulasevich at openjdk.org Thu Feb 13 11:58:23 2025
From: bulasevich at openjdk.org (Boris Ulasevich)
Date: Thu, 13 Feb 2025 11:58:23 GMT
Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v10]
In-Reply-To:
References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com>
 <9sHQ2GZxt0TERM5ghWCA2hArWxsdIErWZIAEJ9e1N3I=.4928b81a-be09-43a8-94c6-75e7bd645ed9@github.com>
Message-ID:

On Thu, 13 Feb 2025 03:06:58 GMT, Vladimir Kozlov wrote:

> Looks good. I will submit testing.

Thank you! The change is not yet ready for final testing. I still need to remove my raw access workaround in nmethod::oop_at and rebase onto #23512 once it has been integrated.
-------------

PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2656371541

From jbhateja at openjdk.org Thu Feb 13 12:15:19 2025
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Thu, 13 Feb 2025 12:15:19 GMT
Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException
In-Reply-To:
References:
Message-ID: <52HO_iL9asn1huCdJj82R1AwF1w8ON9HZetrdc9rQyQ=.28e137e0-a7f7-4839-a3e7-eda4f8a6c4f5@github.com>

On Thu, 13 Feb 2025 12:06:09 GMT, Jatin Bhateja wrote:

>> Suite MaskedLogicOpts.maskedLogicOperationsLong512() failed on both x86 and AArch64 with the following error:
>>
>>
>> java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249
>>
>>
>> The variable `long256_arr_idx` is misused when indexing 'LongVector l2, l3, l4, l5' in function `maskedLogicOperationsLongKernel()`. 'long256_arr_idx' increases by 4 every time the benchmark runs and ensures the incremented value remains within the bounds of the array. However, for `LongVector.SPECIES_512`, it loads 8 numbers from the array each time the benchmark runs, resulting in an out-of-range indexing issue.
>>
>> Hence, we revised the index variables from `long256_arr_idx` to `long512_arr_idx`, which has a stride of 8, to ensure that the loaded vector is inside of the array boundary for all vector species. This is also consistent with other kernel functions.
>>
>> Additionally, some defined but unused variables have been removed.
>
> test/micro/org/openjdk/bench/jdk/incubator/vector/MaskedLogicOpts.java line 122:
>
>> 120:     @Setup(Level.Invocation)
>> 121:     public void init_per_invoc() {
>> 122:         int512_arr_idx = (int512_arr_idx + 16) & (ARRAYLEN-1);
>
> The benchmark assumes that ARRAYLEN is a POT (power-of-two) value, so it would also be good to use the modulus operator for rounding here; it will be expensive but will not impact the performance of the benchmarking kernels.

Please try with the following command line:
`java -jar target/benchmarks.jar -f 1 -i 2 -wi 1 -w 30 -p ARRAYLEN=30 MaskedLogic`

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/22963#discussion_r1954384129

From jbhateja at openjdk.org Thu Feb 13 12:15:19 2025
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Thu, 13 Feb 2025 12:15:19 GMT
Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException
In-Reply-To:
References:
Message-ID:

On Wed, 8 Jan 2025 09:04:47 GMT, Nicole Xu wrote:

> Suite MaskedLogicOpts.maskedLogicOperationsLong512() failed on both x86 and AArch64 with the following error:
>
>
> java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249
>
>
> The variable `long256_arr_idx` is misused when indexing 'LongVector l2, l3, l4, l5' in function `maskedLogicOperationsLongKernel()`. 'long256_arr_idx' increases by 4 every time the benchmark runs and ensures the incremented value remains within the bounds of the array. However, for `LongVector.SPECIES_512`, it loads 8 numbers from the array each time the benchmark runs, resulting in an out-of-range indexing issue.
>
> Hence, we revised the index variables from `long256_arr_idx` to `long512_arr_idx`, which has a stride of 8, to ensure that the loaded vector is inside of the array boundary for all vector species. This is also consistent with other kernel functions.
>
> Additionally, some defined but unused variables have been removed.

test/micro/org/openjdk/bench/jdk/incubator/vector/MaskedLogicOpts.java line 122:

> 120:     @Setup(Level.Invocation)
> 121:     public void init_per_invoc() {
> 122:         int512_arr_idx = (int512_arr_idx + 16) & (ARRAYLEN-1);

The benchmark assumes that ARRAYLEN is a POT (power-of-two) value, so it would also be good to use the modulus operator for rounding here; it will be expensive but will not impact the performance of the benchmarking kernels.
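
The power-of-two assumption raised in the comment above can be illustrated with a small standalone sketch. It is not taken from `MaskedLogicOpts.java`: the class and method names (`IndexWrapSketch`, `wrapWithMask`, `wrapWithModulo`) are hypothetical, and the stride of 16 only mirrors the quoted line 122. With a length of 30, as in the suggested `ARRAYLEN=30` run, masking with `len - 1` no longer behaves like a remainder:

```java
public class IndexWrapSketch {
    // Hypothetical helper: wraps by masking with (len - 1).
    // This matches a remainder only when len is a power of two.
    public static int wrapWithMask(int idx, int stride, int len) {
        return (idx + stride) & (len - 1);
    }

    // Hypothetical helper: wraps with the remainder operator.
    // For non-negative idx + stride and len > 0 the result is in [0, len).
    public static int wrapWithModulo(int idx, int stride, int len) {
        return (idx + stride) % len;
    }

    public static void main(String[] args) {
        // Power-of-two length: both forms agree.
        System.out.println(wrapWithMask(1008, 16, 1024));   // 0
        System.out.println(wrapWithModulo(1008, 16, 1024)); // 0

        // Non-power-of-two length: the two forms diverge.
        System.out.println(wrapWithMask(2, 16, 30));    // 18 & 29 = 16
        System.out.println(wrapWithModulo(2, 16, 30));  // 18 % 30 = 18
    }
}
```

Note that either form only wraps the start index; whether a full vector load starting at that index still fits inside the array is a separate question that this sketch does not address.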

test/micro/org/openjdk/bench/jdk/incubator/vector/MaskedLogicOpts.java line 126:

> 124:     }
> 125:
> 126:     @CompilerControl(CompilerControl.Mode.INLINE)

By making the index hop over 16 ints or 8 longs we may leave gaps in between for 128-bit and 256-bit species. This will unnecessarily include noise due to cache misses or (on some targets) prefetching of additional cache lines which are not usable, thereby impacting the crispness of the microbenchmark.

test/micro/org/openjdk/bench/jdk/incubator/vector/MaskedLogicOpts.java line 234:

> 232:     }
> 233:
> 234:     @CompilerControl(CompilerControl.Mode.INLINE)

Benchmarking kernels are force-inlined, so passing a species-specific index value may help.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/22963#discussion_r1954379708
PR Review Comment: https://git.openjdk.org/jdk/pull/22963#discussion_r1954358833
PR Review Comment: https://git.openjdk.org/jdk/pull/22963#discussion_r1954385898

From yzheng at openjdk.org Thu Feb 13 12:35:15 2025
From: yzheng at openjdk.org (Yudi Zheng)
Date: Thu, 13 Feb 2025 12:35:15 GMT
Subject: RFR: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong
In-Reply-To:
References:
Message-ID:

On Thu, 13 Feb 2025 09:59:41 GMT, Doug Simon wrote:

> The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so its C++ type in HotSpot should match.

LGTM

-------------

Marked as reviewed by yzheng (Committer).

PR Review: https://git.openjdk.org/jdk/pull/23610#pullrequestreview-2614845493

From dnsimon at openjdk.org Thu Feb 13 12:50:15 2025
From: dnsimon at openjdk.org (Doug Simon)
Date: Thu, 13 Feb 2025 12:50:15 GMT
Subject: RFR: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong
In-Reply-To:
References:
Message-ID:

On Thu, 13 Feb 2025 09:59:41 GMT, Doug Simon wrote:

> The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so its C++ type in HotSpot should match.

Passes the openjdk-pr-canary: https://github.com/dougxc/openjdk-pr-canary/blob/master/tested-prs/23610/b7a38951a54ff4c1186a3682f717805822575ea8.json

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23610#issuecomment-2656499655

From epeter at openjdk.org Thu Feb 13 13:18:37 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Thu, 13 Feb 2025 13:18:37 GMT
Subject: RFR: 8315066: Add unsigned bounds and known bits to TypeInt/Long [v7]
In-Reply-To:
References:
Message-ID:

On Wed, 12 Feb 2025 16:40:35 GMT, Quan Anh Mai wrote:

>> @merykitty FYI: I'm going on vacation for 3 weeks, so I'll hope to come back to this afterward.
>
> @eme64 Ping

@merykitty Sorry, I've been sick for a week and am only just catching up with things again slowly...

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17508#issuecomment-2656571281

From mli at openjdk.org Thu Feb 13 14:25:34 2025
From: mli at openjdk.org (Hamlin Li)
Date: Thu, 13 Feb 2025 14:25:34 GMT
Subject: RFR: 8349908: RISC-V: C2 SelectFromTwoVector
Message-ID: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com>

Hi,
Can you help to review the patch? This optimization is mainly for the vector API.
Thanks

## Test
### jtreg
test/jdk/jdk/incubator/vector/

### Performance
run on bananapi
master vs patch

Benchmark | (size) | Mode | Cnt | Score - master | Error - master | Score - patch | Error - patch | Units | Improvement (master / patch)
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
SelectFromBenchmark.selectFromByteVector | 1024 | avgt | 10 | 26422.495 | 674.565 | 721.604 | 1.036 | ns/op | 36.616
SelectFromBenchmark.selectFromByteVector | 2048 | avgt | 10 | 53964.411 | 1751.618 | 1385.24 | 0.956 | ns/op | 38.957
SelectFromBenchmark.selectFromDoubleVector | 1024 | avgt | 10 | 218430.616 | 1369.409 | 7739.774 | 14.408 | ns/op | 28.222
SelectFromBenchmark.selectFromDoubleVector | 2048 | avgt | 10 | 387889.456 | 7889.791 | 16197.77 | 66.775 | ns/op | 23.947
SelectFromBenchmark.selectFromFloatVector | 1024 | avgt | 10 | 103483.717 | 492.525 | 3580.358 | 29.127 | ns/op | 28.903
SelectFromBenchmark.selectFromFloatVector | 2048 | avgt | 10 | 226125.02 | 3118.836 | 7797.025 | 4.346 | ns/op | 29.001
SelectFromBenchmark.selectFromIntVector | 1024 | avgt | 10 | 97007.999 | 2607.711 | 2898.38 | 0.833 | ns/op | 33.47
SelectFromBenchmark.selectFromIntVector | 2048 | avgt | 10 | 222303.308 | 3096.615 | 6398.214 | 30.345 | ns/op | 34.745
SelectFromBenchmark.selectFromLongVector | 1024 | avgt | 10 | 245033.436 | 1652.527 | 6307.773 | 24.597 | ns/op | 38.846
SelectFromBenchmark.selectFromLongVector | 2048 | avgt | 10 | 438503.547 | 5972.265 | 17215.996 | 167.442 | ns/op | 25.471
SelectFromBenchmark.selectFromShortVector | 1024 | avgt | 10 | 53632.502 | 2159.433 | 1418.215 | 2.953 | ns/op | 37.817
SelectFromBenchmark.selectFromShortVector | 2048 | avgt | 10 | 111764.327 | 1220.509 | 3061.386 | 14.716 | ns/op | 36.508

-------------

Commit messages:
 - comments
 - initial commit

Changes: https://git.openjdk.org/jdk/pull/23614/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23614&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8349908
  Stats: 38 lines in 1 file changed: 37 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/23614.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23614/head:pull/23614

PR: https://git.openjdk.org/jdk/pull/23614

From aph at openjdk.org Thu Feb 13 14:56:25 2025
From: aph at openjdk.org (Andrew Haley)
Date: Thu, 13 Feb 2025 14:56:25 GMT
Subject: RFR: 8347917: AArch64: Enable upper GPR registers in C1 [v4]
In-Reply-To:
References:
Message-ID: <-lUQ9hM3qwCGmYoXbLR29ePe8bIkvmjxlcTeabqJO6o=.9f7291d4-0d33-45f1-9a53-d049fec86a4c@github.com>

On Thu, 30 Jan 2025 08:32:25 GMT, Dmitry Chuyko wrote:

>> This small change enables upper GPR registers in C1 so they are used, and used similar to C2. r19-r26 are declared as caller-saved and enabled, r27 (rheapbase) is declared caller-saved, r27 (rheapbase) and r29 (fp) are enabled conditionally similar to C2. r29 is already handled in MacroAssembler::build_frame()/remove_frame().
>>
>> r18 is excluded on macOS and Windows as before. r27 is excluded when `UseCompressedOops` is on and `CompressedOops::base() != nullptr`, r29 is excluded when `PreserveFramePointer` is on.
>>
>> Registers are declared caller-saved in c1_FrameMap_aarch64.cpp, conditionally enabled ones are in the tail of enabled range which is adjusted in c1_FrameMap_aarch64.hpp, the code there was made similar to x86 (JDK-6985015).
>>
>> Register ranges are also updated in the linear scan itself and in OOP map generation.
>>
>> Having more allocatable registers helps to avoid spills in register-hungry code and thus improves performance and code density and simplifies compilation. In practice, code that operates on so many values is not too frequent, and upper registers are used less frequently than the first ones.
To perform testing it turned out to be useful to run C1 in a special mode where registers are allocated from upper to lower in LinearScanWalker::find_free_reg():
>>
>>
>>   - for (int i = _first_reg; i <= _last_reg; i++) {
>>   + for (int i = _last_reg; i >= _first_reg; i--) {
>>
>>
>> It was also useful to run the JVM with C1 compilation only and with different GCs and small heaps like `-XX:TieredStopAtLevel=1 -Xmx256m -XX:+UseSerialGC`.
>>
>> Tier1-3 jtreg tests showed no regression on linux-aarch64 (release, slowdebug, Xcomp) with either direct or reversed register allocation order. Windows and macOS were also tested to check r18 handling, +-CompressedOops and +-PreserveFramePointer combinations were tested.
>>
>> The SHA3 Java implementation is an example of register-hungry code. Throughput results greatly depend on the actual CPU being used. On Graviton 2 the improvement in the dedicated micro-benchmark is ~**19%** for longer arrays (`-XX:TieredStopAtLevel=1 -XX:+UnlockDiagnosticVMOptions -XX:-UseSHA3Intrinsics -jar ../benchmarks.jar -f 1 -wi 2 -i 3 -p digesterName=SHA3-256 -p length=16384 -jvmArgsAppend="-XX:-UseCompressedOops -XX:-PreserveFramePointer -Xmx31g -Xlog:gc+heap+coops=debug" MessageDigests.digest$`).
>
> Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision:
>
>   Accurate caller-saved regs definition

Marked as reviewed by aph (Reviewer).

-------------

PR Review: https://git.openjdk.org/jdk/pull/23152#pullrequestreview-2615249170

From qamai at openjdk.org Thu Feb 13 14:59:02 2025
From: qamai at openjdk.org (Quan Anh Mai)
Date: Thu, 13 Feb 2025 14:59:02 GMT
Subject: RFR: 8315066: Add unsigned bounds and known bits to TypeInt/Long [v43]
In-Reply-To:
References:
Message-ID:

> Hi,
>
> This patch adds unsigned bounds and known bits constraints to TypeInt and TypeLong. This opens more transformation opportunities in an elegant manner as well as helps avoid some ad-hoc rules in Hotspot.
>
> In general, a `TypeInt/Long` represents a set of values `x` that satisfies: `x s>= lo && x s<= hi && x u>= ulo && x u<= uhi && (x & zeros) == 0 && (x & ones) == ones`. These constraints are not independent, e.g. an int that lies in [0, 3] in signed domain must also lie in [0, 3] in unsigned domain and have all bits but the last 2 being unset. As a result, we must canonicalize the constraints (tighten the constraints so that they are optimal) before constructing a `TypeInt/Long` instance.
>
> This is extracted from #15440; node value transformations are left for later PRs. I have also added unit tests to verify the soundness of constraint normalization.
>
> Please kindly review, thanks a lot.
>
> Testing
>
> - [x] GHA
> - [x] Linux x64, tier 1-4

Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 56 commits:

 - Merge branch 'master' into unsignedbounds
 - Merge branch 'master' into unsignedbounds
 - harden SimpleCanonicalResult
 - number lemmas
 - include
 - clean up intn_t
 - refine first_violation
 - assignment operator
 - exhaustive tests
 - make con
 - ... and 46 more: https://git.openjdk.org/jdk/compare/c2fc9478...3cd25862

-------------

Changes: https://git.openjdk.org/jdk/pull/17508/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17508&range=42
  Stats: 2354 lines in 13 files changed: 1789 ins; 328 del; 237 mod
  Patch: https://git.openjdk.org/jdk/pull/17508.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/17508/head:pull/17508

PR: https://git.openjdk.org/jdk/pull/17508

From duke at openjdk.org Thu Feb 13 15:44:17 2025
From: duke at openjdk.org (Matthias Ernst)
Date: Thu, 13 Feb 2025 15:44:17 GMT
Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v22]
In-Reply-To:
References:
Message-ID:

On Sat, 8 Feb 2025 18:30:56 GMT, Matthias Ernst wrote:

>> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization.
>>
>> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`:
>>
>>
>> (base + (index + 1) << 8) & 255
>> => MulNode
>> (base + (index << 8 + 256)) & 255
>> => AddNode
>> ((base + index << 8) + 256) & 255
>>
>>
>> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction:
>>
>>
>> ((base + index << 8) + 256) & 255
>> => MulNode (this PR)
>> (base + index << 8) & 255
>> => MulNode (PR #6697)
>> base & 255 (loop invariant)
>>
>>
>> Implementation notes:
>> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR.
>> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~
>> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if you would like to see separate cases for these.~
>
> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision:
>
>   Reword correctness (fixes).

Hey, at this point I'm happy to paste in anything you tell me; I'm not sure it is an efficient process for me to iterate on the language until it passes the bar. Bear in mind the logic here isn't mine, I've just renamed and reworded @rwestrel's "is_always_zero" logic. The only addition is to recognize that a CONST node can be an LSHIFT node in spirit.

If you want to decouple this, I can wait for you to add the proof to https://github.com/openjdk/jdk/commit/8af3b27ce98bcb9cf0c155c98d6b9a9bc159aafe#diff-b1bd52f0743843e15452764f48ff43c15dd3192a28bfb684b34149f0e964996eR1749 and then I'll rebase my PR?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2656999057

From kvn at openjdk.org Thu Feb 13 16:10:26 2025
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 13 Feb 2025 16:10:26 GMT
Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v10]
In-Reply-To: <9sHQ2GZxt0TERM5ghWCA2hArWxsdIErWZIAEJ9e1N3I=.4928b81a-be09-43a8-94c6-75e7bd645ed9@github.com>
References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com>
 <9sHQ2GZxt0TERM5ghWCA2hArWxsdIErWZIAEJ9e1N3I=.4928b81a-be09-43a8-94c6-75e7bd645ed9@github.com>
Message-ID:

On Thu, 13 Feb 2025 01:25:11 GMT, Boris Ulasevich wrote:

>> This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache.
>>
>> The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density.
>> >> Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1?2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark. >> >> The numbers. Immutable data constitutes **~30%** on the nmehtod. Mutable data constitutes **~8%** of nmethod. Example (statistics collected on the CodeCacheStress benchmark): >> - nmethod_count:134000, total_compilation_time: 510460ms >> - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms, >> - total allocation size (mutable/immutable/nmentod): 64MB/192MB/488MB >> >> Functional testing: jtreg on arm/aarch/x86. >> Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks. >> >> Alternative solution (see comments): In the future, relocations can be moved to _immutable_data. > > Boris Ulasevich has updated the pull request incrementally with two additional commits since the last revision: > > - Address review comments: cleanup, move fields to avoid padding, fix CodeBlob purge to call os::free, fix nmethod::print, update Layout description > - add a separate adrp_movk function to to support targets located more than 4GB away Okay, then I will go ahead with my PR [23607](https://github.com/openjdk/jdk/pull/23607) which touches `CodeBlob` . ------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2657073767 From roland at openjdk.org Thu Feb 13 16:35:26 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 13 Feb 2025 16:35:26 GMT Subject: RFR: 8349139: C2: Div looses dependency on condition that guarantees divisor not null in counted loop Message-ID: The test crashes because of a division by zero. The `Div` node for that one is initially part of a counted loop. The control input of the node is cleared because the divisor is non zero. 
This is because the divisor depends on the loop phi and the type of the loop phi is narrowed down when the counted loop is created. Pre/main/post loops are created, unrolling happens, and the main loop loses its backedge. The `Div` node can then float above the zero trip guard for the main loop. When the zero trip guard is not taken, there's no guarantee the divisor is non zero, so the `Div` node should be pinned below it. I propose we revert the change I made with 8334724 which removed `PhaseIdealLoop::cast_incr_before_loop()`. The `CastII` that this method inserted was there to handle exactly this problem. It was added initially for a similar issue but with array loads. That problem with loads is handled some other way now and that's why I thought it was safe to proceed with the removal. The code in this patch is somewhat different from the one we had before for a couple of reasons: 1- assert predicate code evolved and so previous logic can't be resurrected as it was. 2- the previous logic has a bug. Regarding 1-: during pre/main/post loop creation, we used to add the `CastII` and then add assertion predicates (so assertion predicates depended on the `CastII`). Then when unrolling, when assertion predicates are updated, we would skip over the `CastII`. What I propose here is to add the `CastII` after assertion predicates are added. As a result, they don't depend on the `CastII` and there's no need for any extra logic when unrolling happens. This, however, doesn't work when the assertion predicates are added by RCE. In that case, I had to add logic to skip over the `CastII` (similar to what existed before I removed it). Regarding 2-: the previous implementation of `PhaseIdealLoop::cast_incr_before_loop()` would add the `CastII` at the first loop `Phi` it encounters that's a use of the loop increment: it's usually the iv but not always.
I tweaked the test case to show that this bug can actually cause a crash and changed the logic for `PhaseIdealLoop::cast_incr_before_loop()` accordingly. ------------- Commit messages: - fix & test Changes: https://git.openjdk.org/jdk/pull/23617/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23617&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349139 Stats: 137 lines in 6 files changed: 104 ins; 27 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/23617.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23617/head:pull/23617 PR: https://git.openjdk.org/jdk/pull/23617 From jkarthikeyan at openjdk.org Thu Feb 13 16:36:26 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 13 Feb 2025 16:36:26 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v4] In-Reply-To: References: <0iE7uPGSBpBdlgayY_gqBpuPay-XSpjMdaOuqdo-nhs=.1c7fa2cb-f1ea-4810-8fe6-2e0e6af7b8ac@github.com> Message-ID: On Thu, 13 Feb 2025 09:57:16 GMT, Emanuel Peter wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Add new conversions to benchmark > > src/hotspot/cpu/x86/matcher_x86.hpp line 273: > >> 271: if (to_bt == from_bt) { >> 272: return false; >> 273: } > > Hmm, do we expect that this ever gets triggered? Or would that be a bug? Maybe not, but could be worth adding a defensive assert here, what do you think? This is a good point, I think it shouldn't be possible to run into the case where `to_bt` and `from_bt` are the same here, so a defensive assert would be better.
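For readers following along, a defensive assert of the kind discussed could look roughly like the sketch below. It is only an illustration under stated assumptions: `BasicType` here is a small stand-in enum, `element_size` and `cast_changes_element_size` are hypothetical helpers, and the standard `assert` macro is used instead of HotSpot's own `assert(cond, msg)`. None of this is the actual `matcher_x86.hpp` code.

```cpp
#include <cassert>

// Stand-in for HotSpot's BasicType; only the kinds needed here are modelled.
enum BasicType { T_BYTE, T_SHORT, T_INT, T_LONG };

// Element size in bytes for the modelled types (hypothetical helper).
inline int element_size(BasicType bt) {
  switch (bt) {
    case T_BYTE:  return 1;
    case T_SHORT: return 2;
    case T_INT:   return 4;
    case T_LONG:  return 8;
  }
  return 0;  // not reached for the enumerators above
}

// Hypothetical query: does converting from_bt to to_bt change the element size?
inline bool cast_changes_element_size(BasicType from_bt, BasicType to_bt) {
  // Callers are not expected to ask about a same-type "cast". Rather than
  // quietly returning false, flag that situation as a bug in debug builds.
  assert(from_bt != to_bt && "same-type cast is not expected here");
  return element_size(from_bt) != element_size(to_bt);
}
```

With `NDEBUG` defined the check compiles away, so the function still returns a well-defined result (`false`) for identical types in release builds; the assert only makes an unexpected caller visible during testing.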
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1954844820 From jkarthikeyan at openjdk.org Thu Feb 13 16:36:27 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 13 Feb 2025 16:36:27 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v4] In-Reply-To: References: <0iE7uPGSBpBdlgayY_gqBpuPay-XSpjMdaOuqdo-nhs=.1c7fa2cb-f1ea-4810-8fe6-2e0e6af7b8ac@github.com> Message-ID: <9kOYtFtZe8B9TtPdhq0wN_uHSUKHlH330aczNIgwaBI=.6bff845b-b7ed-45b7-9e4f-d83f58fb8648@github.com> On Thu, 13 Feb 2025 10:10:34 GMT, Emanuel Peter wrote: >> test/hotspot/jtreg/compiler/loopopts/superword/TestCompatibleUseDefTypeSize.java line 333: >> >>> 331: applyIfPlatform = {"64-bit", "true"}, >>> 332: applyIf = {"AlignVector", "false"}, >>> 333: applyIfCPUFeature = {"avx", "true"}) >> >> This may be a little nit-picky. But why have a new test-file when this test here was already trying to cover the conversion cases? I think I wrote it back then, and was just too lazy to write all conversion cases. I'd suggest you move your cases up here ;) > > I think I added these tests when I was reworking `SuperWord::is_velt_basic_type_compatible_use_def`, which you are now touching as well ? I think I didn't realize this test was there when I was working on my patch, but I agree it makes sense to move it there for clarity. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1954845649 From rgiulietti at openjdk.org Thu Feb 13 16:41:29 2025 From: rgiulietti at openjdk.org (Raffaello Giulietti) Date: Thu, 13 Feb 2025 16:41:29 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: <6bNEQk9vpUjHGLQUfnl_4bAPzoaU99oLWe0EICKTUJM=.e896ddc6-6f3b-4681-82fd-1d86b5d89d7b@github.com> On Wed, 12 Feb 2025 19:45:34 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculate the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to `TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Improve explanation of logic Shouldn't L.6230 read as follows?
// LZCNT = 31 - (biased_exp - 127) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2657162630 From roland at openjdk.org Thu Feb 13 16:57:30 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 13 Feb 2025 16:57:30 GMT Subject: RFR: 8349139: C2: Div looses dependency on condition that guarantees divisor not null in counted loop [v2] In-Reply-To: References: Message-ID: > The test crashes because of a division by zero. The `Div` node for > that one is initially part of a counted loop. The control input of the > node is cleared because the divisor is non zero. This is because the > divisor depends on the loop phi and the type of the loop phi is > narrowed down when the counted loop is created. pre/main/post loops > are created, unrolling happens, the main loop looses its backedge. The > `Div` node can then float above the zero trip guard for the main > loop. When the zero trip guard is not taken, there's no guarantee the > divisor is non zero so the `Div` node should be pinned below it. > > I propose we revert the change I made with 8334724 which removed > `PhaseIdealLoop::cast_incr_before_loop()`. The `CastII` that this > method inserted was there to handle exactly this problem. It was added > initially for a similar issue but with array loads. That problem with > loads is handled some other way now and that's why I thought it was > safe to proceed with the removal. > > The code in this patch is somewhat different from the one we had > before for a couple reasons: > > 1- assert predicate code evolved and so previous logic can't be > resurrected as it was. > > 2- the previous logic has a bug. > > Regarding 1-: during pre/main/post loop creation, we used to add the > `CastII` and then to add assertion predicates (so assertion predicates > depended on the `CastII`). Then when unrolling, when assertion > predicates are updated, we would skip over the `CastII`. 
What I > propose here is to add the `CastII` after assertion predicates are > added. As a result, they don't depend on the `CastII` and there's no > need for any extra logic when unrolling happens. This, however, > doesn't work when the assertion predicates are added by RCE. In that > case, I had to add logic to skip over the `CastII` (similar to what > existed before I removed it). > > Regarding 2-: previous implementation for > `PhaseIdealLoop::cast_incr_before_loop()` would add the `CastII` at > the first loop `Phi` it encounters that's a use of the loop increment: > it's usually the iv but not always. I tweaked the test case to show, > this bug can actually cause a crash and changed the logic for > `PhaseIdealLoop::cast_incr_before_loop()` accordingly. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: - Merge branch 'master' into JDK-8349139 - fix & test ------------- Changes: https://git.openjdk.org/jdk/pull/23617/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23617&range=01 Stats: 136 lines in 6 files changed: 103 ins; 27 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/23617.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23617/head:pull/23617 PR: https://git.openjdk.org/jdk/pull/23617 From qamai at openjdk.org Thu Feb 13 16:57:30 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 13 Feb 2025 16:57:30 GMT Subject: RFR: 8349139: C2: Div looses dependency on condition that guarantees divisor not null in counted loop In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 16:30:20 GMT, Roland Westrelin wrote: > The test crashes because of a division by zero. The `Div` node for > that one is initially part of a counted loop. The control input of the > node is cleared because the divisor is non zero. This is because the > divisor depends on the loop phi and the type of the loop phi is > narrowed down when the counted loop is created. 
pre/main/post loops > are created, unrolling happens, the main loop looses its backedge. The > `Div` node can then float above the zero trip guard for the main > loop. When the zero trip guard is not taken, there's no guarantee the > divisor is non zero so the `Div` node should be pinned below it. > > I propose we revert the change I made with 8334724 which removed > `PhaseIdealLoop::cast_incr_before_loop()`. The `CastII` that this > method inserted was there to handle exactly this problem. It was added > initially for a similar issue but with array loads. That problem with > loads is handled some other way now and that's why I thought it was > safe to proceed with the removal. > > The code in this patch is somewhat different from the one we had > before for a couple reasons: > > 1- assert predicate code evolved and so previous logic can't be > resurrected as it was. > > 2- the previous logic has a bug. > > Regarding 1-: during pre/main/post loop creation, we used to add the > `CastII` and then to add assertion predicates (so assertion predicates > depended on the `CastII`). Then when unrolling, when assertion > predicates are updated, we would skip over the `CastII`. What I > propose here is to add the `CastII` after assertion predicates are > added. As a result, they don't depend on the `CastII` and there's no > need for any extra logic when unrolling happens. This, however, > doesn't work when the assertion predicates are added by RCE. In that > case, I had to add logic to skip over the `CastII` (similar to what > existed before I removed it). > > Regarding 2-: previous implementation for > `PhaseIdealLoop::cast_incr_before_loop()` would add the `CastII` at > the first loop `Phi` it encounters that's a use of the loop increment: > it's usually the iv but not always. I tweaked the test case to show, > this bug can actually cause a crash and changed the logic for > `PhaseIdealLoop::cast_incr_before_loop()` accordingly. 
My understanding here is that when the backedge is removed, the loop head (which is a merge) is idealized out, and as a result, the `Phi` is idealized to its only live input. I think the idealization of the `Phi` is the issue here, the `Phi` is pinned at the control input it is at, so the result of the idealization cannot float freely. As a result, I propose fixing the idealization of `Phi` to become a `CastNode` that has `UnknownControlDependency` on its input. What do you think? I think this may be true for other kind of merge points as well. For example, if you perform conditional elimination (the patch you have worked at) on this code: if (v > 0) { int y; if (b) { y = v; } else { y = v + 1; } x / y; } Then we can be very clever and see that `y` cannot be 0 because it is a `Phi`. However, if we also later see that `b == true`, then the `Phi` is removed and the division `x / v` does not have a control input, which is wrong. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23617#issuecomment-2657204970 From roland at openjdk.org Thu Feb 13 16:58:55 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 13 Feb 2025 16:58:55 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v8] In-Reply-To: References: Message-ID: > This change refactors `RShiftI`/`RshiftL` `Ideal`, `Identity` and > `Value` because the `int` and `long` versions are very similar and so > there's no logic duplication. In the process, support for some extra > transformations is added to `RShiftL`. I also added some new test > cases. 
Roland Westrelin has updated the pull request incrementally with two additional commits since the last revision: - Update src/hotspot/share/opto/mulnode.cpp Co-authored-by: Emanuel Peter - Update src/hotspot/share/opto/mulnode.cpp Co-authored-by: Emanuel Peter ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23438/files - new: https://git.openjdk.org/jdk/pull/23438/files/e4053783..0f1b76ab Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=06-07 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23438.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23438/head:pull/23438 PR: https://git.openjdk.org/jdk/pull/23438 From kvn at openjdk.org Thu Feb 13 17:05:05 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:05:05 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> <8VFudK82JuBbjj_s74lDlHd1TWurW8uiBbw2DutA-PU=.ec26075e-89ad-4caf-ae3f-f50e5407a5f6@github.com> Message-ID: On Thu, 13 Feb 2025 05:14:59 GMT, Chris Plummer wrote: >> Example please. > > static Class[] wrapperClasses = new Class[Number_Of_Kinds]; > wrapperClasses[NMethodKind] = NMethodBlob.class; > wrapperClasses[BufferKind] = BufferBlob.class; > ...; > wrapperClasses[SafepointKind] = SafepointBlob.class; > > > > CodeBlob cb = new CodeBlob(addr); > return wrapperClasses[cb.getKind()]; Done. >> I don't think we need it - the caller `CodeCache.createCodeBlobWrapper()` will throw `RuntimeException` when `null` is returned. > > I guess my real question is whether or not it can be considered normal behavior to return null. It seems it should never happen, which is why I was suggesting an assert.
With your suggested `wrapperClasses[]` we will get OOB exception. No need separate assert. >> `UncommonTrapKind` and `ExceptionKind` are not initialized for Client VM because corresponding `CodeBlobKind` values are not defined. See `CodeBlob.initialize()`. >> Their not initialized value will be 0 which matches `CodeBlobKind::None` value. Returning true in such case will be incorrect. > > Ok. Leaving UncommonTrapKind and ExceptionKind uninitialized seems a bit error prone. Perhaps they can be given some sort of INVALID value. Done. Initialized them to `Number_Of_Kinds + 1`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954886028 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954890522 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954891616 From kvn at openjdk.org Thu Feb 13 17:05:04 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:05:04 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v7] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <5LGcbNB2_MigrbHGKV3CY8e6z-1iioFUuiSvTU8-lNY=.af273d17-6ab5-4b12-ae41-e6900494b5ee@github.com> > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains eight additional commits since the last revision: - Update SA based on comments - Merge branch 'master' into 8349088 - Fix Zero VM build - Fix Minimal and Zero VM builds once more - Fix Minimal and Zero VM builds again - Add CodeBlob proxy vtable - Fix Zero and Minimal VM builds - 8349088: De-virtualize Codeblob and nmethod ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/b09ddce6..515495b2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=05-06 Stats: 11482 lines in 618 files changed: 7914 ins; 1738 del; 1830 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From roland at openjdk.org Thu Feb 13 17:06:15 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 13 Feb 2025 17:06:15 GMT Subject: RFR: 8349139: C2: Div looses dependency on condition that guarantees divisor not null in counted loop In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 16:54:18 GMT, Quan Anh Mai wrote: > My understanding here is that when the backedge is removed, the loop head (which is a merge) is idealized out, and as a result, the `Phi` is idealized to its only live input. I think the idealization of the `Phi` is the issue here, the `Phi` is pinned at the control input it is at, so the result of the idealization cannot float freely. As a result, I propose fixing the idealization of `Phi` to become a `CastNode` that has `UnknownControlDependency` on its input. What do you think? For any `Phi`? This seems like an issue that's specific to the counted loop iv. I don't think we want to add `CastII` nodes unless we're sure they are needed. 
I thought about adding the `CastII` when the loop backedge disappears but I think when this happens it's too late to find which of the `Phi` is the loop phi. > I think this may be true for other kind of merge points as well. For example, if you perform conditional elimination (the patch you have worked at) on this code: Are you talking about https://bugs.openjdk.org/browse/JDK-8275202 ? But then wouldn't we want to add the `CastII` only if one is needed, that is when that pass runs, so we don't end up with those casts all over the place? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23617#issuecomment-2657227412 From kvn at openjdk.org Thu Feb 13 17:14:59 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:14:59 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. 
> > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: rename SA argument ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/515495b2..61fdee68 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=06-07 Stats: 6 lines in 1 file changed: 2 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From cjplummer at openjdk.org Thu Feb 13 17:14:59 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Thu, 13 Feb 2025 17:14:59 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 10 Feb 2025 16:57:18 GMT, Vladimir Kozlov wrote: >>> What is the reason for switching from the virtualConstructor/hashMap approach to using getClassFor()? The hashmap is the model for JavaThread, MetaData, and CollectedHeap subtypes. >> >> I don't need any more mapping from CodeBlob class to corresponding virtual table name which does not exist anymore. `CodeBlob::_kind` field's value is used to determine which class should be used. >> >> I think `hashMap` is overkill here. I can construct array `Class cbClasses[]` and use `cbClasses[CodeBlob::_kind]` instead of `if/else` in `getClassFor`. But I would still need to check for unknown value of `CodeBlob::_kind` somehow. > >> impact on things like the "findpc" functionality > > Do you mean `findpc()` function in VM which is used in debugger? Nothing should be changed for it. 
> It calls `os::print_location()` which calls `CodeBlob::dump_for_addr(addr, st, verbose);`: > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/os.cpp#L1278 Actually I was referring to the clhsdb findpc command, which uses PointerFinder, but actually that should be ok because it special cases the codecache and knows how to find CodeBlobs in it. It's the clhsdb "inspect" command that will no longer be able to identify the type for an address that points to the start of a CodeBlob. This is true of any address that points to the start of a hotspot C++ object that does not have a vtable, or is not declared in vmstructs. So it's not a new issue, but is just adding more types to the list that "inspect" won't figure out. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954906641 From kvn at openjdk.org Thu Feb 13 17:14:59 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:14:59 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> <8VFudK82JuBbjj_s74lDlHd1TWurW8uiBbw2DutA-PU=.ec26075e-89ad-4caf-ae3f-f50e5407a5f6@github.com> Message-ID: On Thu, 13 Feb 2025 05:19:48 GMT, Chris Plummer wrote: >> `cbPc` with comment explaining that it could be inside code blob. > > That sounds fine. 
done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954906986 From never at openjdk.org Thu Feb 13 17:22:13 2025 From: never at openjdk.org (Tom Rodriguez) Date: Thu, 13 Feb 2025 17:22:13 GMT Subject: RFR: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:59:41 GMT, Doug Simon wrote: > The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so it's C++ type in HotSpot should match. Marked as reviewed by never (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23610#pullrequestreview-2615740718 From qamai at openjdk.org Thu Feb 13 17:22:13 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 13 Feb 2025 17:22:13 GMT Subject: RFR: 8349139: C2: Div looses dependency on condition that guarantees divisor not null in counted loop In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 17:03:06 GMT, Roland Westrelin wrote: > For any `Phi`? This seems like an issue that's specific to the counted loop iv. I don't think we want to add `CastII` nodes unless we're sure they are needed. I think it is probably because we are more aggressive with the type of loop phis. Conceptually, a `Phi` is pinned, and idealizing it into a floating node is incorrect. So, there may be issues lurking around due to this. We can introduce another phase that tries to remove all `CastNode` and pins all nodes that need that dependency (such as `Div`, loads, etc). What do you think? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23617#issuecomment-2657267497 From jrose at openjdk.org Thu Feb 13 17:25:13 2025 From: jrose at openjdk.org (John R Rose) Date: Thu, 13 Feb 2025 17:25:13 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument One related idea: The Vptr classes seem to be regular enough to be templated. That is, one class body, instantiated with a template argument for each code blob type (and probably another for the enum). That would remove some of the distracting per-class boilerplate. Something like: template class Vptr_Impl : public Vptr { override void print_on(const CodeBlob* instance, outputStream* st) const { assert(instance->kind() == Tkind, "sanity"); ((const CB_T*)instance)->print_on_impl(st); } override bool assert_sane(const CodeBlob* instance) { assert(instance->kind() == Tkind, ""); return true; } }; class CodeBlob { public: final Vptr* vptr() const { Vptr* vptr = vptr_array[_kind]; assert(vptr->assert_sane(this), "correct array element"); return vptr; } final void print_on(outputStream* st) const { vptr()->print_on(this, st); } }; Then: const Vptr* array[] = { &Vptr_Impl(), ... &Vptr_Impl(), ...
}; The array could be filled by a macro that tracks the enum members; I like that as a small job for macros (no code in it). Then: class UncommonTrapBlob : public OtherBlob { protected: // impl "M" method is not public void print_on_impl(outputStream* st) const { OtherBlob::print_on_impl(st); st->print("my field = %d", _my_field); } // Vptr needs to call impl method friend class Vptr_Impl; // this might break down, so make it all public in the end }; I don't see any reason the Vptr subclasses need to be related in any more detail as subs or supers. Well, C++ is a bag of surprises, so there are probably several reasons the above sketch is wrong. But something like it might add a little more readability and predictability to the code. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2657274388 From kvn at openjdk.org Thu Feb 13 17:29:14 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:29:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:22:18 GMT, John R Rose wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> rename SA argument > > One related idea: The Vptr classes seem to be regular enough to be templated. That is, one class body, instantiated with a template argument for each code blob type (and probably another for the enum). That would remove some of the distracting per-class boilerplate. Something like: > > > template > class Vptr_Impl : public Vptr { > override void print_on(const CodeBlob* instance, outputStream* st) const { > assert(instance->kind() == Tkind, "sanity"); > ((const CB_T*)instance)->print_on_impl(st); > } > ? 
> override bool assert_sane(cosnt CodeBlob* instance) { > assert(instance->kind() == Tkind, ""); > return true; > } > }; > > class CodeBlob { > public: > final Vptr* vptr() const { > Vptr* vptr = vptr_array[_kind]; > assert(vptr->assert_sant(this), "correct array element"); > return vptr; > } > final void print_on(outputStream* st) const { > vptr()->print_on(this, st); > } > }; > > > Then: > > > const Vptr* array[] = { > &Vptr_Impl(), > ... > &Vptr_Impl(), > ... > }; > > > The array could be filled by a macro that tracks the enum members; I like that as a small job for macros (no code in it). > > Then: > > > class UncommonTrapBlob : public OtherBlob { > protected: // impl "M" method is not public > void print_on_impl(outputStream* st) const { > OtherBlob::print_on_impl(st); > st->print("my field = %d", _my_field); > } > // Vptr needs to call impl method > friend class Vptr_Impl; // this might break down, so make it all public in the end > }; > > > I don't see any reason the Vptr subclasses need to be related in any more detail as subs or supers. > > Well, C++ is a bag of surprises, so there are probably several reasons the above sketch is wrong. But something like it might add a little more readability and predictability to the code. Thank you, @rose00 and @xmas92, for review and suggestions. Let me say it first - printing code for code blobs and nmethod is big mess. It requires separate big change to clean it up. 
For example, I have to go through CodeBlob's virtual dispatch `print_value_on_v()` for nmethod because some sets of `nmethod::print*()` are defined only in the debug VM: [nmethod.hpp#L919](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/nmethod.hpp#L919) Then `nmethod` has another mess which requires C++ trickery because it does not follow the print API in CodeBlob: void print(outputStream* st) const; // need to re-define this from CodeBlob else the overload hides it void print_on(outputStream* st) const override { CodeBlob::print_on(st); } void print_on(outputStream* st, const char* msg) const; ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2657282969 From kvn at openjdk.org Thu Feb 13 17:37:19 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:37:19 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument That said, I agree that I need to add comments explaining the printing API and how the Vptr class will work. I will work on @xmas92's suggestions and look into using `_impl`. I will try to look at the templates @rose00 suggested, but I don't want to complicate the code just for a few print methods.
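To make the proxy-table idea in this thread a little more concrete, here is a small standalone sketch of dispatching by a kind field instead of a per-object vtable pointer. Everything in it is an assumption made for illustration: `Kind`, `Blob`, `BufferBlob`, `NmethodBlob`, `VptrImpl` and `describe` are hypothetical names, the template parameters are one possible choice, and the code is not the HotSpot implementation or the exact shape proposed in the PR.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

enum class Kind : int { Buffer = 0, Nmethod = 1, Count = 2 };

struct Blob;

// Stateless proxy: it may have virtual methods, the blobs themselves do not.
struct Vptr {
  virtual std::string describe(const Blob* b) const = 0;
  virtual ~Vptr() = default;
};

struct Blob {
  Kind kind;
  int size;
  Blob(Kind k, int s) : kind(k), size(s) {}
  const Vptr* vptr() const;      // looked up from kind, never stored in the object
  std::string describe() const;  // non-virtual entry point
  std::string describe_impl() const { return "blob size=" + std::to_string(size); }
};

struct BufferBlob : Blob {
  explicit BufferBlob(int s) : Blob(Kind::Buffer, s) {}
  std::string describe_impl() const { return "buffer " + Blob::describe_impl(); }
};

struct NmethodBlob : Blob {
  int level;
  NmethodBlob(int s, int l) : Blob(Kind::Nmethod, s), level(l) {}
  std::string describe_impl() const {
    return "nmethod " + Blob::describe_impl() + " level=" + std::to_string(level);
  }
};

// Guard against virtual methods sneaking into the blob hierarchy.
static_assert(!std::is_polymorphic<Blob>::value, "Blob must not have virtual methods");
static_assert(!std::is_polymorphic<NmethodBlob>::value, "NmethodBlob must not have virtual methods");

// One class body, instantiated once per blob type and kind.
template <class BlobT, Kind K>
struct VptrImpl final : Vptr {
  std::string describe(const Blob* b) const override {
    assert(b->kind == K);  // the table entry must match the object's kind
    return static_cast<const BlobT*>(b)->describe_impl();
  }
};

inline const Vptr* Blob::vptr() const {
  static const VptrImpl<BufferBlob, Kind::Buffer> buffer_vptr{};
  static const VptrImpl<NmethodBlob, Kind::Nmethod> nmethod_vptr{};
  static const Vptr* const table[] = { &buffer_vptr, &nmethod_vptr };
  const int index = static_cast<int>(kind);
  assert(index >= 0 && index < static_cast<int>(Kind::Count));
  return table[index];
}

inline std::string Blob::describe() const { return vptr()->describe(this); }
```

Calling `describe()` through a `const Blob*` that points at an `NmethodBlob` reaches `NmethodBlob::describe_impl()` via the table, while the object itself carries no hidden vtable pointer; that property is what the thread describes as useful when blobs are saved to and restored from an AOT cache.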
-------------

PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2657303967

From kvn at openjdk.org  Thu Feb 13 18:04:17 2025
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 13 Feb 2025 18:04:17 GMT
Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8]
In-Reply-To:
References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com>
Message-ID: <3RrosS3Q-iEBqaD4hVGMfjY2hDGLqwWwSUqgT0Za1k4=.1e32f3f0-6677-4082-b100-ce9b4603ec80@github.com>

On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote:

>> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table.
>>
>> Added C++ static asserts to make sure no virtual methods are added in a future.
>>
>> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob.
>>
>> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp
>
> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision:
>
>   rename SA argument

> AFAICT `print_value_on` is unreachable

It is reachable in the product VM when `print_value_on_v()` is called for an `nmethod`, which does not have `print_value_on()` in the product VM. This could be solved by adding a simple `nmethod::print_value_on()` for the product VM, but that would change current behavior.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2657354310 From epeter at openjdk.org Thu Feb 13 18:32:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 18:32:17 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v22] In-Reply-To: References: Message-ID: On Sat, 8 Feb 2025 18:30:56 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. >> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: >> >> >> ((base + index << 8) + 256) & 255 >> => MulNode (this PR) >> (base + index << 8) & 255 >> => MulNode (PR #6697) >> base & 255 (loop invariant) >> >> >> Implementation notes: >> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. >> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ >> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > Reword correctness (fixes). Let me suggest this: // Checks whether expr is neutral additive element (zero) under mask, // i.e. 
whether an expression of the form:
//   (AndX (AddX expr addend) mask)
//   (expr + addend) & mask
// is equivalent to
//   (AndX addend mask)
//   addend & mask
// for any addend.
// (The X in AndX must be I or L, depending on bt).
//
// We check for the sufficient condition when the lowest set bit in expr is higher than
// the highest set bit in mask, i.e.:
//   expr: eeeeee0000000000000
//   mask: 000000mmmmmmmmmmmmm
//               <--w bits--->
// We do not test for other cases.
//
// Correctness:
//   Given "expr" with at least "w" trailing zeros,
//   let "mod = 2^w", "suffix_mask = mod - 1"
//
//   Since "mask" only has bits set where "suffix_mask" does, we have:
//     mask = suffix_mask & mask     (SUFFIX_MASK)
//
//   And since expr only has bits set above w, and suffix_mask only below:
//     expr & suffix_mask == 0     (NO_BIT_OVERLAP)
//
//   From unsigned modular arithmetic (with unsigned modulo %), and since mod is
//   a power of 2, and we are computing in a ring of powers of 2, we know that
//     (x + y) % mod         = (x % mod + y) % mod
//     (x + y) & suffix_mask = ((x & suffix_mask) + y) & suffix_mask     (MOD_ARITH)
//
//   We can now prove the equality:
//     (expr + addend) & mask
//     = (expr + addend) & suffix_mask & mask                     (SUFFIX_MASK)
//     = ((expr & suffix_mask) + addend) & suffix_mask & mask     (MOD_ARITH)
//     = (0 + addend) & suffix_mask & mask                        (NO_BIT_OVERLAP)
//     = addend & mask                                            (SUFFIX_MASK)
//
// Hence, an expr with at least w trailing zeros is a neutral additive element under any mask with bit width w.

I somewhat agree with @merykitty that `%` can be misleading, but that's why I mention unsigned modulo. My `MOD_ARITH` step is a little hand-wavy, but also short. @merykitty does it a bit more thoroughly.
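The claim in the proposed comment can be checked by brute force. The sketch below works on 32-bit unsigned values and is not the C2 code; the names `significant_bits`, `suffix_mask_for`, `is_zero_under_mask` and `neutral_for_addends` are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>  // for callers that check the results with assert
#include <cstdint>

// Number of significant bits in x (0 for x == 0). This is "w" for a mask.
constexpr int significant_bits(uint32_t x) {
  return x == 0 ? 0 : 1 + significant_bits(x >> 1);
}

// suffix_mask = 2^w - 1, where w is the bit width of mask.
constexpr uint32_t suffix_mask_for(uint32_t mask) {
  return significant_bits(mask) == 32
             ? 0xFFFFFFFFu
             : (uint32_t{1} << significant_bits(mask)) - 1;
}

// The sufficient condition: expr has at least w trailing zeros,
// i.e. expr & suffix_mask == 0 (NO_BIT_OVERLAP).
constexpr bool is_zero_under_mask(uint32_t expr, uint32_t mask) {
  return (expr & suffix_mask_for(mask)) == 0;
}

// Brute-force check of (expr + addend) & mask == addend & mask
// for the addends first, first + 1, ..., first + count - 1.
bool neutral_for_addends(uint32_t expr, uint32_t mask,
                         uint32_t first, uint32_t count) {
  for (uint32_t i = 0; i < count; ++i) {
    const uint32_t addend = first + i;  // unsigned arithmetic wraps modulo 2^32
    if (((expr + addend) & mask) != (addend & mask)) {
      return false;
    }
  }
  return true;
}
```

With the values from the PR description, `is_zero_under_mask(256, 255)` holds and `neutral_for_addends(256, 255, 0, 1u << 16)` returns true, including for addends near the 32-bit wrap-around. With mask 511 the condition fails, and so does the equality (for addend 0, `256 & 511` is 256, not 0).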
------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2657418539 From shade at openjdk.org Thu Feb 13 18:57:10 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 13 Feb 2025 18:57:10 GMT Subject: RFR: 8349915: CTW: Lots of level 3 compiles are done at level 2 after JDK-8348570 [v2] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:50:51 GMT, Aleksey Shipilev wrote: >> Noticed this in manual CTW runs after [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) that lots and lots of methods are compiled at level 2 instead of requested level 3: >> >> >> ... >> [97] javax.enterprise.deploy.shared.ActionType::getValue() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getOffset() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getEnumValueTable() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getStringTable() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getActionType(int) WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::toString() WARNING compilation level = 2, but not 3 >> [99] javax.enterprise.deploy.shared.DConfigBeanVersionType >> [98] javax.enterprise.deploy.shared.CommandType::toString() WARNING compilation level = 2, but not 3 >> [98] javax.enterprise.deploy.shared.CommandType::getOffset() WARNING compilation level = 2, but not 3 >> ... >> >> >> I narrowed it down to level downgrade in compilation policy here: >> https://github.com/openjdk/jdk/blob/ed17c55ea34b3b6009dab11d64f21e0b7af3d701/src/hotspot/share/compiler/compilationPolicy.cpp#L677 >> >> [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) enters here, because we mark all methods as having profiles to extend the CTW scope. So now `is_method_profiled(max_method_h)` is `true` and downgrade happens. 
There is already check for `!Arguments::is_compiler_only()` there, so I think we better exclude CTW from this downgrade as well. >> >> I looked at possibly making this kind of downgrade fatal in CTW runner, but the error propagation there is not simple. I filed [JDK-8349917](https://bugs.openjdk.org/browse/JDK-8349917) if anyone want to take a stab on it. >> >> I looked at other `set_comp_level()` uses in Hotspot, and this is the only place where it is called. So I presume we have caught all places where this downgrade can happen. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, eyeballing some manual CTW run results >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` passes > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Shortcut CTW tasks directly Thanks! I think I need a second Reviewer for this to go in :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23589#issuecomment-2657466093 From vlivanov at openjdk.org Thu Feb 13 19:01:21 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 13 Feb 2025 19:01:21 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v2] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Wed, 12 Feb 2025 21:09:31 GMT, Dean Long wrote: >> When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. >> >> In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. 
>> >> Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. I need help testing those asserts for arm32, ppc, riscv, and s390. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > fix bounds checks src/hotspot/share/runtime/deoptimization.cpp line 645: > 643: methodHandle method(current, deopt_sender.interpreter_frame_method()); > 644: Bytecode_invoke cur = Bytecode_invoke_check(method, deopt_sender.interpreter_frame_bci()); > 645: if (cur.is_invokedynamic() || cur.is_invokehandle()) { Can you elaborate, please, why invokedynamic case is not needed anymore? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23557#discussion_r1955062378 From psandoz at openjdk.org Thu Feb 13 19:26:11 2025 From: psandoz at openjdk.org (Paul Sandoz) Date: Thu, 13 Feb 2025 19:26:11 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:45:34 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. 
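The rounding problem described here can be reproduced with scalar code. This is only a sketch under stated assumptions: it is not the AVX2 implementation, the function names are hypothetical, `float` is assumed to be IEEE 754 binary32 with the default round-to-nearest mode, and the masking shown (clearing the low 8 bits of inputs that are at least 2^24) is one simple way to avoid the round-up; the actual patch may mask differently.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "sketch assumes IEEE 754 binary32 floats");

// Reference: count leading zeros by scanning from the top bit (32 for x == 0).
int lzcnt_reference(uint32_t x) {
  int n = 0;
  for (uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1) {
    ++n;
  }
  return n;
}

// Exponent-based count: convert to float and read the unbiased exponent.
// Wrong whenever the conversion rounds x up to the next power of two.
int lzcnt_via_float(uint32_t x) {
  assert(x != 0);
  const float f = static_cast<float>(x);
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  const int exponent = static_cast<int>((bits >> 23) & 0xFF) - 127;
  return 31 - exponent;
}

// Same idea, but inputs of 2^24 or more first lose their low 8 bits. The
// remaining value has at most 24 significant bits, so the conversion is exact.
int lzcnt_via_float_masked(uint32_t x) {
  if (x >= (1u << 24)) {
    x &= 0xFFFFFF00u;
  }
  return lzcnt_via_float(x);
}
```

For `0x01FFFFFF` the reference count is 7, the unmasked float version returns 6 because the conversion rounds up to 2^25, and the masked version returns 7 again.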
>> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Improve explanation of logic Thank you for fixing this. More broadly we should double check the intrinsics of Long/Integer.numberOfLeading/Trailing/Zeros (we added in the integration of Integration of JEP 426: Vector API) and follow up with any necessary tests and/or fixes in subsequent PRs. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2657520396 From cjplummer at openjdk.org Thu Feb 13 19:31:15 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Thu, 13 Feb 2025 19:31:15 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. 
>> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 97: > 95: // cbAddr - address of a code blob > 96: // cbPC - address inside of a code blob > 97: public CodeBlob createCodeBlobWrapper(Address cbAddr, Address cbPC) { Can you change findBlobUnsafe() above also? That's where the naming problem originated. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1955098013 From dnsimon at openjdk.org Thu Feb 13 19:38:19 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 13 Feb 2025 19:38:19 GMT Subject: RFR: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:59:41 GMT, Doug Simon wrote: > The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so it's C++ type in HotSpot should match. Thanks for the reviews. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23610#issuecomment-2657542024 From dnsimon at openjdk.org Thu Feb 13 19:38:19 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 13 Feb 2025 19:38:19 GMT Subject: Integrated: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong In-Reply-To: References: Message-ID: <7BPX92kK6cDWVILYcvyQXfSssFDFjv0XjIZQlGnlRhI=.6521b1d0-9c70-4054-a276-601536946443@github.com> On Thu, 13 Feb 2025 09:59:41 GMT, Doug Simon wrote: > The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so it's C++ type in HotSpot should match. This pull request has now been integrated. Changeset: a88e2a58 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/a88e2a58bf834081db55c2071d072567ea763354 Stats: 7 lines in 3 files changed: 0 ins; 0 del; 7 mod 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong Reviewed-by: yzheng, never ------------- PR: https://git.openjdk.org/jdk/pull/23610 From duke at openjdk.org Thu Feb 13 19:39:19 2025 From: duke at openjdk.org (Johannes Graham) Date: Thu, 13 Feb 2025 19:39:19 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v27] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:53:23 GMT, Emanuel Peter wrote: >> `calc_xor_max` is a static method in the cpp file and thus it is not preferrable to import it directly into the test file. > > I suppose this is more of a cosmetic concern, not as important. Up to you what you want to do @j3graham . I tried a few variations, and this seemed the least invasive one. I miss java package-scope. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1955109012 From duke at openjdk.org Thu Feb 13 19:44:18 2025 From: duke at openjdk.org (Johannes Graham) Date: Thu, 13 Feb 2025 19:44:18 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v27] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 08:57:01 GMT, Emanuel Peter wrote: >> Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: >> >> formatting, remove commented tests > > src/hotspot/share/opto/addnode.cpp line 1010: > >> 1008: if( r0 == TypeInt::BOOL && ( r1 == TypeInt::ONE >> 1009: || r1 == TypeInt::BOOL)) >> 1010: return TypeInt::BOOL; > > It looks to me like this case should be covered by `calc_xor_max` below. Do we have any IR tests that verify that this still gets optimized as before? Yes - it's essentially the 1-bit case of the more general function. I have some IR tests for it in XorINodeIdealizationTests - `testConstXorBool` and `testXorSelfBool`. I'll add another one. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1955115122 From duke at openjdk.org Thu Feb 13 19:47:15 2025 From: duke at openjdk.org (Johannes Graham) Date: Thu, 13 Feb 2025 19:47:15 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v27] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:54:22 GMT, Emanuel Peter wrote: >> I think the comment should be removed, the specs and implementation of `calc_xor_max` should be referred to when trying to understand this piece of code. > > I think a better name like `calculate_upper_bound_of_xor_with_non_negative_inputs` would help a lot, and make some comments redundant. I have removed the comment - it was left over from the original code. 
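For readers following the naming discussion, here is a small sketch of what an upper bound on the xor of two non-negative ranges can look like. It is an illustration, not the function in the PR: `or_shifted`, `fill_below_highest_bit`, `xor_upper_bound_non_neg` and `bound_holds_for_small_ranges` are hypothetical names, and the bound is sound but not necessarily tight. The reasoning is that a value in [0, hi] has no bit above the highest set bit of hi, so the xor of two such values has no bit above the highest set bit of `hi_0 | hi_1`.

```cpp
#include <cassert>  // for callers that check the results with assert
#include <cstdint>

constexpr uint32_t or_shifted(uint32_t v, int shift) { return v | (v >> shift); }

// Sets every bit at or below the highest set bit of v (0 stays 0).
constexpr uint32_t fill_below_highest_bit(uint32_t v) {
  return or_shifted(or_shifted(or_shifted(or_shifted(or_shifted(v, 1), 2), 4), 8), 16);
}

// Upper bound for a ^ b, given 0 <= a <= hi_0 and 0 <= b <= hi_1.
constexpr uint32_t xor_upper_bound_non_neg(uint32_t hi_0, uint32_t hi_1) {
  return fill_below_highest_bit(hi_0 | hi_1);
}

// Exhaustively checks the bound for all ranges with hi_0, hi_1 < limit.
bool bound_holds_for_small_ranges(uint32_t limit) {
  for (uint32_t hi_0 = 0; hi_0 < limit; ++hi_0) {
    for (uint32_t hi_1 = 0; hi_1 < limit; ++hi_1) {
      const uint32_t bound = xor_upper_bound_non_neg(hi_0, hi_1);
      for (uint32_t a = 0; a <= hi_0; ++a) {
        for (uint32_t b = 0; b <= hi_1; ++b) {
          if ((a ^ b) > bound) {
            return false;
          }
        }
      }
    }
  }
  return true;
}
```

The boolean case falls out as the 1-bit instance: `xor_upper_bound_non_neg(1, 1)` is 1. For `hi_0 = 4` and `hi_1 = 1` the bound is 7 while the largest reachable xor is 5, which shows that this particular sketch over-approximates. Calling `bound_holds_for_small_ranges(16)` checks all 4-bit ranges exhaustively.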
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1955118824 From duke at openjdk.org Thu Feb 13 19:54:13 2025 From: duke at openjdk.org (Johannes Graham) Date: Thu, 13 Feb 2025 19:54:13 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v27] In-Reply-To: References: Message-ID: <7dgx6Zd1xWic0DylfFel96syOpGro66pOtwnjYW-zF0=.ef956bde-add9-487b-9e25-8b9ed0dc5b86@github.com> On Thu, 13 Feb 2025 09:50:48 GMT, Emanuel Peter wrote: >> I think the name is good enough, calculate the max of a xor is a pretty self-explanatory name. You just need a better description for the method. Suggestion: >> >> Given 2 non-negative values in the ranges [0, hi_0] and [0, hi_1], respectively. The bitwise xor of these values should also be non-negative. This method calculates its maximum. > > What about calling it `calculate_upper_bound_of_xor_with_non_negative_inputs`? I've gone with `calc_xor_upper_bound_of_non_neg` and added the clarifying method description. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1955126507 From duke at openjdk.org Thu Feb 13 20:02:04 2025 From: duke at openjdk.org (Johannes Graham) Date: Thu, 13 Feb 2025 20:02:04 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v28] In-Reply-To: References: Message-ID: > C2 does not eliminate XOR nodes with constant arguments. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This patch demonstrates a potential fix to the problem, but there might well be better ways to do it. 
Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/4a291202..0e9b7cbe Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=27 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=26-27 Stats: 30 lines in 2 files changed: 13 ins; 6 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From liach at openjdk.org Thu Feb 13 20:07:09 2025 From: liach at openjdk.org (Chen Liang) Date: Thu, 13 Feb 2025 20:07:09 GMT Subject: RFR: 8349943: [JMH] Use jvmArgs consistently In-Reply-To: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> References: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> Message-ID: On Thu, 13 Feb 2025 08:35:47 GMT, Nicole Xu wrote: > As is suggested in [JDK-8342958](https://bugs.openjdk.org/browse/JDK-8342958), `jvmArgs` should be used consistently in microbenchmarks to 'align with the intuition that when you use jvmArgsAppend/-Prepend intent is to add to a set of existing flags, while if you supply jvmArgs intent is "run with these and nothing else"'. > > All the previous flags were aligned in https://github.com/openjdk/jdk/pull/21683, while some recent tests use inconsistent `jvmArgs` again. We update them to keep the consistency. The java.lang.foreign arg changes look fine. 
-------------

PR Review: https://git.openjdk.org/jdk/pull/23609#pullrequestreview-2616102653

From kvn at openjdk.org  Thu Feb 13 20:10:24 2025
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 13 Feb 2025 20:10:24 GMT
Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v10]
In-Reply-To:
References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> <9sHQ2GZxt0TERM5ghWCA2hArWxsdIErWZIAEJ9e1N3I=.4928b81a-be09-43a8-94c6-75e7bd645ed9@github.com>
Message-ID:

On Thu, 13 Feb 2025 11:55:17 GMT, Boris Ulasevich wrote:

> The change is not yet ready for final testing. I still need to remove my raw access workaround in nmethod::oop_at and rebase onto #23512 once it has been integrated.

Maybe that is why I see compiler tests fail with these changes when run with ZGC. I will look at that PR.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2657606702

From duke at openjdk.org  Thu Feb 13 20:28:51 2025
From: duke at openjdk.org (Johannes Graham)
Date: Thu, 13 Feb 2025 20:28:51 GMT
Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v29]
In-Reply-To:
References:
Message-ID:

> C2 does not eliminate XOR nodes with constant arguments. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic.
>
> This patch demonstrates a potential fix to the problem, but there might well be better ways to do it.
Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: update test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/0e9b7cbe..a5228989 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=28 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=27-28 Stats: 4 lines in 1 file changed: 2 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From duke at openjdk.org Thu Feb 13 21:07:16 2025 From: duke at openjdk.org (Johannes Graham) Date: Thu, 13 Feb 2025 21:07:16 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v29] In-Reply-To: References: Message-ID: <50ACG4t_gY0zBaYVOifz4pgo4DTOH-o2de--OPWNs60=.e760cce6-ab39-4f35-8676-0f4fd44c253c@github.com> On Thu, 13 Feb 2025 20:28:51 GMT, Johannes Graham wrote: >> An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. >> >> This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it. >> >> In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items: >> - Bounds optimization of xor >> - A check for `x ^ x = 0` >> - Explicit testing of xor over booleans. 
>> >> Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization. It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types. >> >> --------- >> ### Progress >> - [x] Change must not contain extraneous whitespace >> - [x] Commit message must refer to an issue >> - [ ] Change must be properly reviewed (2 reviews required, with at least 2 [Reviewers](https://openjdk.org/bylaws#reviewer)) >> >> >> >> ### Reviewers >> * [Quan Anh Mai](https://openjdk.org/census#qamai) (@merykitty - Committer) ? Re-review required (review applies to [cf779497](https://git.openjdk.org/jdk/pull/23089/files/cf77949776f7a4601268c7291a5743c2eb164186)) >> >> ### Reviewing >>
Using git >> >> Checkout this PR locally: \ >> `$ git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089` \ >> `$ git checkout pull/23089` >> >> Update a local copy of the PR: \ >> `$ git checkout pull/23089` \ >> `$ git pull https://git.openjdk.org/jdk.git pull/23089/head` >> >>
>>
Using Skara CLI tools >> >> Checkout this PR locally: \ >> `$ git pr checkout 23089` >> >> View PR using the GUI difftool: \ >> `$ git pr show -t 23089` >> >>
>>
Using diff file >> >> Download this PR as a diff file: \ >> https://git.openjdk.org/jdk/pull/23089.diff >> >>
>>
Using Webrev >> >> [Link to Webrev Comment](https://git.openjdk.org/jdk/pull/23089#issuecomment-25939... > > Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: > > update test I have updated the summary to be more informative. I believe the related optimizations are now covered in tests. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23089#issuecomment-2657707937 From duke at openjdk.org Thu Feb 13 21:16:33 2025 From: duke at openjdk.org (Johannes Graham) Date: Thu, 13 Feb 2025 21:16:33 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v30] In-Reply-To: References: Message-ID: > An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it. > > In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items: > - Bounds optimization of xor > - A check for `x ^ x = 0` > - Explicit testing of xor over booleans. > > Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization. It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types. 
Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: fix variable names in comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/a5228989..42b20b52 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=29 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=28-29 Stats: 5 lines in 1 file changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From dlong at openjdk.org Thu Feb 13 21:23:15 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 13 Feb 2025 21:23:15 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v2] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: <7xJgm0ScXMp4iRaH7Sf5QfsrTv2jOV4078kPqn3aoCs=.63303086-b4bd-47c5-9bd5-e69e28f75f4c@github.com> On Thu, 13 Feb 2025 18:57:24 GMT, Vladimir Ivanov wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> fix bounds checks > > src/hotspot/share/runtime/deoptimization.cpp line 645: > >> 643: methodHandle method(current, deopt_sender.interpreter_frame_method()); >> 644: Bytecode_invoke cur = Bytecode_invoke_check(method, deopt_sender.interpreter_frame_bci()); >> 645: if (cur.is_invokedynamic() || cur.is_invokehandle()) { > > Can you elaborate, please, why invokedynamic case is not needed anymore? As far as I can tell, it was never needed. If an invokedynamic or invokehandle adds an appendix, then it will show up in the callee, and will be reflected in the caller args size, so there is no mismatch. As far as the JVM is concerned, an invokedynamic/invokehandle looks like a call to a JVM-generated adapter. 
The only way for invokedynamic/invokehandle to cause an argument mismatch is if the JVM resolved the call-site to an adapter that was actually a MethodHandle linker. That is the exception I describe in the comment below. If we ever allowed the JVM to do that, then several other checks would also need to be fixed. For the record, this code used to call cur.is_method_handle_invoke(), which was also wrong, but at least it had a name closer to what we would want. Ideally, something like is_method_handle_linker_invoke() that checks for linkToVirtual, linkToStatic, linkToSpecial, and linkToInterface would have been better. The old comment about "arbitrary chains of calls" seems to be left over from an early JSR292 feature known as Ricochet Frames. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23557#discussion_r1955228726 From dlong at openjdk.org Thu Feb 13 21:26:12 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 13 Feb 2025 21:26:12 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v2] In-Reply-To: <7xJgm0ScXMp4iRaH7Sf5QfsrTv2jOV4078kPqn3aoCs=.63303086-b4bd-47c5-9bd5-e69e28f75f4c@github.com> References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> <7xJgm0ScXMp4iRaH7Sf5QfsrTv2jOV4078kPqn3aoCs=.63303086-b4bd-47c5-9bd5-e69e28f75f4c@github.com> Message-ID: On Thu, 13 Feb 2025 21:20:33 GMT, Dean Long wrote: >> src/hotspot/share/runtime/deoptimization.cpp line 645: >> >>> 643: methodHandle method(current, deopt_sender.interpreter_frame_method()); >>> 644: Bytecode_invoke cur = Bytecode_invoke_check(method, deopt_sender.interpreter_frame_bci()); >>> 645: if (cur.is_invokedynamic() || cur.is_invokehandle()) { >> >> Can you elaborate, please, why invokedynamic case is not needed anymore? > > As far as I can tell, it was never needed. 
If an invokedynamic or invokehandle adds an appendix, then it will show up in the callee, and will be reflected in the caller args size, so there is no mismatch. As far as the JVM is concerned, an invokedynamic/invokehandle looks like a call to a JVM-generated adapter. The only way for invokedynamic/invokehandle to cause an argument mismatch is if the JVM resolved the call-site to an adapter that was actually a MethodHandle linker. That is the exception I describe in the comment below. If we ever allowed the JVM to do that, then several other checks would also need to be fixed.
> For the record, this code used to call cur.is_method_handle_invoke(), which was also wrong, but at least it had a name closer to what we would want. Ideally, something like is_method_handle_linker_invoke() that checks for linkToVirtual, linkToStatic, linkToSpecial, and linkToInterface would have been better.
> The old comment about "arbitrary chains of calls" seems to be left over from an early JSR292 feature known as Ricochet Frames.

For the curious, it is still possible to create an arbitrarily long chain of linkTo calls, but only trusted code would be able to do that, so I'm not addressing this issue in this PR.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23557#discussion_r1955232327

From dlong at openjdk.org Thu Feb 13 21:35:12 2025
From: dlong at openjdk.org (Dean Long)
Date: Thu, 13 Feb 2025 21:35:12 GMT
Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v2]
In-Reply-To:
References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com>
Message-ID: <_jQq0sGaZijMT6Cr3rUdrQLlvVWNuJ8uILHg1qkCxoM=.271ce257-9e72-4d61-87fa-588ce4dbe107@github.com>

On Thu, 13 Feb 2025 06:49:50 GMT, Richard Reingruber wrote:

> The 2nd assert does not fail w/o the deoptimization.cpp fix. Might be due to alignment of caller->sp() in the interpreter.
Aarch64 also does alignment, and that's why the test uses two different methods, one with an extra local, to hopefully handle both cases of even/odd 2-word (16 byte) alignment. But ppc might be different enough that this isn't enough to trigger the bug. Or maybe the end of frame bound is slightly off? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23557#issuecomment-2657755434 From prr at openjdk.org Thu Feb 13 22:18:11 2025 From: prr at openjdk.org (Phil Race) Date: Thu, 13 Feb 2025 22:18:11 GMT Subject: RFR: 8349503: Consolidate multi-byte access into ByteArray In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 14:30:30 GMT, Per Minborg wrote: >> `MethodHandles.byteArrayViewVarHandle` exposes checked multi-byte access to byte arrays via VarHandle. This larger access speeds up many operations, yet it cannot be used in early bootstrap, and as a result, people tend to use `Unsafe` which can threaten memory safety of the Java Platform. >> >> To promote the safe use of multi-byte access, I propose to move the checked implementations from VarHandle to ByteArray to allow earlier use and reduce maintenance costs. In addition, ByteArrayLittleEndian is consolidated, and now the access methods are distinguished by BO (byte order) / BE (big endian) / LE (little endian) suffixes to indicate their access features. > > src/java.desktop/share/classes/javax/imageio/stream/ImageInputStreamImpl.java line 245: > >> 243: throw new EOFException(); >> 244: } >> 245: return (byteOrder == ByteOrder.BIG_ENDIAN) > > This could just be `ByteArray.getShortBO(byteBuff, 0, byteOrder == ByteOrder.BIG_ENDIAN)`. Same for the others. That would look cleaner. 
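To illustrate the call-site shape suggested in the quoted review comment, here is a minimal sketch. It assumes a helper with the signature `getShortBO(byte[], int, boolean)`; the class `ByteOrderReadSketch` and both method bodies are hypothetical stand-ins written with plain shifts, not the JDK-internal `ByteArray` implementation proposed in the PR.

```java
import java.nio.ByteOrder;

final class ByteOrderReadSketch {
    // Hypothetical stand-in for the proposed ByteArray.getShortBO;
    // this is NOT the JDK implementation, only an illustration.
    static short getShortBO(byte[] array, int offset, boolean bigEndian) {
        int b0 = array[offset] & 0xFF;      // first byte in memory order
        int b1 = array[offset + 1] & 0xFF;  // second byte in memory order
        // Big endian: first byte is the most significant one.
        return (short) (bigEndian ? (b0 << 8) | b1 : (b1 << 8) | b0);
    }

    // Call-site shape from the review comment: a single call, with the
    // byte-order test passed as the flag instead of a ternary around two calls.
    static short readShort(byte[] buf, ByteOrder order) {
        return getShortBO(buf, 0, order == ByteOrder.BIG_ENDIAN);
    }
}
```

The point of the suggestion is only that the byte-order branch moves into the helper, so each caller shrinks to one expression.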
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23478#discussion_r1955287769 From dlong at openjdk.org Thu Feb 13 22:50:17 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 13 Feb 2025 22:50:17 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/c1/Runtime1.java line 65: > 63: public CodeBlob blobFor(int id) { > 64: Address blobAddr = blobsField.getStaticFieldAddress().getAddressAt(id * VM.getVM().getAddressSize()); > 65: return VM.getVM().getCodeCache().createCodeBlobWrapper(blobAddr); We don't need to change all the callers if we keep a 1-arg version of createCodeBlobWrapper(): public CodeBlob createCodeBlobWrapper(Address codeBlobAddr) { return createCodeBlobWrapper(codeBlobAddr, codeBlobAddr); } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1955316582 From dlong at openjdk.org Thu Feb 13 23:04:17 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 13 Feb 2025 23:04:17 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> 
Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument src/hotspot/share/compiler/oopMap.cpp line 567: > 565: fr->print_on(tty); > 566: tty->print(" "); > 567: cb->print_value_on(tty); tty->cr(); We could minimize the number of files changed if we keep print_value_on() for compatibility: void print_value_on(outputStream* st) const { print_value_on_v(st); } src/hotspot/share/runtime/vframe.inline.hpp line 178: > 176: INTPTR_FORMAT " not found or invalid at %d", > 177: p2i(_frame.pc()), decode_offset); > 178: nm()->print_on_v(&ss); I suggest removing _v suffix to reduce changes and match existing naming. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1955325657 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1955327438 From dlong at openjdk.org Fri Feb 14 00:11:16 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 14 Feb 2025 00:11:16 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. 
>> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument src/hotspot/share/code/codeBlob.hpp line 669: > 667: > 668: jobject receiver() { return _receiver; } > 669: ByteSize frame_data_offset() { return _frame_data_offset; } `frame_data_offset()` seems to be unused. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1955373697 From dlong at openjdk.org Fri Feb 14 00:17:14 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 14 Feb 2025 00:17:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument HotSpot C++ changes look good. I skipped SA changes. ------------- Marked as reviewed by dlong (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2616477660 From syan at openjdk.org Fri Feb 14 01:59:09 2025 From: syan at openjdk.org (SendaoYan) Date: Fri, 14 Feb 2025 01:59:09 GMT Subject: RFR: 8349943: [JMH] Use jvmArgs consistently In-Reply-To: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> References: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> Message-ID: On Thu, 13 Feb 2025 08:35:47 GMT, Nicole Xu wrote: > As is suggested in [JDK-8342958](https://bugs.openjdk.org/browse/JDK-8342958), `jvmArgs` should be used consistently in microbenchmarks to 'align with the intuition that when you use jvmArgsAppend/-Prepend intent is to add to a set of existing flags, while if you supply jvmArgs intent is "run with these and nothing else"'. > > All the previous flags were aligned in https://github.com/openjdk/jdk/pull/21683, while some recent tests use inconsistent `jvmArgs` again. We update them to keep the consistency. LGTM ------------- Marked as reviewed by syan (Committer). PR Review: https://git.openjdk.org/jdk/pull/23609#pullrequestreview-2616591288 From duke at openjdk.org Fri Feb 14 04:39:56 2025 From: duke at openjdk.org (Johannes Graham) Date: Fri, 14 Feb 2025 04:39:56 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v31] In-Reply-To: References: Message-ID: > An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. 
It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it.
>
> In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items:
> - Bounds optimization of xor
> - A check for `x ^ x = 0`
> - Explicit testing of xor over booleans.
>
> Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization. It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types.
>
> ---------
> ### Progress
> - [x] Change must not contain extraneous whitespace
> - [x] Commit message must refer to an issue
> - [ ] Change must be properly reviewed (2 reviews required, with at least 2 [Reviewers](https://openjdk.org/bylaws#reviewer))
>
> ### Reviewers
> * [Quan Anh Mai](https://openjdk.org/census#qamai) (@merykitty - Committer) Re-review required (review applies to [cf779497](https://git.openjdk.org/jdk/pull/23089/files/cf77949776f7a4601268c7291a5743c2eb164186))
>
> ### Reviewing
>
> Using git
>
> Checkout this PR locally: \
> `$ git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089` \
> `$ git checkout pull/23089`
>
> Update a local copy of the PR: \
> `$ git checkout pull/23089` \
> `$ git pull https://git.openjdk.org/jdk.git pull/23089/head`
>
> Using Skara CLI tools
>
> Checkout this PR locally: \
> `$ git pr checkout 23089`
>
> View PR using the GUI difftool: \
> `$ git pr show -t 23089`
>
> Using diff file
>
> Download this PR as a diff file: \
> https://git.openjdk.org/jdk/pull/23089.diff
>
> Using Webrev
>
> [Link to Webrev Comment](https://git.openjdk.org/jdk/pull/23089#issuecomment-2593992282)
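The description above mentions an upper bound for xor and exhaustive tests over 4-bit ranges. As a rough illustration only, the following sketch shows one well-known way to bound `a ^ b` when both operands are known to be non-negative: the result cannot have a bit set above the highest bit set in either upper limit. The class and method names are hypothetical, and this is not claimed to be the calculation used in the patch.

```java
final class XorBoundSketch {
    // Upper bound for a ^ b given 0 <= a <= hiA and 0 <= b <= hiB.
    // Assumption: both ranges are non-negative; this is an illustration,
    // not the code from the pull request.
    static int xorUpperBound(int hiA, int hiB) {
        if (hiA < 0 || hiB < 0) {
            throw new IllegalArgumentException("non-negative limits expected");
        }
        int m = hiA | hiB;
        // All bits at or below the highest set bit of (hiA | hiB) may be set.
        return m == 0 ? 0 : (-1 >>> Integer.numberOfLeadingZeros(m));
    }

    // Exhaustive check in the spirit of the 4-bit tests mentioned above:
    // every pair (a, b) inside the ranges must respect the bound.
    static boolean holdsForAllLimitsUpTo(int limit) {
        for (int hiA = 0; hiA <= limit; hiA++) {
            for (int hiB = 0; hiB <= limit; hiB++) {
                int bound = xorUpperBound(hiA, hiB);
                for (int a = 0; a <= hiA; a++) {
                    for (int b = 0; b <= hiB; b++) {
                        if ((a ^ b) > bound) {
                            return false;
                        }
                    }
                }
            }
        }
        return true;
    }
}
```

The bound is sound but not always tight: with limits 4 and 0 it yields 7, while the largest reachable value is 4.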
Johannes Graham has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 45 commits: - Merge branch 'openjdk:master' into xor_const - fix variable names in comments - update test - address review comments - formatting, remove commented tests - add IR tests for long, simplify tests for int - formatting - add sanity asserts to tests - re-add tests - try fewer tests - ... and 35 more: https://git.openjdk.org/jdk/compare/ff52859d...16049cdc ------------- Changes: https://git.openjdk.org/jdk/pull/23089/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=30 Stats: 366 lines in 5 files changed: 323 ins; 25 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From xgong at openjdk.org Fri Feb 14 06:26:10 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 14 Feb 2025 06:26:10 GMT Subject: RFR: 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations In-Reply-To: References: Message-ID: <4MpRHYuBSykPsmf5fBiM19eDqaecdNlJCr85x60XIRI=.37c7cdba-5baf-462f-8e43-48b3677a22b1@github.com> On Thu, 13 Feb 2025 11:50:13 GMT, Bhavana Kilambi wrote: >> Since PR [1] has added several new vector operations in VectorAPI and the X86 backend implementation for them, this patch adds the AArch64 backend part for NEON/SVE architectures. >> >> The performance of Vector API relative JMH micro benchmarks can improve about 70x ~ 95x on a NVIDIA Grace CPU, which is a 128-bit vector length sve2 architecture, with different UseSVE options. 
Here is the gain details:
>>
>>
>> Benchmark                  (size)  Mode   Cnt  -XX:UseSVE=0  -XX:UseSVE=1  -XX:UseSVE=2
>> ByteMaxVector.SADD           1024  thrpt   30        80.69x        79.70x       80.534x
>> ByteMaxVector.SADDMasked     1024  thrpt   30        84.08x        85.72x       85.901x
>> ByteMaxVector.SSUB           1024  thrpt   30        80.46x        80.27x       81.063x
>> ByteMaxVector.SSUBMasked     1024  thrpt   30        83.96x        85.26x       85.887x
>> ByteMaxVector.SUADD          1024  thrpt   30        80.43x        80.36x       81.761x
>> ByteMaxVector.SUADDMasked    1024  thrpt   30        83.40x        84.62x       85.199x
>> ByteMaxVector.SUSUB          1024  thrpt   30        79.93x        79.22x       79.714x
>> ByteMaxVector.SUSUBMasked    1024  thrpt   30        82.93x        85.02x       84.726x
>> ByteMaxVector.UMAX           1024  thrpt   30        78.73x        77.39x       78.220x
>> ByteMaxVector.UMAXMasked     1024  thrpt   30        82.62x        84.77x       85.531x
>> ByteMaxVector.UMIN           1024  thrpt   30        79.04x        77.80x       78.471x
>> ByteMaxVector.UMINMasked     1024  thrpt   30        83.11x        84.86x       86.126x
>> IntMaxVector.SADD            1024  thrpt   30        83.11x        83.07x       83.183x
>> IntMaxVector.SADDMasked      1024  thrpt   30        90.67x        91.80x       93.162x
>> IntMaxVector.SSUB            1024  thrpt   30        83.37x        82.82x       83.317x
>> IntMaxVector.SSUBMasked      1024  thrpt   30        90.85x        92.87x       94.201x
>> IntMaxVector.SUADD           1024  thrpt   30        82.76x        81.78x       82.679x
>> IntMaxVector.SUADDMasked     1024  thrpt   30        90.49x        91.93x       93.155x
>> IntMaxVector.SUSUB           1024  thrpt   30        82.92x        82.34x       82.525x
>> IntMaxVector.SUSUBMasked     1024  thrpt   30        90.60x        92.12x       92.951x
>> IntMaxVector.UMAX            1024  thrpt   30        82.40x        81.85x       82.242x
>> IntMaxVector.UMAXMasked      1024  thrpt   30        90.30x        92.10x       92.587x
>> IntMaxVector.UMIN            1024  thrpt   30        82.84x        81.43x       82.801x
>> IntMaxVector.UMINMasked      1024  thrpt   30        90.43x        91.49x       92.678x
>> LongMaxVector.SADD 102...
>
> src/hotspot/cpu/aarch64/aarch64_vector.ad line 1574:
>
>> 1572: instruct vsqadd_masked(vReg dst_src1, vReg src2, pRegGov pg) %{
>> 1573: predicate(UseSVE == 2 && !n->as_SaturatingVector()->is_unsigned());
>> 1574: match(Set dst_src1 (SaturatingAddV (Binary dst_src1 src2) pg));
>
> for the masked match rules, should we also add `USE_DEF` effect for `dst_src1` to indicate that this register is both read and written to destructively ? I see that other similarly defined match rules in the ad file do not have this effect defined but I am wondering if this should be done?

Hi @Bhavana-Kilambi, thanks for looking at this PR! And yes, the `dst_src1` should be `USE_DEF` actually, but I think it's safe not to add the effect here manually. The adlc compiler will add the use-def information for each operand when parsing each match rule. You may look at the code details at https://github.com/openjdk/jdk/blob/master/src/hotspot/share/adlc/formssel.cpp#L939 .

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23608#discussion_r1955617857

From duke at openjdk.org Fri Feb 14 07:06:01 2025
From: duke at openjdk.org (Matthias Ernst)
Date: Fri, 14 Feb 2025 07:06:01 GMT
Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v23]
In-Reply-To:
References:
Message-ID:

> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization.
>
> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`:
>
>
> (base + (index + 1) << 8) & 255
> => MulNode
> (base + (index << 8 + 256)) & 255
> => AddNode
> ((base + index << 8) + 256) & 255
>
>
> Currently, `256` is not being recognized as a shifted value.
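The reason a constant such as `256` is harmless under the mask `255` is that it has no bits in common with the mask, so adding it cannot change the masked low bits. The following sketch checks that identity numerically. The class and method names are hypothetical, the shift of 8 and the constant 256 are taken from the example above, and the parentheses are written out explicitly because Java's `+` binds tighter than `<<`.

```java
final class MaskOffsetSketch {
    static final int SHIFT = 8;
    static final long MASK = (1L << SHIFT) - 1; // 255

    // The expression with the extra constant, fully parenthesized.
    static long maskedWithOffset(long base, long index) {
        return (base + (index << SHIFT) + 256) & MASK;
    }

    // What the expression can be reduced to.
    static long maskedBaseOnly(long base) {
        return base & MASK;
    }

    // A constant can be dropped under the mask when it shares no bits with it.
    // This is the general condition, used here purely for illustration.
    static boolean constantIsInvisibleUnderMask(long constant) {
        return (constant & MASK) == 0;
    }
}
```

Because Java integer arithmetic wraps modulo 2^64 and 256 divides 2^64, the identity also holds for negative values and on overflow.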
This PR enables further reduction: > > > ((base + index << 8) + 256) & 255 > => MulNode (this PR) > (base + index << 8) & 255 > => MulNode (PR #6697) > base & 255 (loop invariant) > > > Implementation notes: > * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. > * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ > * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ Matthias Ernst has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 40 additional commits since the last revision: - Merge branch 'openjdk:master' into mernst/JDK-8346664 - Reword correctness (fixes). - Merge branch 'openjdk:master' into mernst/JDK-8346664 - Reword correctness. - Reword correctness. - Comments, "Proof", order of checks. - Apply suggestions from code review Co-authored-by: Emanuel Peter - jlong, not long - Merge branch 'openjdk:master' into mernst/JDK-8346664 - dropped bug ref. - ... 
and 30 more: https://git.openjdk.org/jdk/compare/1fbd623d...c1ea9576 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22856/files - new: https://git.openjdk.org/jdk/pull/22856/files/09f01e80..c1ea9576 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=22 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=21-22 Stats: 13597 lines in 743 files changed: 9862 ins; 1449 del; 2286 mod Patch: https://git.openjdk.org/jdk/pull/22856.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22856/head:pull/22856 PR: https://git.openjdk.org/jdk/pull/22856 From duke at openjdk.org Fri Feb 14 07:18:52 2025 From: duke at openjdk.org (Matthias Ernst) Date: Fri, 14 Feb 2025 07:18:52 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24] In-Reply-To: References: Message-ID: > Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. > > Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: > > > (base + (index + 1) << 8) & 255 > => MulNode > (base + (index << 8 + 256)) & 255 > => AddNode > ((base + index << 8) + 256) & 255 > > > Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: > > > ((base + index << 8) + 256) & 255 > => MulNode (this PR) > (base + index << 8) & 255 > => MulNode (PR #6697) > base & 255 (loop invariant) > > > Implementation notes: > * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. 
> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ > * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: incorporate @eme64's comment suggestions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22856/files - new: https://git.openjdk.org/jdk/pull/22856/files/c1ea9576..b7a16a17 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=23 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22856&range=22-23 Stats: 22 lines in 1 file changed: 16 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/22856.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22856/head:pull/22856 PR: https://git.openjdk.org/jdk/pull/22856 From duke at openjdk.org Fri Feb 14 07:22:25 2025 From: duke at openjdk.org (Matthias Ernst) Date: Fri, 14 Feb 2025 07:22:25 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 07:18:52 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. >> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. 
This PR enables further reduction:
>>
>>
>> ((base + index << 8) + 256) & 255
>> => MulNode (this PR)
>> (base + index << 8) & 255
>> => MulNode (PR #6697)
>> base & 255 (loop invariant)
>>
>>
>> Implementation notes:
>> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR.
>> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~
>> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~
>
> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision:
>
> incorporate @eme64's comment suggestions

I like it, thanks a lot for finishing that up! Incorporated and pushed your version.

I was _trying_ to avoid a `%` operator and argue with "congruence", e.g. if a and b `equiv` mod m, then a & mask = b & mask, but did not succeed at it :-)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2658470281

From dchuyko at openjdk.org Fri Feb 14 07:57:15 2025
From: dchuyko at openjdk.org (Dmitry Chuyko)
Date: Fri, 14 Feb 2025 07:57:15 GMT
Subject: Integrated: 8347917: AArch64: Enable upper GPR registers in C1
In-Reply-To:
References:
Message-ID:

On Thu, 16 Jan 2025 12:26:43 GMT, Dmitry Chuyko wrote:

> This small change enables upper GPR registers in C1 so they are used, and used similar to C2. r19-r26 are declared as caller-saved and enabled, r27 (rheapbase) is declared caller-saved, r27 (rheapbase) and r29 (fp) are enabled conditionally similar to C2. r29 is already handled in MacroAssembler::build_frame()/remove_frame().
>
> r18 is excluded on macOS and Windows as before.
r27 is excluded when `UseCompressedOops` is on and `CompressedOops::base() != nullptr`; r29 is excluded when `PreserveFramePointer` is on.
>
> Registers are declared caller-saved in c1_FrameMap_aarch64.cpp, conditionally enabled ones are in the tail of the enabled range, which is adjusted in c1_FrameMap_aarch64.hpp; the code there was made similar to x86 (JDK-6985015).
>
> Register ranges are also updated in the linear scan itself and in OOP map generation.
>
> Having more allocatable registers helps to avoid spills in register-hungry code and thus improves performance and code density and simplifies compilation. In practice, code that operates on so many values is not too frequent, and upper registers are used less frequently than the first ones. To perform testing it turned out to be useful to run C1 in a special mode where registers are allocated from upper to lower in LinearScanWalker::find_free_reg():
>
>
> - for (int i = _first_reg; i <= _last_reg; i++) {
> + for (int i = _last_reg; i >= _first_reg; i--) {
>
>
> It was also useful to run the JVM with C1 compilation only and with different GCs and small heaps like `-XX:TieredStopAtLevel=1 -Xmx256m -XX:+UseSerialGC`.
>
> Tier1-3 jtreg tests showed no regression on linux-aarch64 (release, slowdebug, Xcomp) with either direct or reversed register allocation order. Windows and macOS were also tested to check r18 handling; +-CompressedOops and +-PreserveFramePointer combinations were tested.
>
> The SHA3 Java implementation is an example of register-hungry code. Throughput results greatly depend on the actual CPU being used. On Graviton 2 the improvement in the dedicated micro-benchmark is ~**19%** for longer arrays (`-XX:TieredStopAtLevel=1 -XX:+UnlockDiagnosticVMOptions -XX:-UseSHA3Intrinsics -jar ../benchmarks.jar -f 1 -wi 2 -i 3 -p digesterName=SHA3-256 -p length=16384 -jvmArgsAppend="-XX:-UseCompressedOops -XX:-PreserveFramePointer -Xmx31g -Xlog:gc+heap+coops=debug" MessageDigests.digest$`).

This pull request has now been integrated.

Changeset: 57f4c30f
Author: Dmitry Chuyko
URL: https://git.openjdk.org/jdk/commit/57f4c30fb6be1da57c8fcc742b5c36d842eef397
Stats: 82 lines in 7 files changed: 56 ins; 2 del; 24 mod

8347917: AArch64: Enable upper GPR registers in C1

Reviewed-by: aph

-------------

PR: https://git.openjdk.org/jdk/pull/23152

From rcastanedalo at openjdk.org Fri Feb 14 08:22:41 2025
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Fri, 14 Feb 2025 08:22:41 GMT
Subject: RFR: 8350006: IGV: show memory slices as type information
Message-ID:

This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node. This information can be useful e.g. in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). Here is an example of a memory subgraph with the extended "Show types" filter enabled:

![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717)

#### Testing

- tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode).

- Tested IGV manually on a few selected graphs. Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`).

-------------

Commit messages:
 - Display mem slice when the 'Show types' filter is enabled

Changes: https://git.openjdk.org/jdk/pull/23621/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23621&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8350006
Stats: 13 lines in 1 file changed: 8 ins; 1 del; 4 mod
Patch: https://git.openjdk.org/jdk/pull/23621.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23621/head:pull/23621

PR: https://git.openjdk.org/jdk/pull/23621

From dlunden at openjdk.org Fri Feb 14 08:59:09 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Fri, 14 Feb 2025 08:59:09 GMT
Subject: RFR: 8350006: IGV: show memory slices as type information
In-Reply-To:
References:
Message-ID:

On Thu, 13 Feb 2025 19:47:25 GMT, Roberto Castañeda Lozano wrote:

> This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node. This information can be useful e.g. in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). Here is an example of a memory subgraph with the extended "Show types" filter enabled:
>
> ![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717)
>
> #### Testing
>
> - tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode).
>
> - Tested IGV manually on a few selected graphs. Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`).

Looks good and useful. It would be nice if we would not need to rely on parsing `dump_spec` for the required information, which is potentially quite fragile if `dump_spec`s ever change. Maybe, in the future, we could dump node information in a more structured manner? E.g., `dump_spec_json` or similar.
I guess it's probably too much work for too little gain.

-------------

Marked as reviewed by dlunden (Committer).

PR Review: https://git.openjdk.org/jdk/pull/23621#pullrequestreview-2617121231

From epeter at openjdk.org Fri Feb 14 08:59:17 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 14 Feb 2025 08:59:17 GMT
Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24]
In-Reply-To:
References:
Message-ID:

On Fri, 14 Feb 2025 07:19:47 GMT, Matthias Ernst wrote:

>> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision:
>>
>> incorporate @eme64's comment suggestions
>
> I like it, thanks a lot for finishing that up! Incorporated and pushed your version.
>
> I was _trying_ to avoid a `%` operator and argue with "congruence", e.g. if a and b `equiv` mod m, then a & mask = b & mask, but did not succeed at it :-)

@mernst-github Glad you like it. I ran testing one more time, since the VM code changed slightly since last time, I think. If @merykitty doesn't disagree we can hopefully integrate after the weekend :)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2658652671

From rcastanedalo at openjdk.org Fri Feb 14 09:16:09 2025
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Fri, 14 Feb 2025 09:16:09 GMT
Subject: RFR: 8350006: IGV: show memory slices as type information
In-Reply-To:
References:
Message-ID:

On Fri, 14 Feb 2025 08:56:42 GMT, Daniel Lundén wrote:

> Looks good and useful.

Thanks Daniel!

> It would be nice if we would not need to rely on parsing `dump_spec` for the required information, which is potentially quite fragile if `dump_spec`s ever change. Maybe, in the future, we could dump node information in a more structured manner? E.g., `dump_spec_json` or similar. I guess it's probably too much work for too little gain.
I agree that that would be more robust, on the other hand using the available `dump_spec` information means that the filter can be readily used to inspect graphs generated from older JVM versions, without having to backport HotSpot changes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2658692045 From dlunden at openjdk.org Fri Feb 14 09:31:11 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 14 Feb 2025 09:31:11 GMT Subject: RFR: 8350006: IGV: show memory slices as type information In-Reply-To: References: Message-ID: <_uN-TaiL-DsHO247VG8ILqn1lHOTphN_WxRLkvNFQDc=.cbc39c3e-c1b0-4b5d-8679-f812100ddcd3@github.com> On Fri, 14 Feb 2025 09:13:40 GMT, Roberto Castañeda Lozano wrote: > I agree that that would be more robust, on the other hand using the available `dump_spec` information means that the filter can be readily used to inspect graphs generated from older JVM versions, without having to backport HotSpot changes. True, assuming the `dump_spec` output format remains backward compatible (which is probably a quite reasonable assumption). It's difficult to motivate spending time on something like `dump_spec_json` given that IGV is a developer tool. Maybe if there are also other use cases. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2658728085 From epeter at openjdk.org Fri Feb 14 09:46:35 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 14 Feb 2025 09:46:35 GMT Subject: RFR: 8315066: Add unsigned bounds and known bits to TypeInt/Long [v43] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 14:59:02 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch adds unsigned bounds and known bits constraints to TypeInt and TypeLong. This opens more transformation opportunities in an elegant manner as well as helps avoid some ad-hoc rules in Hotspot.
>> >> In general, a `TypeInt/Long` represents a set of values `x` that satisfies: `x s>= lo && x s<= hi && x u>= ulo && x u<= uhi && (x & zeros) == 0 && (x & ones) == ones`. These constraints are not independent, e.g. an int that lies in [0, 3] in signed domain must also lie in [0, 3] in unsigned domain and have all bits but the last 2 being unset. As a result, we must canonicalize the constraints (tighten the constraints so that they are optimal) before constructing a `TypeInt/Long` instance. >> >> This is extracted from #15440 , node value transformations are left for later PRs. I have also added unit tests to verify the soundness of constraint normalization. >> >> Please kindly review, thanks a lot. >> >> Testing >> >> - [x] GHA >> - [x] Linux x64, tier 1-4 > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 56 commits: > > - Merge branch 'master' into unsignedbounds > - Merge branch 'master' into unsignedbounds > - harden SimpleCanonicalResult > - number lemmas > - include > - clean up intn_t > - refine first_violation > - assignment operator > - exhaustive tests > - make con > - ... and 46 more: https://git.openjdk.org/jdk/compare/c2fc9478...3cd25862 Things are looking steadily better! I have a few more suggestions. src/hotspot/share/opto/rangeinference.cpp line 149: > 147: > 148: Call r the smallest value not smaller than lo that satisfies bits. Since lo > 149: does not satisfy bits, lo < r (2.7) You could reorder things to make it more straight forward: - First define `r`: - the smallest value not smaller than lo that satisfies bits - Then construct `v` - Lemma: `v` satisfies bits (probably helpful to get started...) - Statement: `r = v` - Proof - (a): `r <= v` (using the Lemma) - (b): `r >= v` (via contradiction if `r < v`) src/hotspot/share/opto/rangeinference.cpp line 153: > 151: a. 
Firstly, we prove that r <= v: > 152: > 153: Trivially, lo < v since: Suggestion: We know that lo < v since: It's a little funny to say it's trivial and then go ahead and prove it. src/hotspot/share/opto/rangeinference.cpp line 158: > 156: bits at x > i have lower significance, and are thus irrelevant > 157: > 158: As established above, the first (i + 1) bits of v satisfy bits. Two things: - I don't see immediately how it is true - And we are not trying to prove that v satisfies bits here. But we should prove that somewhere, just not under section `a. Firstly, we prove that r <= v:`, right? src/hotspot/share/opto/rangeinference.cpp line 164: > 162: > 163: As a result, v > lo and v satisfies bits since all of its bits satisfy bits. Which > 164: means r <= v since r is the smallest such value. This is a little quick... - You proved `v > lo` above, so give it a name and reference it here. - `v satisfies bits` can be its separate step as I suggested above. Give it a name and reference it here. - `v` is a value larger than `lo` that satisfies bits, and `r` is the smallest value larger than `lo` that satisfies bits, hence `v >= r`. src/hotspot/share/opto/rangeinference.cpp line 166: > 164: means r <= v since r is the smallest such value. > 165: > 166: b. Secondly, we prove that r == v. Suppose r < v: Suggestion: b. Secondly, we prove that r >= v.
Suppose r < v: src/hotspot/share/opto/rangeinference.cpp line 179: > 177: v[j] == 1 (according to 2.b.2) > 178: r[x] == v[x], for x < j (according to 2.b.3) > 179: v[x] == lo[x], for x < j (according to 2,4 because j < i) Suggestion: v[x] == lo[x], for x < j (according to 2.4 because j < i) src/hotspot/share/opto/rangeinference.cpp line 184: > 182: r[j] == 0 > 183: lo[j] == 1 > 184: r[x] == lo[x], for x < j Suggestion: r[j] == 0 (according to 2.b.1) lo[j] == 1 (lo[j] == v[j] == 1, according to 2.4 because j < i, and according to 2.b.2) r[x] == lo[x], for x < j (according to 2.4 because j < i) Then you could even drop the other list above, right? ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17508#pullrequestreview-2617160068 PR Review Comment: https://git.openjdk.org/jdk/pull/17508#discussion_r1955816531 PR Review Comment: https://git.openjdk.org/jdk/pull/17508#discussion_r1955799188 PR Review Comment: https://git.openjdk.org/jdk/pull/17508#discussion_r1955809601 PR Review Comment: https://git.openjdk.org/jdk/pull/17508#discussion_r1955828802 PR Review Comment: https://git.openjdk.org/jdk/pull/17508#discussion_r1955832072 PR Review Comment: https://git.openjdk.org/jdk/pull/17508#discussion_r1955837820 PR Review Comment: https://git.openjdk.org/jdk/pull/17508#discussion_r1955849996 From rcastanedalo at openjdk.org Fri Feb 14 10:05:45 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 14 Feb 2025 10:05:45 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v2] In-Reply-To: References: Message-ID: > This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node. This information can be useful e.g. in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). 
Here is an example of a memory subgraph with the extended "Show types" filter enabled: > > ![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717) > > #### Testing > > - tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). > > - Tested IGV manually on a few selected graphs. Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`). Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: Increase property print buffer to avoid truncating dump_spec ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23621/files - new: https://git.openjdk.org/jdk/pull/23621/files/7b3b25d3..53258db4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23621&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23621&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23621.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23621/head:pull/23621 PR: https://git.openjdk.org/jdk/pull/23621 From rcastanedalo at openjdk.org Fri Feb 14 10:11:13 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 14 Feb 2025 10:11:13 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v2] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 10:05:45 GMT, Roberto Casta?eda Lozano wrote: >> This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node. This information can be useful e.g. in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). 
Here is an example of a memory subgraph with the extended "Show types" filter enabled: >> >> ![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717) >> >> #### Testing >> >> - tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). >> >> - Tested IGV manually on a few selected graphs. Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`). > > Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Increase property print buffer to avoid truncating dump_spec @chhagedorn reported offline that the slice of `MergeMem` nodes is sometimes missing (thanks for the report!). This is due to HotSpot truncating the `dump_spec` property value, which can be particularly verbose for wide `MergeMem` nodes. Commit 53258db4 increases the size of the (debug-only) HotSpot buffer from 512 to 2048 characters, which should be sufficient for most practical cases. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2658846714 From dlunden at openjdk.org Fri Feb 14 10:16:09 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 14 Feb 2025 10:16:09 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v2] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 10:05:45 GMT, Roberto Casta?eda Lozano wrote: >> This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node. This information can be useful e.g. in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). 
Here is an example of a memory subgraph with the extended "Show types" filter enabled: >> >> ![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717) >> >> #### Testing >> >> - tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). >> >> - Tested IGV manually on a few selected graphs. Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`). > > Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Increase property print buffer to avoid truncating dump_spec Marked as reviewed by dlunden (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23621#pullrequestreview-2617336508 From chagedorn at openjdk.org Fri Feb 14 10:20:15 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 14 Feb 2025 10:20:15 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v2] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 10:05:45 GMT, Roberto Casta?eda Lozano wrote: >> This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node. This information can be useful e.g. in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). Here is an example of a memory subgraph with the extended "Show types" filter enabled: >> >> ![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717) >> >> #### Testing >> >> - tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). >> >> - Tested IGV manually on a few selected graphs. 
Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`). > > Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Increase property print buffer to avoid truncating dump_spec Looks good and useful! Works as expected. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23621#pullrequestreview-2617348568 From rcastanedalo at openjdk.org Fri Feb 14 10:30:10 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 14 Feb 2025 10:30:10 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v2] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 10:17:53 GMT, Christian Hagedorn wrote: > Looks good and useful! Works as expected. Thanks, Christian! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2658917699 From qamai at openjdk.org Fri Feb 14 10:43:12 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 14 Feb 2025 10:43:12 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v2] In-Reply-To: References: Message-ID: <-be1ozb05LFuRONIw7U7rXKUCXwMNut9yphPDNAK_7g=.32fb0d69-e080-49cc-b518-4bf81e536589@github.com> On Fri, 14 Feb 2025 10:05:45 GMT, Roberto Casta?eda Lozano wrote: >> This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node. This information can be useful e.g. in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). 
Here is an example of a memory subgraph with the extended "Show types" filter enabled: >> >> ![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717) >> >> #### Testing >> >> - tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). >> >> - Tested IGV manually on a few selected graphs. Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`). > > Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Increase property print buffer to avoid truncating dump_spec I assume `mem:6` is the same `6` as `6: int` at `Start`. It would be great if we can also know what kind of `int` we are talking about here (e.g. an `int` at `ArrayList.size` or a bottom `int`). ------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2658947941 From qamai at openjdk.org Fri Feb 14 10:48:09 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 14 Feb 2025 10:48:09 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v2] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 10:05:45 GMT, Roberto Casta?eda Lozano wrote: >> This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node. This information can be useful e.g. in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). Here is an example of a memory subgraph with the extended "Show types" filter enabled: >> >> ![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717) >> >> #### Testing >> >> - tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). 
>> >> - Tested IGV manually on a few selected graphs. Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`). > > Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Increase property print buffer to avoid truncating dump_spec It also looks strange to me that the input of an `int` memory node is a `raw` memory node, and the input of an `Object*` memory node is the aforementioned `int` memory node, or a `bot` memory node merges into a `raw` node. It seems I am really uninformed here ~~ ------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2658958065 From rcastanedalo at openjdk.org Fri Feb 14 12:16:12 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 14 Feb 2025 12:16:12 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v2] In-Reply-To: <-be1ozb05LFuRONIw7U7rXKUCXwMNut9yphPDNAK_7g=.32fb0d69-e080-49cc-b518-4bf81e536589@github.com> References: <-be1ozb05LFuRONIw7U7rXKUCXwMNut9yphPDNAK_7g=.32fb0d69-e080-49cc-b518-4bf81e536589@github.com> Message-ID: <5bkom_j0vx66NDG59WoTYLCKksYcByYzlAcfAQcX5lg=.b7bc3734-0ce8-4a07-995a-163d7896b9f6@github.com> On Fri, 14 Feb 2025 10:41:00 GMT, Quan Anh Mai wrote: > I assume `mem:6` is the same `6` as `6: int` at `Start`. No, `6` in `mem:6` is the index of the memory slice defined by `94 StoreI`, whereas `6:int` in `3 Start` indicates that the element number 6 of the tuple defined by `3 Start` is of type `int`. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2659178115 From thartmann at openjdk.org Fri Feb 14 12:18:10 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 14 Feb 2025 12:18:10 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:45:34 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Improve explanation of logic There is at least one additional bug here. Running this test fails even with the fix: /* * Copyright (c) 2025, Oracle and/or its affiliates. All rights reserved. * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER. 
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
 * or visit www.oracle.com if you need additional information or have any
 * questions.
 */

// Run with java -Xbatch -XX:-TieredCompilation TestShort.java

public class TestShort {
    public static void test() {
        short[] vals = new short[1024];
        short[] results = new short[1024];
        for (int i = 0; i < 1024; ++i) {
            vals[i] = 12;
        }
        for (int i = 0; i < 1024; ++i) {
            results[i] = (short)Integer.numberOfLeadingZeros(vals[i]);
        }
        for (int i = 0; i < 1024; ++i) {
            if (results[i] != 28) throw new RuntimeException("Wrong result");
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10_000; ++i) {
            test();
        }
    }
}

I found this with another test that I wrote to perform an exhaustive search over all the possible inputs with <= int-size. Will share that one later.
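For readers following the thread, the rounding behaviour that the quoted PR description attributes to the int-to-float conversion can be checked in scalar Java. The following is only a minimal sketch under stated assumptions: the class and method names (`LzcntViaFloat`, `naive`, `masked`) are hypothetical, and the particular mask (clearing the low 8 bits of inputs that are at least 2^24) is an illustrative choice, not necessarily what the patch does in the AVX2 code.

```java
public class LzcntViaFloat {
    // Hypothetical scalar emulation: derive the leading-zero count from the
    // unbiased exponent of the int converted to float.
    static int naive(int x) {
        if (x == 0) return 32;   // no set bit at all
        if (x < 0) return 0;     // sign bit set
        return 31 - Math.getExponent((float) x);
    }

    // Assumed fix for illustration: for inputs >= 2^24, clear the low 8 bits
    // so at most 24 significant bits remain and the conversion to float is
    // exact, i.e. it cannot round up across a power-of-two boundary.
    static int masked(int x) {
        if (x >= (1 << 24)) {
            x &= 0xFFFFFF00;
        }
        return naive(x);
    }

    public static void main(String[] args) {
        int a = 0x01FFFFFF;
        int b = 0x02000000;
        System.out.println((float) a == (float) b);          // both convert to 2^25
        System.out.println(Integer.numberOfLeadingZeros(a)); // reference result
        System.out.println(naive(a));                        // off by one
        System.out.println(masked(a));                       // matches the reference
    }
}
```

With the example from the PR description, `naive(0x01FFFFFF)` yields 6 while `Integer.numberOfLeadingZeros(0x01FFFFFF)` is 7; the masked variant agrees with the reference. This sketch does not model the short and byte vector cases reported above.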
------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659179429 From rcastanedalo at openjdk.org Fri Feb 14 12:22:09 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 14 Feb 2025 12:22:09 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v2] In-Reply-To: <-be1ozb05LFuRONIw7U7rXKUCXwMNut9yphPDNAK_7g=.32fb0d69-e080-49cc-b518-4bf81e536589@github.com> References: <-be1ozb05LFuRONIw7U7rXKUCXwMNut9yphPDNAK_7g=.32fb0d69-e080-49cc-b518-4bf81e536589@github.com> Message-ID: On Fri, 14 Feb 2025 10:41:00 GMT, Quan Anh Mai wrote: > It would be great if we can also know what kind of int we are talking about here (e.g. an int at ArrayList.size or a bottom int). IGV presents node types as known by C2: either phase types if available, or bottom types, or both if they differ. See the description and discussion in https://github.com/openjdk/jdk/pull/15881 for more details. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2659194658 From rcastanedalo at openjdk.org Fri Feb 14 12:28:15 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 14 Feb 2025 12:28:15 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v2] In-Reply-To: <5bkom_j0vx66NDG59WoTYLCKksYcByYzlAcfAQcX5lg=.b7bc3734-0ce8-4a07-995a-163d7896b9f6@github.com> References: <-be1ozb05LFuRONIw7U7rXKUCXwMNut9yphPDNAK_7g=.32fb0d69-e080-49cc-b518-4bf81e536589@github.com> <5bkom_j0vx66NDG59WoTYLCKksYcByYzlAcfAQcX5lg=.b7bc3734-0ce8-4a07-995a-163d7896b9f6@github.com> Message-ID: On Fri, 14 Feb 2025 12:13:40 GMT, Roberto Castañeda Lozano wrote: > > I assume `mem:6` is the same `6` as `6: int` at `Start`. > > No, `6` in `mem:6` is the index of the memory slice defined by `94 StoreI`, whereas `6:int` in `3 Start` indicates that the element number 6 of the tuple defined by `3 Start` is of type `int`.
@merykitty would it be less confusing if the memory node type was presented as e.g. `mem: idx=6` instead of just `mem: 6`? We try to keep IGV node labels as compact as possible because wide nodes kill graph readability, but adding `idx=` would not hurt much if it makes it easier to interpret the type information. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2659202245 From qamai at openjdk.org Fri Feb 14 12:28:15 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 14 Feb 2025 12:28:15 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v2] In-Reply-To: References: <-be1ozb05LFuRONIw7U7rXKUCXwMNut9yphPDNAK_7g=.32fb0d69-e080-49cc-b518-4bf81e536589@github.com> <5bkom_j0vx66NDG59WoTYLCKksYcByYzlAcfAQcX5lg=.b7bc3734-0ce8-4a07-995a-163d7896b9f6@github.com> Message-ID: <6vOVgg68hOmaror886Xzfv-6ZMBANTo1nPjymRid5P0=.13a79806-e160-4a43-b94f-9f42435d6ef7@github.com> On Fri, 14 Feb 2025 12:23:06 GMT, Roberto Casta?eda Lozano wrote: >>> I assume `mem:6` is the same `6` as `6: int` at `Start`. >> >> No, `6` in `mem:6` is the index of the memory slice defined by `94 StoreI`, whereas `6:int` in `3 Start` indicates that the element number 6 of the tuple defined by `3 Start` is of type `int`. > >> > I assume `mem:6` is the same `6` as `6: int` at `Start`. >> >> No, `6` in `mem:6` is the index of the memory slice defined by `94 StoreI`, whereas `6:int` in `3 Start` indicates that the element number 6 of the tuple defined by `3 Start` is of type `int`. > > @merykitty would it be less confusing if the memory node type was presented as e.g. `mem: idx=6` instead of just `mem: 6`? We try to keep IGV node labels as compact as possible because wide nodes kill graph readability, but adding `idx=` would not hurt much if it makes it easier to interpret the type information. @robcasloz I think adding `idx=` would not help, I can instinctively understand that `6` here is the index of the memory slice. 
What I can't see from the IGV is what that slice is. It may be helpful if we can dump all the memory slice details somewhere. For example at the `Root` node or the `Start` node. What do you think? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2659207655 From thartmann at openjdk.org Fri Feb 14 12:35:11 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 14 Feb 2025 12:35:11 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:45:34 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! 
> > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Improve explanation of logic

And a similar issue with byte-size vectors:

// Run with java -Xbatch -XX:-TieredCompilation TestByte.java

public class TestByte {
    public static void test() {
        byte[] vals = new byte[1024];
        byte[] results = new byte[1024];
        for (int i = 0; i < 1024; ++i) {
            results[i] = (byte)Integer.numberOfLeadingZeros(vals[i]);
        }
        for (int i = 0; i < 1024; ++i) {
            if (results[i] != 32) throw new RuntimeException("Wrong result");
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10_000; ++i) {
            test();
        }
    }
}

------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659220781 From rcastanedalo at openjdk.org Fri Feb 14 12:40:10 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 14 Feb 2025 12:40:10 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v2] In-Reply-To: References: <-be1ozb05LFuRONIw7U7rXKUCXwMNut9yphPDNAK_7g=.32fb0d69-e080-49cc-b518-4bf81e536589@github.com> <5bkom_j0vx66NDG59WoTYLCKksYcByYzlAcfAQcX5lg=.b7bc3734-0ce8-4a07-995a-163d7896b9f6@github.com> Message-ID: On Fri, 14 Feb 2025 12:23:06 GMT, Roberto Castañeda Lozano wrote: >>> I assume `mem:6` is the same `6` as `6: int` at `Start`. >> >> No, `6` in `mem:6` is the index of the memory slice defined by `94 StoreI`, whereas `6:int` in `3 Start` indicates that the element number 6 of the tuple defined by `3 Start` is of type `int`. > >> > I assume `mem:6` is the same `6` as `6: int` at `Start`. >> >> No, `6` in `mem:6` is the index of the memory slice defined by `94 StoreI`, whereas `6:int` in `3 Start` indicates that the element number 6 of the tuple defined by `3 Start` is of type `int`. > > @merykitty would it be less confusing if the memory node type was presented as e.g. `mem: idx=6` instead of just `mem: 6`?
We try to keep IGV node labels as compact as possible because wide nodes kill graph readability, but adding `idx=` would not hurt much if it makes it easier to interpret the type information. > @robcasloz I think adding `idx=` would not help, I can instinctively understand that `6` here is the index of the memory slice. OK, great. > What I can't see from the IGV is what that slice is. It may be helpful if we can dump all the memory slice details somewhere. For example at the `Root` node or the `Start` node. What do you think? What memory slice details do you mean? Could you perhaps give an example of what additional information you would like to see? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2659232163 From qamai at openjdk.org Fri Feb 14 12:51:10 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 14 Feb 2025 12:51:10 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v2] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 10:05:45 GMT, Roberto Casta?eda Lozano wrote: >> This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node. This information can be useful e.g. in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). Here is an example of a memory subgraph with the extended "Show types" filter enabled: >> >> ![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717) >> >> #### Testing >> >> - tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). >> >> - Tested IGV manually on a few selected graphs. Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`). 
> > Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Increase property print buffer to avoid truncating dump_spec In `MemNode::dump_adr_type`, we have this: ciField* field = atp->field(); if (field) { st->print(", name="); field->print_name_on(st); } st->print(", idx=%d;", atp->index()); By details, I mean the things like `field->print_name_on(st)` here. It would be good if we can know all the details about the memory slice from looking at the IGV, we can have it being an option that can be toggled if you are afraid that it would be too long. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2659251312 From qamai at openjdk.org Fri Feb 14 12:59:13 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 14 Feb 2025 12:59:13 GMT Subject: RFR: 8315066: Add unsigned bounds and known bits to TypeInt/Long [v44] In-Reply-To: References: Message-ID: > Hi, > > This patch adds unsigned bounds and known bits constraints to TypeInt and TypeLong. This opens more transformation opportunities in an elegant manner as well as helps avoid some ad-hoc rules in Hotspot. > > In general, a `TypeInt/Long` represents a set of values `x` that satisfies: `x s>= lo && x s<= hi && x u>= ulo && x u<= uhi && (x & zeros) == 0 && (x & ones) == ones`. These constraints are not independent, e.g. an int that lies in [0, 3] in signed domain must also lie in [0, 3] in unsigned domain and have all bits but the last 2 being unset. As a result, we must canonicalize the constraints (tighten the constraints so that they are optimal) before constructing a `TypeInt/Long` instance. > > This is extracted from #15440 , node value transformations are left for later PRs. I have also added unit tests to verify the soundness of constraint normalization. > > Please kindly review, thanks a lot. 
> > Testing > > - [x] GHA > - [x] Linux x64, tier 1-4 Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: refine comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17508/files - new: https://git.openjdk.org/jdk/pull/17508/files/3cd25862..55860366 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17508&range=43 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17508&range=42-43 Stats: 64 lines in 1 file changed: 12 ins; 12 del; 40 mod Patch: https://git.openjdk.org/jdk/pull/17508.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17508/head:pull/17508 PR: https://git.openjdk.org/jdk/pull/17508 From qamai at openjdk.org Fri Feb 14 12:59:16 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 14 Feb 2025 12:59:16 GMT Subject: RFR: 8315066: Add unsigned bounds and known bits to TypeInt/Long [v43] In-Reply-To: References: Message-ID: <028cerW0RlYNoZCu5PswzzCYlPVbeIhPuxyQsy8t_FY=.6d3046b8-1b82-467a-8eae-c96601933b56@github.com> On Fri, 14 Feb 2025 09:20:10 GMT, Emanuel Peter wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 56 commits: >> >> - Merge branch 'master' into unsignedbounds >> - Merge branch 'master' into unsignedbounds >> - harden SimpleCanonicalResult >> - number lemmas >> - include >> - clean up intn_t >> - refine first_violation >> - assignment operator >> - exhaustive tests >> - make con >> - ... and 46 more: https://git.openjdk.org/jdk/compare/c2fc9478...3cd25862 > > src/hotspot/share/opto/rangeinference.cpp line 158: > >> 156: bits at x > i have lower significance, and are thus irrelevant >> 157: >> 158: As established above, the first (i + 1) bits of v satisfy bits. > > Two things: > - I don't see immediately how it is true > - And we are not trying to prove that v satisfies bits here. But we should prove that somewhere, just not under section `a. 
Firstly, we prove that r <= v:`, right? The layout here may be a little confusing, I have added more detailed section label. In this part we are proving that `r <= v` by proving that `v > lo` (a.1) and `v` satisfies `bits` (a.2). > src/hotspot/share/opto/rangeinference.cpp line 166: > >> 164: means r <= v since r is the smallest such value. >> 165: >> 166: b. Secondly, we prove that r == v. Suppose r < v: > > Suggestion: > > b. Secondly, we prove that r >= v. Suppose r < v: Doing it like this avoids having to mention `r <= v` again further below. We just say that since we know `r <= v`, the contradiction of `r == v` would be `r < v`. > src/hotspot/share/opto/rangeinference.cpp line 184: > >> 182: r[j] == 0 >> 183: lo[j] == 1 >> 184: r[x] == lo[x], for x < j > > Suggestion: > > r[j] == 0 (according to 2.b.1) > lo[j] == 1 (lo[j] == v[j] == 1, according to 2.4 because j < i, and according to 2.b.2) > r[x] == lo[x], for x < j (according to 2.4 because j < i) > > Then you could even drop the other list above, right? Yes you are right, thanks a lot for your suggestions, I have done as you suggested and improved the format a little bit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17508#discussion_r1956103750 PR Review Comment: https://git.openjdk.org/jdk/pull/17508#discussion_r1956105088 PR Review Comment: https://git.openjdk.org/jdk/pull/17508#discussion_r1956107220 From epeter at openjdk.org Fri Feb 14 13:04:16 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 14 Feb 2025 13:04:16 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:45:34 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. 
Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Improve explanation of logic test/hotspot/jtreg/compiler/vectorization/TestNumberOfContinuousZeros.java line 27: > 25: /** > 26: * @test > 27: * @bug 8297172 8331993 8349637 Drive-by comment. The `@requires` below are really a shame. This test now is not run in other places, so we would not catch issues elsewhere. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1956114269 From chagedorn at openjdk.org Fri Feb 14 13:23:09 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 14 Feb 2025 13:23:09 GMT Subject: RFR: 8349858: Print compilation task before blocking compiler thread for shutdown In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 12:56:17 GMT, Aleksey Shipilev wrote: > JIT compilers in current Hotspot are compiling the code while being in native state. So if there is a running compilation, it does not block shutdown naturally. The shutdown code has cooperative mechanism to coordinate shutdown of compiler threads. 
Shutdown code sets the `CompilerBroker::should_block`, and compilers are regularly checking it with `CompilerBroker::maybe_block`. When shutdown is pending, the running compiler threads would eventually hit that `maybe_block`, block at transition to VM state, and that would allow shutdown to proceed. > > One of the problems with this mechanism is observability: if compiler thread was running a long-running compilation, nothing would be written in the compilation logs about it. The compilation would just -- poof! -- disappear without a trace. This is arguably against the user expectation: we print _something_ whether the compilation succeeded or failed. > > This kind of shutdown-during-heavy-compilation regularly happens in short runs in Leyden benchmarks. It made me scratch my head for quite a while before I understood where the compilation task went. I would like to add some sort of diagnostics for these cases. > > Example `-XX:+PrintCompilation` output in Leyden after the patch (includes richer compile-task timings): > > > ... > 430 W3.4 Q2.7 C0.3 4397 com.sun.tools.javac.comp.Check::checkProfile (40 bytes) > 447 W0.0 Q0.0 C10.3 4398 java.util.StringJoiner::toString (53 bytes) > 456 W0.0 Q10.4 C9.7 4399 java.lang.System$1::join (11 bytes) > Generated source code for 51 classes and compiled them in 403 ms (1 iterations) > 476 W36.6 Q11.6 C72.1 4393 com.sun.tools.javac.jvm.PoolWriter$WriteablePoolHelper::writeConstant (843 bytes) blocked > 481 W0.0 Q0.0 C157.6 4390 com.sun.tools.javac.comp.TransTypes::visitIdent (129 bytes) blocked > > > > Additional testing: > - [x] Linux x86_64 server fastdebug, `all` Looks good to me, too. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23586#pullrequestreview-2617784470 From epeter at openjdk.org Fri Feb 14 13:29:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 14 Feb 2025 13:29:11 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 12:32:46 GMT, Tobias Hartmann wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Improve explanation of logic > > And a similar issue with byte-size vectors: > > > // Run with java -Xbatch -XX:-TieredCompilation TestByte.java > > public class TestByte { > > public static void test() { > byte[] vals = new byte[1024]; > byte[] results = new byte[1024]; > for (int i = 0; i < 1024; ++i) { > results[i] = (byte)Integer.numberOfLeadingZeros(vals[i]); > } > for (int i = 0; i < 1024; ++i) { > if (results[i] != 32) throw new RuntimeException("Wrong result"); > } > } > > public static void main(String[] args) { > for (int i = 0; i < 10_000; ++i) { > test(); > } > } > } @TobiHartmann > I also noticed that the code does not vectorize without the (short) cast, i.e., when using an integer result array. 
Right, this currently is an auto-vectorizer limitation, that @jaskarth is working on here: https://github.com/openjdk/jdk/pull/23413 Example: public class TestBI { public static void test() { byte[] vals = new byte[1024]; int[] results = new int[1024]; for (int i = 0; i < 1024; ++i) { results[i] = Integer.numberOfLeadingZeros(vals[i]); } for (int i = 0; i < 1024; ++i) { if (results[i] != 32) throw new RuntimeException("Wrong result"); } } public static void main(String[] args) { for (int i = 0; i < 10_000; ++i) { test(); } } } Run with: `java -Xbatch -XX:-TieredCompilation -XX:UseAVX=2 -XX:CompileCommand=compileonly,TestBI::test -XX:+TraceNewVectors -XX:CompileCommand=printassembly,TestBI::testx -XX:CompileCommand=TraceAutoVectorization,TestBI::test,ALL TestBI.java` You see that auto-vectorization rejects the packs, because there is no cast from the `byte` to `int` packs: WARNING: Removed pack: not profitable: 0: 672 LoadB === 708 57 673 [[ 671 ]] @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=6; #byte !orig=568,438,184 !jvms: TestBI::test @ bci:25 (line 9) 1: 686 LoadB === 708 57 687 [[ 685 ]] @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=6; #byte !orig=579,184 !jvms: TestBI::test @ bci:25 (line 9) 2: 695 LoadB === 708 57 696 [[ 694 ]] @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=6; #byte !orig=438,184 !jvms: TestBI::test @ bci:25 (line 9) 3: 692 LoadB === 708 57 693 [[ 691 ]] @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=6; #byte !orig=184 !jvms: TestBI::test @ bci:25 (line 9) 4: 568 LoadB === 708 57 569 [[ 567 ]] @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=6; #byte !orig=438,184 !jvms: TestBI::test @ bci:25 (line 9) 5: 579 LoadB === 708 57 580 [[ 578 ]] @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=6; #byte !orig=184 !jvms: TestBI::test @ bci:25 (line 9) 6: 438 LoadB === 708 57 439 [[ 437 
]] @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=6; #byte !orig=184 !jvms: TestBI::test @ bci:25 (line 9) 7: 184 LoadB === 708 57 182 [[ 185 ]] @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=6; #byte !jvms: TestBI::test @ bci:25 (line 9) WARNING: Removed pack: not profitable: 0: 671 CountLeadingZerosI === _ 672 [[ 669 ]] !orig=567,437,185 !jvms: TestBI::test @ bci:26 (line 9) 1: 685 CountLeadingZerosI === _ 686 [[ 668 ]] !orig=578,185 !jvms: TestBI::test @ bci:26 (line 9) 2: 694 CountLeadingZerosI === _ 695 [[ 667 ]] !orig=437,185 !jvms: TestBI::test @ bci:26 (line 9) 3: 691 CountLeadingZerosI === _ 692 [[ 666 ]] !orig=185 !jvms: TestBI::test @ bci:26 (line 9) 4: 567 CountLeadingZerosI === _ 568 [[ 565 ]] !orig=437,185 !jvms: TestBI::test @ bci:26 (line 9) 5: 578 CountLeadingZerosI === _ 579 [[ 564 ]] !orig=185 !jvms: TestBI::test @ bci:26 (line 9) 6: 437 CountLeadingZerosI === _ 438 [[ 435 ]] !orig=185 !jvms: TestBI::test @ bci:26 (line 9) 7: 185 CountLeadingZerosI === _ 184 [[ 204 ]] !jvms: TestBI::test @ bci:26 (line 9) WARNING: Removed pack: not profitable: 0: 669 StoreI === 708 709 670 671 [[ 668 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=7; Memory: @int[int:1024] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; !orig=565,435,204,[263],[272] !jvms: TestBI::test @ bci:29 (line 9) 1: 668 StoreI === 708 669 680 685 [[ 667 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=7; Memory: @int[int:1024] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; !orig=564,204,[263],[272] !jvms: TestBI::test @ bci:29 (line 9) 2: 667 StoreI === 708 668 677 694 [[ 666 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=7; Memory: @int[int:1024] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; !orig=435,204,[263],[272] !jvms: TestBI::test @ bci:29 (line 9) 3: 666 StoreI === 708 
667 674 691 [[ 565 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=7; Memory: @int[int:1024] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; !orig=204,[263],[272] !jvms: TestBI::test @ bci:29 (line 9) 4: 565 StoreI === 708 666 566 567 [[ 564 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=7; Memory: @int[int:1024] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; !orig=435,204,[263],[272] !jvms: TestBI::test @ bci:29 (line 9) 5: 564 StoreI === 708 565 570 578 [[ 435 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=7; Memory: @int[int:1024] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; !orig=204,[263],[272] !jvms: TestBI::test @ bci:29 (line 9) 6: 435 StoreI === 708 564 436 437 [[ 204 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=7; Memory: @int[int:1024] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; !orig=204,[263],[272] !jvms: TestBI::test @ bci:29 (line 9) 7: 204 StoreI === 708 435 202 185 [[ 709 207 420 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=7; Memory: @int[int:1024] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; !orig=[263],[272] !jvms: TestBI::test @ bci:29 (line 9) So we would have to add in a `VectorCast` between the packs of `LoadB` and `CountLeadingZerosI`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659337152 From thartmann at openjdk.org Fri Feb 14 13:47:11 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 14 Feb 2025 13:47:11 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:45:34 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. 
Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Improve explanation of logic Ah, right, makes sense. Thanks for checking, Emanuel. And nice to see that limitation being addressed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659377300 From epeter at openjdk.org Fri Feb 14 13:47:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 14 Feb 2025 13:47:11 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 13:44:40 GMT, Tobias Hartmann wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Improve explanation of logic > > Ah, right, makes sense. Thanks for checking, Emanuel. And nice to see that limitation being addressed. I looked at @TobiHartmann Examples. I rewrote them a little, to make things more apparent. 

    public class TestByte {
        static byte[] vals = new byte[1024];
        static byte[] results = new byte[1024];

        public static void test() {
            for (int i = 0; i < 1024; ++i) {
                results[i] = (byte)Integer.numberOfLeadingZeros(vals[i]);
            }
        }

        public static void main(String[] args) {
            for (int j = 0; j < 10_000; ++j) {
                test();
                for (int i = 0; i < 1024; ++i) {
                    if (results[i] != 32) throw new RuntimeException("Wrong result: " + results[i] + " at " + i);
                }
            }
        }
    }

Running it like this:

    java -Xbatch -XX:-TieredCompilation -XX:UseAVX=2 -XX:CompileCommand=compileonly,TestByte::test -XX:+TraceNewVectors TestByte.java
    CompileCommand: compileonly TestByte.test bool compileonly = true
    TraceNewVectors [AutoVectorization]: 1190 LoadVector === 943 1080 1055 [[ ]] @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched #vectory (does not depend only on test, unknown control)
    TraceNewVectors [AutoVectorization]: 1191 CountLeadingZerosV === _ 1190 [[ ]] #vectory
    TraceNewVectors [AutoVectorization]: 1192 StoreVector === 1079 1080 1021 1191 [[ ]] @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7;
    Exception in thread "main" java.lang.RuntimeException: Wrong result: 8 at 24
        at TestByte.main(TestByte.java:18)

See the error message: `Wrong result: 8 at 24`, we get `8` instead of the expected `32`.

We see that we have a `LoadVector` of `bytes` going directly into `CountLeadingZerosV` for `bytes`, and the result is stored as `bytes` with `StoreVector`. But this is not ok, because `CountLeadingZerosV` for `bytes` will have a different result: it only operates on `8 bits`, and so if it sees a `zero = 0`, it sees only `8` leading zeros. But we called `Integer.numberOfLeadingZeros`, which expects to operate on `32 bit` integers, where a zero has `32` leading zeros.
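The mismatch between the two widths can be reproduced without any vectorization. The sketch below is a scalar emulation only: `LaneWidthSketch` and `leadingZerosInLane` are hypothetical names introduced here, and the helper merely models an operation that looks at the low `laneBits` bits of a value. It is not how `CountLeadingZerosV` is implemented.

```java
public class LaneWidthSketch {
    // Hypothetical model: leading zeros as counted by an operation that only
    // sees the low laneBits bits of the value (8 for byte lanes, 16 for short lanes).
    static int leadingZerosInLane(int value, int laneBits) {
        int mask = (laneBits == 32) ? -1 : (1 << laneBits) - 1;
        int lane = value & mask;
        // Integer.numberOfLeadingZeros counts over 32 bits, so subtract the
        // (32 - laneBits) zero bits that lie above the lane.
        return Integer.numberOfLeadingZeros(lane) - (32 - laneBits);
    }

    public static void main(String[] args) {
        byte b = 0;
        // What the Java source asks for: b is widened to int, then counted over 32 bits.
        System.out.println(Integer.numberOfLeadingZeros(b));
        // What a count restricted to an 8-bit lane would report for the same input.
        System.out.println(leadingZerosInLane(b, 8));
    }
}
```

For a zero input the first call yields 32 and the 8-bit model yields 8, matching the `Wrong result: 8` in the exception above.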
----------------------------------------------

Let's do the same with `shorts`:

    public class TestShort {
        static short[] vals = new short[1024];
        static short[] results = new short[1024];

        public static void test() {
            for (int i = 0; i < 1024; ++i) {
                results[i] = (short)Integer.numberOfLeadingZeros(vals[i]);
            }
        }

        public static void main(String[] args) {
            for (int j = 0; j < 10_000; ++j) {
                test();
                for (int i = 0; i < 1024; ++i) {
                    if (results[i] != 32) throw new RuntimeException("Wrong result " + results[i] + " at " + i);
                }
            }
        }
    }

And we get:

    java -Xbatch -XX:-TieredCompilation -XX:UseAVX=2 -XX:CompileCommand=compileonly,TestShort::test -XX:+TraceNewVectors TestShort.java
    CompileCommand: compileonly TestShort.test bool compileonly = true
    TraceNewVectors [AutoVectorization]: 980 LoadVector === 813 885 864 [[ ]] @short[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched #vectory (does not depend only on test, unknown control)
    TraceNewVectors [AutoVectorization]: 981 CountLeadingZerosV === _ 980 [[ ]] #vectory
    TraceNewVectors [AutoVectorization]: 982 StoreVector === 882 885 866 981 [[ ]] @short[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched Memory: @short[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7;
    Exception in thread "main" java.lang.RuntimeException: Wrong result 16 at 12
        at TestShort.main(TestShort.java:41)

Here we get `16` leading zeros because a `short` has `16` bits, rather than the expected `32` from `Integer.numberOfLeadingZeros`. Again, this is because `CountLeadingZerosV` for `short` operates on `shorts` and not `ints`.

How did we not catch this with testing? I mean, this is so trivial to catch with any vectorizing example.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659378298 From jkarthikeyan at openjdk.org Fri Feb 14 13:56:12 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 14 Feb 2025 13:56:12 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: <-KiyVSvnA7n51DtqF8qGdw8Uu-RUSironJ9DK0_k1so=.6a87df42-1c87-4c5b-9fe8-2e5ad914cc6f@github.com> On Fri, 14 Feb 2025 13:01:15 GMT, Emanuel Peter wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Improve explanation of logic > > test/hotspot/jtreg/compiler/vectorization/TestNumberOfContinuousZeros.java line 27: > >> 25: /** >> 26: * @test >> 27: * @bug 8297172 8331993 8349637 > > Drive-by comment. > The `@requires` below are really a shame. This test now is not run in other places, so we would not catch issues elsewhere. I was thinking about refactoring the test with this in mind and using the `Generators` api for creating random values, but I thought that if we were planning on backporting this to older releases (since this bug affects 21 and 24 too) it might be better to do it separately to keep the diff small. What do you think? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23579#discussion_r1956184917 From shade at openjdk.org Fri Feb 14 13:56:14 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 14 Feb 2025 13:56:14 GMT Subject: RFR: 8349858: Print compilation task before blocking compiler thread for shutdown In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 12:56:17 GMT, Aleksey Shipilev wrote: > JIT compilers in current Hotspot are compiling the code while being in native state. So if there is a running compilation, it does not block shutdown naturally. The shutdown code has cooperative mechanism to coordinate shutdown of compiler threads. 
Shutdown code sets the `CompilerBroker::should_block`, and compilers are regularly checking it with `CompilerBroker::maybe_block`. When shutdown is pending, the running compiler threads would eventually hit that `maybe_block`, block at transition to VM state, and that would allow shutdown to proceed. > > One of the problems with this mechanism is observability: if compiler thread was running a long-running compilation, nothing would be written in the compilation logs about it. The compilation would just -- poof! -- disappear without a trace. This is arguably against the user expectation: we print _something_ whether the compilation succeeded or failed. > > This kind of shutdown-during-heavy-compilation regularly happens in short runs in Leyden benchmarks. It made me scratch my head for quite a while before I understood where the compilation task went. I would like to add some sort of diagnostics for these cases. > > Example `-XX:+PrintCompilation` output in Leyden after the patch (includes richer compile-task timings): > > > ... > 430 W3.4 Q2.7 C0.3 4397 com.sun.tools.javac.comp.Check::checkProfile (40 bytes) > 447 W0.0 Q0.0 C10.3 4398 java.util.StringJoiner::toString (53 bytes) > 456 W0.0 Q10.4 C9.7 4399 java.lang.System$1::join (11 bytes) > Generated source code for 51 classes and compiled them in 403 ms (1 iterations) > 476 W36.6 Q11.6 C72.1 4393 com.sun.tools.javac.jvm.PoolWriter$WriteablePoolHelper::writeConstant (843 bytes) blocked > 481 W0.0 Q0.0 C157.6 4390 com.sun.tools.javac.comp.TransTypes::visitIdent (129 bytes) blocked > > > > Additional testing: > - [x] Linux x86_64 server fastdebug, `all` Thanks! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23586#issuecomment-2659393953 From shade at openjdk.org Fri Feb 14 13:56:15 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 14 Feb 2025 13:56:15 GMT Subject: Integrated: 8349858: Print compilation task before blocking compiler thread for shutdown In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 12:56:17 GMT, Aleksey Shipilev wrote: > JIT compilers in current Hotspot are compiling the code while being in native state. So if there is a running compilation, it does not block shutdown naturally. The shutdown code has cooperative mechanism to coordinate shutdown of compiler threads. Shutdown code sets the `CompilerBroker::should_block`, and compilers are regularly checking it with `CompilerBroker::maybe_block`. When shutdown is pending, the running compiler threads would eventually hit that `maybe_block`, block at transition to VM state, and that would allow shutdown to proceed. > > One of the problems with this mechanism is observability: if compiler thread was running a long-running compilation, nothing would be written in the compilation logs about it. The compilation would just -- poof! -- disappear without a trace. This is arguably against the user expectation: we print _something_ whether the compilation succeeded or failed. > > This kind of shutdown-during-heavy-compilation regularly happens in short runs in Leyden benchmarks. It made me scratch my head for quite a while before I understood where the compilation task went. I would like to add some sort of diagnostics for these cases. > > Example `-XX:+PrintCompilation` output in Leyden after the patch (includes richer compile-task timings): > > > ... 
> 430 W3.4 Q2.7 C0.3 4397 com.sun.tools.javac.comp.Check::checkProfile (40 bytes) > 447 W0.0 Q0.0 C10.3 4398 java.util.StringJoiner::toString (53 bytes) > 456 W0.0 Q10.4 C9.7 4399 java.lang.System$1::join (11 bytes) > Generated source code for 51 classes and compiled them in 403 ms (1 iterations) > 476 W36.6 Q11.6 C72.1 4393 com.sun.tools.javac.jvm.PoolWriter$WriteablePoolHelper::writeConstant (843 bytes) blocked > 481 W0.0 Q0.0 C157.6 4390 com.sun.tools.javac.comp.TransTypes::visitIdent (129 bytes) blocked > > > > Additional testing: > - [x] Linux x86_64 server fastdebug, `all` This pull request has now been integrated. Changeset: 742e735d Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/742e735d7f6c4ee9ca5a4d290c59d7d6ec1f7635 Stats: 14 lines in 1 file changed: 14 ins; 0 del; 0 mod 8349858: Print compilation task before blocking compiler thread for shutdown Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/23586 From epeter at openjdk.org Fri Feb 14 14:04:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 14 Feb 2025 14:04:11 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:45:34 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. 
>> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Improve explanation of logic @jaskarth Do you also have a test for the VectorAPI? Or is that not affected? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659412909 From qamai at openjdk.org Fri Feb 14 14:09:11 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 14 Feb 2025 14:09:11 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:45:34 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. 
I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Improve explanation of logic I am fine with the patch. @eme64 might want more detailed proofs, so it would be great if you can be more rigorous, though :) This seems to be a separate issue from the vectorization with short and byte. ------------- Marked as reviewed by qamai (Committer). PR Review: https://git.openjdk.org/jdk/pull/23579#pullrequestreview-2617891247 From thartmann at openjdk.org Fri Feb 14 14:09:11 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 14 Feb 2025 14:09:11 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:45:34 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. 
I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Improve explanation of logic Nice analysis, Emanuel. Here's my test: https://github.com/openjdk/jdk/commit/7fd87d3013d23e151c98e451c6cd07bf55b9507b @jaskarth could you please add something similar to this PR? At least for the integer / long cases. I'm fine with fixing the short/byte issue separately. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659423700 From epeter at openjdk.org Fri Feb 14 14:12:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 14 Feb 2025 14:12:12 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 14:01:05 GMT, Emanuel Peter wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Improve explanation of logic > > @jaskarth Do you also have a test for the VectorAPI? Or is that not affected? > I am fine with the patch. @eme64 might want more detailed proofs, so it would be great if you can be more rigorous, though :) This seems to be a separate issue from the vectorization with short and byte. Yes, the truncation issue with `short` and `byte` is a separate issue. It is an issue with `SuperWord` / AutoVectorization, and not with the backend. 
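For contrast, the backend issue from the PR summary quoted in this thread (rounding during the int-to-float conversion) can also be reproduced in scalar Java. This is a minimal sketch under stated assumptions: `FloatLzcntSketch`, `naiveLzcnt` and `maskedLzcnt` are hypothetical names, both helpers are only meaningful for `x > 0`, and the mask used here (clearing the low 8 bits once `x >= 2^24`) is one simple choice made for illustration, not necessarily the mask the patch uses.

```java
public class FloatLzcntSketch {
    // Derives the leading zero count from the float exponent. Only meaningful for x > 0.
    // Above 2^24 the conversion may round up across a power of two and give a count
    // that is one too small.
    static int naiveLzcnt(int x) {
        return 31 - Math.getExponent((float) x);
    }

    // Same idea, but clears the low bits of large inputs first so the conversion is
    // exact and cannot round up. The mask is an assumption of this sketch.
    static int maskedLzcnt(int x) {
        int masked = (x >= (1 << 24)) ? (x & ~0xFF) : x;
        return 31 - Math.getExponent((float) masked);
    }

    public static void main(String[] args) {
        int x = 0x01FFFFFF;
        System.out.println((float) x == (float) 0x02000000); // both convert to 2^25
        System.out.println(Integer.numberOfLeadingZeros(x)); // reference result
        System.out.println(naiveLzcnt(x));                   // affected by rounding
        System.out.println(maskedLzcnt(x));                  // agrees with the reference
    }
}
```

With `x = 0x01FFFFFF` the reference count is 7, the unmasked version reports 6 because the float value is already `2^25`, and the masked version reports 7 again.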
------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659433583 From rrich at openjdk.org Fri Feb 14 14:25:12 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 14 Feb 2025 14:25:12 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v2] In-Reply-To: <_jQq0sGaZijMT6Cr3rUdrQLlvVWNuJ8uILHg1qkCxoM=.271ce257-9e72-4d61-87fa-588ce4dbe107@github.com> References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> <_jQq0sGaZijMT6Cr3rUdrQLlvVWNuJ8uILHg1qkCxoM=.271ce257-9e72-4d61-87fa-588ce4dbe107@github.com> Message-ID: On Thu, 13 Feb 2025 21:32:54 GMT, Dean Long wrote: > > The 2nd assert does not fail w/o the deoptimization.cpp fix. Might be due to alignement of caller->sp() in the interpreter. > > Aarch64 also does alignment, and that's why the test uses two different methods, one with an extra local, to hopefully handle both cases of even/odd 2-word (16 byte) alignment. But ppc might be different enough that this isn't enough to trigger the bug. Or maybe the end of frame bound is slightly off? I think you can make the assertion a little stricter like this https://github.com/reinrich/jdk/commit/9c3c8a33a29b9ae6c4c703992b306dc0cbbcd2f0. The test still doesn't fail on ppc64 w/o the fix. This is because the deoptee's caller is always enlarged [here](https://github.com/openjdk/jdk/blob/57f4c30fb6be1da57c8fcc742b5c36d842eef397/src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp#L2840) although it's only necessary if it is the entry frame or compiled. (Reasoning for the stricter assertion: interpreter frames on top of stack have a `frame::top_ijava_frame_abi` just above sp needed for VM calls. When a call is received by the interpreter, it trims the abi of the caller back to `frame::parent_ijava_frame_abi`. An i2c adapter does not do this.)
-------------

PR Comment: https://git.openjdk.org/jdk/pull/23557#issuecomment-2659465735

From mli at openjdk.org Fri Feb 14 14:36:21 2025
From: mli at openjdk.org (Hamlin Li)
Date: Fri, 14 Feb 2025 14:36:21 GMT
Subject: RFR: 8350095: RISC-V: Refactor string_compare
Message-ID:

Hi,
Can you help to review this patch?

Currently, the `string_compare` code is a bit complicated. The main reasons are:
1. It has 2 pieces of code, one each for the LU and UL cases. This is not necessary, as LU and UL behave quite similarly.
2. It mixes the LL/UU and LU/UL cases together. It is better to separate them, as they are quite different from each other.

This is not good for reading and maintaining the code. So, this patch does the following refactoring:
1. Merge the LU and UL code into one, i.e. remove the UL code.
2. Separate the code into 2 methods: LL/UU and LU/UL.
3. Some other miscellaneous improvements.

I could do the following refactoring in a follow-up PR, as in this PR I'm just moving and removing code, which makes it easier to do and to review. In particular the first one, as it needs to rewrite the existing code for the UL/LU case.
1. Move the alignment code of `generate_compare_long_string_different_encoding` upwards to `string_compare_long_LU`.
2. Make the `SHORT_STRING` case simpler.
Thanks ------------- Commit messages: - blank lines - simplify - clean - merge UL and LU - move to functions - move alignment code of LL&UU down from common code path - initial commit Changes: https://git.openjdk.org/jdk/pull/23633/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23633&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350095 Stats: 289 lines in 3 files changed: 149 ins; 125 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/23633.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23633/head:pull/23633 PR: https://git.openjdk.org/jdk/pull/23633 From shade at openjdk.org Fri Feb 14 14:51:11 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 14 Feb 2025 14:51:11 GMT Subject: RFR: 8349915: CTW: Lots of level 3 compiles are done at level 2 after JDK-8348570 [v2] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:50:51 GMT, Aleksey Shipilev wrote: >> Noticed this in manual CTW runs after [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) that lots and lots of methods are compiled at level 2 instead of requested level 3: >> >> >> ... 
>> [97] javax.enterprise.deploy.shared.ActionType::getValue() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getOffset() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getEnumValueTable() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getStringTable() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getActionType(int) WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::toString() WARNING compilation level = 2, but not 3 >> [99] javax.enterprise.deploy.shared.DConfigBeanVersionType >> [98] javax.enterprise.deploy.shared.CommandType::toString() WARNING compilation level = 2, but not 3 >> [98] javax.enterprise.deploy.shared.CommandType::getOffset() WARNING compilation level = 2, but not 3 >> ... >> >> >> I narrowed it down to level downgrade in compilation policy here: >> https://github.com/openjdk/jdk/blob/ed17c55ea34b3b6009dab11d64f21e0b7af3d701/src/hotspot/share/compiler/compilationPolicy.cpp#L677 >> >> [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) enters here, because we mark all methods as having profiles to extend the CTW scope. So now `is_method_profiled(max_method_h)` is `true` and downgrade happens. There is already check for `!Arguments::is_compiler_only()` there, so I think we better exclude CTW from this downgrade as well. >> >> I looked at possibly making this kind of downgrade fatal in CTW runner, but the error propagation there is not simple. I filed [JDK-8349917](https://bugs.openjdk.org/browse/JDK-8349917) if anyone want to take a stab on it. >> >> I looked at other `set_comp_level()` uses in Hotspot, and this is the only place where it is called. So I presume we have caught all places where this downgrade can happen. 
>> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, eyeballing some manual CTW run results >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` passes > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Shortcut CTW tasks directly @eme64, @chhagedorn -- could you take a look as well? There is a recent hole in CTW testing that I want to plug before weekend runs start :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23589#issuecomment-2659524425 From stuefe at openjdk.org Fri Feb 14 14:55:20 2025 From: stuefe at openjdk.org (Thomas Stuefe) Date: Fri, 14 Feb 2025 14:55:20 GMT Subject: RFR: 8350097: Make Compilation::current() and Compile::current() safer Message-ID: <4ELV07PUQEFeOLgzqbV3OoGjHVny5paw0Gk0awuJ3h0=.99faedbd-4909-4d8e-93eb-75d5697e797f@github.com> Somewhat trivial. I recently hunted a bug for an hour until I realized that I had accessed ciEnv::compiler_data() as C2 `Compile` when, in fact, it was C1 `Compilation`. Stupid mistake, but an assert is easy to do and saves time. ------------- Commit messages: - copyrights - start Changes: https://git.openjdk.org/jdk/pull/23635/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23635&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350097 Stats: 33 lines in 6 files changed: 25 ins; 1 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/23635.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23635/head:pull/23635 PR: https://git.openjdk.org/jdk/pull/23635 From epeter at openjdk.org Fri Feb 14 15:16:23 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 14 Feb 2025 15:16:23 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:45:34 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. 
Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Improve explanation of logic @jaskarth I have a Vector API test that should also reproduce the issue, and I used your special values. 

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    public class TestVectorAPI {
        private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;
        static int SIZE = 1024;

        private static final int[] SPECIAL = {
            0x01FFFFFF, 0x03FFFFFE, 0x07FFFFFC, 0x0FFFFFF8,
            0x1FFFFFF0, 0x3FFFFFE0, 0xFFFFFFFF
        };

        static int[] a = new int[SIZE];
        static int[] r = new int[SIZE];
        static int[] gold = new int[SIZE];

        public static void main(String[] args) {
            for (int i = 0; i < a.length; i++) {
                a[i] = SPECIAL[i % SPECIAL.length];
            }

            System.out.println("Compute in interpreter:");
            test(gold);

            System.out.println("Warmup and compute with compiled eventually:");
            for (int i = 0; i < 10_000; i++) {
                test(r);
            }

            System.out.println("Verify:");
            for (int i = 0; i < a.length; i++) {
                if (r[i] != gold[i]) {
                    throw new RuntimeException("Wrong value: " + r[i] + " vs " + gold[i] + " for " + a[i] + " at " + i);
                }
            }
        }

        static void test(int[] out) {
            for (int i = 0; i < a.length; i += SPECIES.length()) {
                IntVector av = IntVector.fromArray(SPECIES, a, i);
                av.lanewise(VectorOperators.LEADING_ZEROS_COUNT).intoArray(out, i);
            }
        }
    }

Run with:
`java -XX:UseAVX=2 -Xbatch -XX:CompileCommand=compileonly,TestVectorAPI::test -XX:CompileCommand=printcompilation,TestVectorAPI::test -XX:-PrintIdeal TestVectorAPI.java`

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659586753

From epeter at openjdk.org Fri Feb 14 15:16:24 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 14 Feb 2025 15:16:24 GMT
Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3]
In-Reply-To:
References:
Message-ID:

On Thu, 13 Feb 2025 19:23:51 GMT, Paul Sandoz wrote:

>> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Improve explanation of logic
>
> Thank you for fixing this.
> More broadly we should double check the intrinsics of Long/Integer.numberOfLeading/Trailing/Zeros (we added in the integration of Integration of JEP 426: Vector API) and follow up with any necessary tests and/or fixes in subsequent PRs.

@PaulSandoz @jatin-bhateja This should really have been caught by VectorAPI testing, but apparently not the whole range was covered.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659590282

From jkarthikeyan at openjdk.org Fri Feb 14 15:19:16 2025
From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan)
Date: Fri, 14 Feb 2025 15:19:16 GMT
Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3]
In-Reply-To:
References:
Message-ID: <4ha-_SH1qEhLz89ynu5rLd9XAtSIrmK_ArtBjDeKOao=.ef4b513a-0b69-416c-bd6d-56cfe6d4195f@github.com>

On Fri, 14 Feb 2025 15:13:43 GMT, Emanuel Peter wrote:

>> Thank you for fixing this. More broadly we should double check the intrinsics of Long/Integer.numberOfLeading/Trailing/Zeros (we added in the integration of Integration of JEP 426: Vector API) and follow up with any necessary tests and/or fixes in subsequent PRs.
>
> @PaulSandoz @jatin-bhateja This should really have been caught by VectorAPI testing, but apparently not the whole range was covered.

@eme64 Thanks for the detailed analysis! I just checked the Vector API, and I believe it's not affected. I think the issue is that we have conflicting semantics for leading zeros with autovectorization and the Vector API, since `Integer.numberOfLeadingZeros((byte)1) == 31`, but `ByteVector.broadcast(ByteVector.SPECIES_256, 1).lanewise(VectorOperators.LEADING_ZEROS_COUNT) == 7`. I think it shouldn't be too hard to adapt the current logic for autovectorization semantics, we could do something like this:

    dst = current_leading_zeros_byte(src);
    tmp = src >= 0 ? 24 : 0; // vector blend operation, should be constant 16 for shorts
    dst = dst + tmp

But I agree that since this is an entirely different issue it'd be good to do it separately.

@TobiHartmann Thanks a lot for the thorough test, I'll make sure to adapt it and add it to this PR.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659596580

From shade at openjdk.org Fri Feb 14 15:21:16 2025
From: shade at openjdk.org (Aleksey Shipilev)
Date: Fri, 14 Feb 2025 15:21:16 GMT
Subject: RFR: 8350097: Make Compilation::current() and Compile::current() safer
In-Reply-To: <4ELV07PUQEFeOLgzqbV3OoGjHVny5paw0Gk0awuJ3h0=.99faedbd-4909-4d8e-93eb-75d5697e797f@github.com>
References: <4ELV07PUQEFeOLgzqbV3OoGjHVny5paw0Gk0awuJ3h0=.99faedbd-4909-4d8e-93eb-75d5697e797f@github.com>
Message-ID:

On Fri, 14 Feb 2025 14:48:34 GMT, Thomas Stuefe wrote:

> Somewhat trivial.
>
> I recently hunted a bug for an hour until I realized that I had accessed ciEnv::compiler_data() as C2 `Compile` when, in fact, it was C1 `Compilation`. Stupid mistake, but an assert is easy to do and saves time.

src/hotspot/share/c1/c1_Compilation.hpp line 35:

> 33: #include "compiler/compilerDirectives.hpp"
> 34: #include "runtime/deoptimization.hpp"
> 35: #include "utilities/debug.hpp"

`DEBUG_ONLY` lives in `utilities/macros.hpp`, include that one directly?

src/hotspot/share/c1/c1_Compilation.hpp line 134:

> 132:
> 133:   static Compilation* current() {
> 134:     DEBUG_ONLY(ciEnv::current()->check_compiler_data_c1_or_null();)

So, can it be just some sort of:

    assert(CompilerThread::current()->compiler()->is_c1(), "sanity");

...without any other changes in `ciEnv`?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23635#discussion_r1956304092 PR Review Comment: https://git.openjdk.org/jdk/pull/23635#discussion_r1956311599 From epeter at openjdk.org Fri Feb 14 15:29:14 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 14 Feb 2025 15:29:14 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: <6bNEQk9vpUjHGLQUfnl_4bAPzoaU99oLWe0EICKTUJM=.e896ddc6-6f3b-4681-82fd-1d86b5d89d7b@github.com> Message-ID: On Fri, 14 Feb 2025 15:24:16 GMT, Jasmine Karthikeyan wrote: >> Shouln't L.6230 read as follows? >> >> // LZCNT = 31 - (biased_exp - 127) > >> Shouln't L.6230 read as follows? >> >> ``` >> // LZCNT = 31 - (biased_exp - 127) >> ``` > > @rgiulietti I believe you are correct, the comment there is wrong. I think it is a typo, it might be because the intrinsic logic computes `LZCNT = 32 - ((biased_exp - 127) + 1)` which is equivalent. I'll update the comments to make this more clear. @jaskarth > Thanks for the detailed analysis! I just checked the Vector API, and I believe it's not affected. - For the **floating-issue** of this PR: Vector API **IS AFFECTED**. See my example above. - For **truncation** issue: Vector API is **NOT** affected. For normal Java uses of `Integer.numberOfLeadingZeros` and `Integer.numberOfTrailingZeros`, we operate explicitly on 32 bits, no matter if the input is byte, short or int. So we must use some 32bit implementation of `CountLeadingZerosV`, no matter if it is on byte, short or int. But the VectorAPI explicitly operates on 8bit bytes, 16bit shorts and 32bit ints, and can use 8bit, 16bit and 32bit variants respectively of `CountLeadingZerosV`. Note there are other affected operations for the **truncation** issue. `Integer.reverse` is also affected, `Integer.numberOfTrailingZeros` too, and maybe others. 
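As a scalar illustration of the difference between the 32-bit semantics of `Integer.numberOfLeadingZeros` and a count taken inside an 8-bit lane, here is a minimal sketch. The class and helper names (`ByteLeadingZerosDemo`, `lzcnt8`, `lzcnt32FromLane`) are hypothetical and only model the lane behaviour in plain Java; they are not HotSpot code. The adjustment follows the `src >= 0 ? 24 : 0` idea mentioned earlier in this thread.

```java
public class ByteLeadingZerosDemo {
    // Hypothetical helper: leading zeros counted within an 8-bit lane,
    // modelled here with scalar code (result is in the range 0..8).
    static int lzcnt8(byte b) {
        return Integer.numberOfLeadingZeros(b & 0xFF) - 24;
    }

    // Adjust the 8-bit lane count to the 32-bit scalar semantics:
    // a non-negative byte is sign-extended with 24 extra zero bits,
    // a negative byte is sign-extended with ones, so nothing is added.
    static int lzcnt32FromLane(byte b) {
        int adjust = (b >= 0) ? 24 : 0;
        return lzcnt8(b) + adjust;
    }

    public static void main(String[] args) {
        // Exhaustively compare against the scalar Java method for all byte values.
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            byte b = (byte) i;
            if (lzcnt32FromLane(b) != Integer.numberOfLeadingZeros(b)) {
                throw new RuntimeException("Mismatch for " + b);
            }
        }
        System.out.println("lane count for (byte)1:     " + lzcnt8((byte) 1));
        System.out.println("adjusted count for (byte)1: " + lzcnt32FromLane((byte) 1));
    }
}
```

For shorts the same idea would presumably use a constant of 16 instead of 24.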
Fix alternatives for the SuperWord **truncation** issue: - Disable the affected operations in vectorizer (easiest to implement, but possible performance regression) - Cast to int, do operations in int, cast result back. - Implement special operations in the backend that know that they are operating with implicit 32 bits, even if the data is byte or short. So that would be 32-8 implicit leading zeros for byte, or 32-16 implicit leading zeros for short. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659617548 From epeter at openjdk.org Fri Feb 14 15:29:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 14 Feb 2025 15:29:15 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:45:34 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! 
>
> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision:
>
>   Improve explanation of logic

Here is the reproducer for the `Integer.reverse` **truncation** issue.

    public class TestShort {

        static short[] vals = new short[1024];
        static short[] results = new short[1024];
        static short v = (short)0x01234567;
        static short r = (short)Integer.reverse(v);

        public static void test() {
            for (int i = 0; i < 1024; ++i) {
                //results[i] = (short)Integer.numberOfLeadingZeros(vals[i]);
                //results[i] = (short)Integer.numberOfTrailingZeros(vals[i]);
                results[i] = (short)Integer.reverse(vals[i]);
            }
        }

        public static void main(String[] args) {
            for (int i = 0; i < 1024; ++i) {
                vals[i] = v;
            }
            for (int j = 0; j < 10_000; ++j) {
                test();
                for (int i = 0; i < 1024; ++i) {
                    if (results[i] != r) throw new RuntimeException("Wrong result " + results[i] + " at " + i + " expected " + r);
                }
            }
        }
    }

    java -Xbatch -XX:-TieredCompilation -XX:UseAVX=2 -XX:CompileCommand=compileonly,TestShort::test -XX:+TraceNewVectors TestShort.java
    CompileCommand: compileonly TestShort.test bool compileonly = true
    TraceNewVectors [AutoVectorization]: 980 LoadVector === 813 885 864 [[ ]] @short[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched #vectory (does not depend only on test, unknown control)
    TraceNewVectors [AutoVectorization]: 981 ReverseV === _ 980 [[ ]] #vectory
    TraceNewVectors [AutoVectorization]: 982 StoreVector === 882 885 866 981 [[ ]] @short[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched Memory: @short[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7;
    Exception in thread "main" java.lang.RuntimeException: Wrong result -6494 at 12 expected 0
        at TestShort.main(TestShort.java:48)

With `-Xint` this test passes.
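The two values in the exception message can be reproduced with scalar code. The following is a minimal sketch, assuming that the vectorized `ReverseV` on a `short` vector reverses the bits within each 16-bit lane; `reverse16` is a hypothetical helper that models that lane behaviour and is not part of the JDK.

```java
public class ReverseTruncationDemo {
    // Scalar Java semantics: the short is promoted to int, all 32 bits are
    // reversed, and the cast keeps only the low 16 bits of the result.
    static short reverseAsInt(short x) {
        return (short) Integer.reverse(x);
    }

    // Hypothetical model of a per-lane 16-bit reversal: only the low 16 bits
    // of the input are reversed, and they stay within the 16-bit lane.
    static short reverse16(short x) {
        return (short) (Integer.reverse(x) >>> 16);
    }

    public static void main(String[] args) {
        short v = (short) 0x01234567; // truncates to 0x4567, as in the reproducer
        System.out.println("32-bit semantics: " + reverseAsInt(v)); // the "expected" value
        System.out.println("16-bit lane:      " + reverse16(v));    // the "Wrong result" value
    }
}
```

With `v = 0x4567` the 32-bit reversal moves all set bits into the upper half, so the truncated result is 0, while the 16-bit reversal gives `0xE6A2`, which is -6494 as a `short`. That matches the "Wrong result -6494 ... expected 0" message above.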
------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659622652 From jkarthikeyan at openjdk.org Fri Feb 14 15:29:13 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 14 Feb 2025 15:29:13 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: <6bNEQk9vpUjHGLQUfnl_4bAPzoaU99oLWe0EICKTUJM=.e896ddc6-6f3b-4681-82fd-1d86b5d89d7b@github.com> References: <6bNEQk9vpUjHGLQUfnl_4bAPzoaU99oLWe0EICKTUJM=.e896ddc6-6f3b-4681-82fd-1d86b5d89d7b@github.com> Message-ID: On Thu, 13 Feb 2025 16:38:09 GMT, Raffaello Giulietti wrote: > Shouln't L.6230 read as follows? > > ``` > // LZCNT = 31 - (biased_exp - 127) > ``` @rgiulietti I believe you are correct, the comment there is wrong. I think it is a typo, it might be because the intrinsic logic computes `LZCNT = 32 - ((biased_exp - 127) + 1)` which is equivalent. I'll update the comments to make this more clear. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659616333 From jkarthikeyan at openjdk.org Fri Feb 14 15:55:15 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 14 Feb 2025 15:55:15 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 15:26:59 GMT, Emanuel Peter wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Improve explanation of logic > > Here the reproducer for `Integer.reverse` **truncation** issue. 
>
>
>     public class TestShort {
>
>         static short[] vals = new short[1024];
>         static short[] results = new short[1024];
>         static short v = (short)0x01234567;
>         static short r = (short)Integer.reverse(v);
>
>         public static void test() {
>             for (int i = 0; i < 1024; ++i) {
>                 //results[i] = (short)Integer.numberOfLeadingZeros(vals[i]);
>                 //results[i] = (short)Integer.numberOfTrailingZeros(vals[i]);
>                 results[i] = (short)Integer.reverse(vals[i]);
>             }
>         }
>
>         public static void main(String[] args) {
>             for (int i = 0; i < 1024; ++i) {
>                 vals[i] = v;
>             }
>             for (int j = 0; j < 10_000; ++j) {
>                 test();
>                 for (int i = 0; i < 1024; ++i) {
>                     if (results[i] != r) throw new RuntimeException("Wrong result " + results[i] + " at " + i + " expected " + r);
>                 }
>             }
>         }
>     }
>
>
>
>     java -Xbatch -XX:-TieredCompilation -XX:UseAVX=2 -XX:CompileCommand=compileonly,TestShort::test -XX:+TraceNewVectors TestShort.java
>     CompileCommand: compileonly TestShort.test bool compileonly = true
>     TraceNewVectors [AutoVectorization]: 980 LoadVector === 813 885 864 [[ ]] @short[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched #vectory (does not depend only on test, unknown control)
>     TraceNewVectors [AutoVectorization]: 981 ReverseV === _ 980 [[ ]] #vectory
>     TraceNewVectors [AutoVectorization]: 982 StoreVector === 882 885 866 981 [[ ]] @short[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched Memory: @short[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7;
>     Exception in thread "main" java.lang.RuntimeException: Wrong result -6494 at 12 expected 0
>         at TestShort.main(TestShort.java:48)
>
>
> With `-Xint` this test passes.

@eme64 Ah I meant Vector API wasn't impacted for the byte/short cases you mentioned earlier, I sent my comment before I saw your reply about the ints. But it would make sense that the ints are broken, since they emit the same IR.
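For reference, the float rounding problem on the int path can be shown in scalar Java. This is a minimal sketch under stated assumptions: `viaFloatExponent` models the exponent-based formula `LZCNT = 31 - (biased_exp - 127)` discussed in this thread, and `viaFloatExponentMasked` uses one possible masking (clearing the low 8 bits when the value is at least 2^24) that is in the spirit of the fix description but is not necessarily what the patch does. Both names are hypothetical.

```java
public class FloatLzcntDemo {
    // Exponent-based count for x > 0: LZCNT = 31 - (biased_exp - 127).
    // The int-to-float conversion rounds, which can bump the exponent.
    static int viaFloatExponent(int x) {
        int bits = Float.floatToRawIntBits((float) x);
        int biasedExp = (bits >>> 23) & 0xFF;
        return 31 - (biasedExp - 127);
    }

    // Assumed masking for illustration: once x >= 2^24, clear the low 8 bits
    // so that at most 24 significant bits remain and the conversion is exact.
    static int viaFloatExponentMasked(int x) {
        if (x == 0) return 32; // zero and negative inputs are handled separately in this sketch
        if (x < 0) return 0;
        if ((x >>> 24) != 0) {
            x &= 0xFFFFFF00;
        }
        return viaFloatExponent(x);
    }

    public static void main(String[] args) {
        int[] special = {
            0x01FFFFFF, 0x03FFFFFE, 0x07FFFFFC, 0x0FFFFFF8,
            0x1FFFFFF0, 0x3FFFFFE0, 0xFFFFFFFF
        };
        for (int x : special) {
            System.out.println(Integer.toHexString(x)
                + " expected=" + Integer.numberOfLeadingZeros(x)
                + " masked=" + viaFloatExponentMasked(x));
        }
        // 0x01FFFFFF rounds up to 2^25 as a float, so the unmasked formula is off by one.
        System.out.println("unmasked 0x01FFFFFF: " + viaFloatExponent(0x01FFFFFF));
    }
}
```

For `0x01FFFFFF` the unmasked formula yields 6 while the correct count is 7, which is the mismatch described in the PR text.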
The other operations being impacted by the truncation issue too is a bit worrying. I tested really quickly on an aarch64 device and the cases in your `TestShort.java` broke there too, just to confirm that it's not a backend specific issue. From your potential solutions maybe it would be best to temporarily disable the vectorization now to make sure we're not miscompiling, and implement option 2 when we have #23413 by always emitting an `int` vector and letting the autovectorizer fill in the casts. Option 3 likely would be the best performance wise, but would increase the maintenance burden of needing 2 different code paths for the same operation. It could be done later if a need for better performance in this edge case arises. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659688053 From roland at openjdk.org Fri Feb 14 15:57:54 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 14 Feb 2025 15:57:54 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v9] In-Reply-To: References: Message-ID: <4RO3ysBh6pWId8Na0pTdO9X5sBvCh2F5l-KO3OdHF4k=.31a599cc-195f-4330-a4ce-0618209635de@github.com> > This change refactors `RShiftI`/`RShiftL` `Ideal`, `Identity` and > `Value` because the `int` and `long` versions are very similar and so > there's no logic duplication. In the process, support for some extra > transformations is added to `RShiftL`. I also added some new test > cases. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 15 additional commits since the last revision: - review - review - review - Merge branch 'master' into JDK-8349361 - Update src/hotspot/share/opto/mulnode.cpp Co-authored-by: Emanuel Peter - Update src/hotspot/share/opto/mulnode.cpp Co-authored-by: Emanuel Peter - review - Update src/hotspot/share/opto/mulnode.hpp Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> - Update src/hotspot/share/opto/mulnode.cpp Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> - Update src/hotspot/share/opto/mulnode.cpp Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> - ... and 5 more: https://git.openjdk.org/jdk/compare/a579b9cd...5b05d222 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23438/files - new: https://git.openjdk.org/jdk/pull/23438/files/0f1b76ab..5b05d222 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=07-08 Stats: 19479 lines in 943 files changed: 11149 ins; 4291 del; 4039 mod Patch: https://git.openjdk.org/jdk/pull/23438.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23438/head:pull/23438 PR: https://git.openjdk.org/jdk/pull/23438 From shade at openjdk.org Fri Feb 14 15:58:14 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 14 Feb 2025 15:58:14 GMT Subject: RFR: 8350097: Make Compilation::current() and Compile::current() safer In-Reply-To: References: <4ELV07PUQEFeOLgzqbV3OoGjHVny5paw0Gk0awuJ3h0=.99faedbd-4909-4d8e-93eb-75d5697e797f@github.com> Message-ID: <9URkr_mh4y57sOCPWUAzBRVaN61BifQCX9yXWyssA_c=.2d97451c-34b9-4072-9fa7-ba4f7ba84692@github.com> On Fri, 14 Feb 2025 15:18:03 GMT, Aleksey Shipilev wrote: >> Somewhat trivial. >> >> I recently hunted a bug for an hour until I realized that I had accessed ciEnv::compiler_data() as C2 `Compile` when, in fact, it was C1 `Compilation`. 
Stupid mistake, but an assert is easy to do and saves time. > > src/hotspot/share/c1/c1_Compilation.hpp line 134: > >> 132: >> 133: static Compilation* current() { >> 134: DEBUG_ONLY(ciEnv::current()->check_compiler_data_c1_or_null();) > > So, can it be just some sort of: > > > assert(CompilerThread::current()->compiler()->is_c1(), "sanity"); > > > ...without any other changes in `ciEnv`? Context: I think `ciEnv` is pretty compiler-agnostic, and it would be better to avoid exposing the fact C1/C2 exist to that interface, even if only for asserts. Seems cleaner to check that we are calling `Compilation::current()` from C1 and `Compile::current()` from C2. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23635#discussion_r1956372247 From roland at openjdk.org Fri Feb 14 15:57:54 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 14 Feb 2025 15:57:54 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v7] In-Reply-To: References: <3cT_HJ9dj5J4NFrLzmvYUdUy4uee6Ltcm6d20YP3jm0=.aa20c25e-c097-4e59-9d82-12aa2c3b4422@github.com> Message-ID: On Thu, 13 Feb 2025 10:46:25 GMT, Emanuel Peter wrote: > I left some comments / suggestions. Thanks for reviewing this. I pushed new commits that I think cover your comments. > I'm also wondering about testing. How good do you think test coverage is? Are all cases covered? How about the edge-cases? Could we improve the coverage with randomization somehow? Transformations that I refactored and now apply to both the long and int `RShift` should now have test cases. I tweaked the test cases so there are now randomized. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23438#issuecomment-2659694153 From epeter at openjdk.org Fri Feb 14 16:29:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 14 Feb 2025 16:29:11 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:45:34 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Improve explanation of logic So far I found 4 operations that have issues with **truncation**. 

    // java -Xbatch -XX:UseAVX=2 -XX:CompileCommand=compileonly,TestShort::test -XX:CompileCommand=printcompilation,TestShort::test -XX:CompileCommand=TraceAutoVectorization,TestShort::test,SW_REJECTIONS -XX:+TraceNewVectors TestShort.java

    import java.util.Random;

    public class TestShort {
        static Random RANDOM = new Random();
        static int SIZE = 16 * 1024;
        static short[] a = new short[SIZE];
        static short[] r = new short[SIZE];
        static short[] gold = new short[SIZE];

        public static void test(short[] out) {
            for (int i = 0; i < a.length; ++i) {
                // List of problematic operations, where truncation happens:
                //out[i] = (short)Integer.numberOfLeadingZeros(a[i]);
                //out[i] = (short)Integer.numberOfTrailingZeros(a[i]);
                //out[i] = (short)Integer.reverse(a[i]);
                out[i] = (short)Integer.bitCount(a[i]);

                // Seem ok, maybe because they do not vectorize
                //out[i] = (short)Integer.signum(a[i]);
                //out[i] = (short)Integer.highestOneBit(a[i]);
                //out[i] = (short)Integer.lowestOneBit(a[i]);

                // While we are at it, we should also have tests for this, even though it currently does not vectorize,
                // but it may in the future and then we have to catch the truncation.
                // out[i] = (short)Long.bitCount(a[i]);
            }
        }

        public static void main(String[] args) {
            for (int i = 0; i < a.length; ++i) {
                a[i] = (short)RANDOM.nextInt();
            }
            System.out.println("Compute gold:");
            test(gold);
            System.out.println("Compile and compute:");
            for (int j = 0; j < 1000; ++j) {
                test(r);
            }
            System.out.println("Verify:");
            for (int i = 0; i < a.length; ++i) {
                if (gold[i] != r[i]) throw new RuntimeException("Wrong result " + r[i] + " vs " + gold[i] + " for " + a[i] + " at " + i);
            }
        }
    }

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659767156

From jkarthikeyan at openjdk.org Fri Feb 14 16:29:12 2025
From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan)
Date: Fri, 14 Feb 2025 16:29:12 GMT
Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3]
In-Reply-To:
References:
Message-ID: <9MAYs_8TznDWmkZ-e4vUh666VC8Ju0Xe-A0qAQOyLH0=.7f97f41c-be3e-4138-8ec8-254b0f610494@github.com>

On Wed, 12 Feb 2025 19:45:34 GMT, Jasmine Karthikeyan wrote:

>> Hi all,
>> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up.
>>
>> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior.
I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Improve explanation of logic It seems we had to fix this before for `Integer.reverseBytes`: https://bugs.openjdk.org/browse/JDK-8305324 ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659767292 From psandoz at openjdk.org Fri Feb 14 16:34:14 2025 From: psandoz at openjdk.org (Paul Sandoz) Date: Fri, 14 Feb 2025 16:34:14 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: <-4utUH59VaPXdk9h2C77CDimuHC0qIbMTWzWb9NZ0AQ=.9335e5a6-9721-48ca-95fd-75430c853111@github.com> On Thu, 13 Feb 2025 19:23:51 GMT, Paul Sandoz wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Improve explanation of logic > > Thank you for fixing this. More broadly we should double check the intrinsics of Long/Integer.numberOfLeading/Trailing/Zeros (we added in the integration of Integration of JEP 426: Vector API) and follow up with any necessary tests and/or fixes in subsequent PRs. > @PaulSandoz @jatin-bhateja This should really have been caught by VectorAPI testing, but apparently not the whole range was covered. Correct, there are tests that cover the integral vectors and masked variants, and there are various input data shapes generated but I don't think they generate the right values to provoke the issue. They do compare against the scalar method call (and i suspect C2 does not optimize given how results are compared - we could strengthen this by restricting inlining when asserting results). 
https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#L5967 https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte256VectorTests.java#L5923 ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659778515 From roland at openjdk.org Fri Feb 14 16:36:10 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 14 Feb 2025 16:36:10 GMT Subject: RFR: 8349139: C2: Div looses dependency on condition that guarantees divisor not null in counted loop In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 17:19:45 GMT, Quan Anh Mai wrote: > > For any `Phi`? This seems like an issue that's specific to the counted loop iv. I don't think we want to add `CastII` nodes unless we're sure they are needed. > > I think it is probably because we are more aggressive with the type of loop phis. Conceptually, a `Phi` is pinned, and idealizing it into a floating node is incorrect. So, there may be issues lurking around due to this. We can introduce another phase that tries to remove all `CastNode` and pins all nodes that need that dependency (such as `Div`, loads, etc). What do you think? In general a `Phi`'s type is the union of the types of its inputs. There's no narrowing happening at the `Phi`. So I don't think there's any implicit dependency from some floating node that uses the `Phi` on the `Phi`'s `Region`. Counted loops are a bit different: the loop's `Phi` is the union of the type of the value on loop entry and a narrowed type for the value flowing through the backedge. The loop exit condition is what narrows the type the value that flows along the backedge and, as a result, of the loop `Phi`. Trouble occurs when the loop is cloned by pre/main/post. Then the `Phi` for the main or post loop inherits the type of the `Phi` before transformation, the one that depends on the loop exit condition. Say the pre loop runs iterations 0 to p and the main loop runs iteration p+1 to m. 
In the loop before transformation, to reach iteration p+1, we have to go through the backedge at least once. Once pre/main/post loops are created, going through the backedge to reach iteration p+1 implies exiting the pre loop and passing the zero trip guard for the main loop. So the type of the main loop's `Phi` is dependent on the zero trip guard for the main loop which is why we need the `CastII`. AFAICT, this issue really only exists for counted loop `Phi`s and once some transformation has "unfolded" the loop exit test. Now that I think more about it, a similar issue may happen with peeling (I will look into that separately). But I don't think it's a generic issue that affects all `Phi`s and as such it should be fixed by adding extra dependencies where they are needed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23617#issuecomment-2659784027 From epeter at openjdk.org Fri Feb 14 16:41:14 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 14 Feb 2025 16:41:14 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: <9MAYs_8TznDWmkZ-e4vUh666VC8Ju0Xe-A0qAQOyLH0=.7f97f41c-be3e-4138-8ec8-254b0f610494@github.com> References: <9MAYs_8TznDWmkZ-e4vUh666VC8Ju0Xe-A0qAQOyLH0=.7f97f41c-be3e-4138-8ec8-254b0f610494@github.com> Message-ID: On Fri, 14 Feb 2025 16:26:13 GMT, Jasmine Karthikeyan wrote: > It seems we had to fix this before for Integer.reverseBytes: https://bugs.openjdk.org/browse/JDK-8305324 @jaskarth nice find! Sadly we did not find the other affected operations at the time.
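
To illustrate why the operations named in this thread are sensitive to truncation, here is a minimal Java sketch. It assumes (this is not confirmed anywhere in the thread) that a vectorized version effectively operates on 16-bit lanes, whereas the scalar code widens each `short` to `int` first. The class `ShortLaneSketch` and its `lane*` helpers are hypothetical and only emulate that behaviour in plain Java; they are not HotSpot code.

```java
public class ShortLaneSketch {
    // Scalar Java semantics: the short is sign-extended to int, the
    // operation runs on all 32 bits, and the result is narrowed to short.
    static short scalarBitCount(short x) {
        return (short) Integer.bitCount(x);
    }

    static short scalarLeadingZeros(short x) {
        return (short) Integer.numberOfLeadingZeros(x);
    }

    // Hypothetical 16-bit lane semantics (an assumption for illustration):
    // only the low 16 bits of the value take part in the operation.
    static short laneBitCount(short x) {
        return (short) Integer.bitCount(x & 0xFFFF);
    }

    static short laneLeadingZeros(short x) {
        // Leading zeros counted within a 16-bit lane instead of 32 bits.
        return (short) (Integer.numberOfLeadingZeros(x & 0xFFFF) - 16);
    }

    public static void main(String[] args) {
        short minusOne = -1;
        short one = 1;
        System.out.println("bitCount(-1): scalar=" + scalarBitCount(minusOne)
                + " lane=" + laneBitCount(minusOne));
        System.out.println("nlz(1):       scalar=" + scalarLeadingZeros(one)
                + " lane=" + laneLeadingZeros(one));
    }
}
```

Under that assumption the two views disagree for inputs such as -1 (bit count) and 1 (leading zeros), which is the kind of mismatch a gold-versus-compiled comparison like the one in `TestShort` would flag.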
------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2659793625
From aph-open at littlepinkcloud.com Fri Feb 14 18:01:23 2025 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Fri, 14 Feb 2025 18:01:23 +0000 Subject: RFD: Subsampled profile counters in HotSpot Message-ID: <0bbafb6e-c9f9-4c16-a278-068b5082c3e2@littlepinkcloud.com>

This is JDK-8134940: TieredCompilation profiling can exhibit poor scalability.

(Thanks to Igor Veresov for the inspiration and advice!)

The poster child for awful profile scaling is SOR, https://cr.openjdk.org/~redestad/scratch/sor-jmh/

Here is the TieredStopAtLevel=3 performance with just one thread:

Benchmark   Mode  Cnt  Score   Error  Units
JGSOR.test  avgt    3  9.144 ± 1.044  ms/op

and with 16 hardware threads:

JGSOR.test  avgt    3  1177.982 ± 5.108  ms/op

So it's a 100-fold performance drop. This is a real problem I've seen in production deployments.

I've been looking at the idea of incrementing profile counters less frequently, recording events in a pseudo-random subsampled way. For example, you could set the ProfileCaptureRatio=16 and then, at random, only 1/16th of counter updates would be recorded.

In theory, those 1/16th of updates should be representative, but theory is not necessarily the same as practice. Here's where Statistics comes to our rescue, though. I am not a statistician, but I think the central theorem of statistics applies. It "describes the asymptotic behaviour of the empirical distribution function as the number of independent and identically distributed observations grows. Specifically, the empirical distribution function converges uniformly to the true distribution function almost surely." (It doesn't say how long that'll take, though.)

So, as long as the random-number generator we use is completely uncorrelated with the process we're profiling, even a substantially undersampled set of profile counters should converge to the same ratios we'd have if not undersampling.
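
The convergence argument can be tried out with a small stand-alone simulation. This is only a sketch under stated assumptions: the class and method names are hypothetical, the 30% branch probability is made up, and the generator is a plain 32-bit linear congruential generator picked for illustration. It models a single branch-taken counter and compares the taken ratio seen by a full counter with the ratio seen by a counter that records only about 1 in `ratio` events.

```java
import java.util.Random;

public class SubsampledCounterSketch {
    // One step of a 32-bit linear congruential generator
    // (illustrative constants; arithmetic wraps modulo 2^32).
    static int step(int state) {
        return state * 1103515245 + 12345;
    }

    /**
     * Simulates `events` executions of one branch.
     * Returns {fullTaken, fullTotal, sampledTaken, sampledTotal}.
     */
    static long[] simulate(int events, int ratio, long seed) {
        if (ratio < 2) {
            throw new IllegalArgumentException("this sketch needs ratio >= 2");
        }
        Random branch = new Random(seed);            // the "profiled process"
        int threshold = (int) ((1L << 32) / ratio);  // record if state < 2^32 / ratio (unsigned)
        int state = 1;
        long[] c = new long[4];
        for (int i = 0; i < events; i++) {
            boolean taken = branch.nextInt(10) < 3;  // assumed 30% taken probability
            c[1]++;
            if (taken) c[0]++;
            state = step(state);
            if (Integer.compareUnsigned(state, threshold) < 0) {
                c[3]++;
                if (taken) c[2]++;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        long[] c = simulate(1_000_000, 16, 42L);
        System.out.printf("full:    %d/%d = %.4f%n", c[0], c[1], (double) c[0] / c[1]);
        System.out.printf("sampled: %d/%d = %.4f%n", c[2], c[3], (double) c[2] / c[3]);
    }
}
```

The absolute counts shrink by roughly the capture ratio, so any consumer that compares counters against fixed thresholds would need to scale them; only the ratios are expected to agree.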

I have done a very rough proof-of-concept implementation of subsampling. It's at https://github.com/openjdk/jdk/pull/23643/files It's not fit for anything much except to demonstrate the feasibility of using this approach. While the code isn't great, I think that it does fairly represent the performance we could expect if we decided to go with this approach.

These are the JMH results for SOR, 16 threads with various subsampling ratios, controlled by -XX:ProfileCaptureRatio=n:

n     Benchmark   Mode  Cnt     Score     Error  Units
1     JGSOR.test  avgt    3  1177.982 ±   5.108  ms/op
2     JGSOR.test  avgt    3   622.435 ± 101.466  ms/op
4     JGSOR.test  avgt    3   310.496 ±  17.681  ms/op
8     JGSOR.test  avgt    3   170.867 ±   0.911  ms/op
16    JGSOR.test  avgt    3    98.210 ±   9.236  ms/op
32    JGSOR.test  avgt    3    58.137 ±   3.501  ms/op
64    JGSOR.test  avgt    3    35.384 ±   0.922  ms/op
128   JGSOR.test  avgt    3    22.076 ±   0.197  ms/op
256   JGSOR.test  avgt    3    15.459 ±   2.312  ms/op
1024  JGSOR.test  avgt    3    10.180 ±   0.426  ms/op

With n=1, there is no undersampling at all, and we see the catastrophic slowdown which is the subject of this bug report. The performance improves rapidly, but not quite linearly, with increasing subsampling ratios, as you'd expect.

/build/linux-x86_64-server-release/jdk/bin/java -jar ./build/linux-x86_64-server-release/images/test/micro/benchmarks.jar SOR -t 16 -f 1 -wi 3 -i 3 -r 1 -w 1 -jvmArgs ' -XX:TieredStopAtLevel=3 -XX:ProfileCaptureRatio=16'

Surprisingly, the overhead for randomized subsampling isn't so great. Here's the speed of the same JGSOR.test with only 1 thread, with various subsampling ratios:

1     JGSOR.test  avgt    3     9.087 ±
0.041 ms/op (not undersampled)
2     JGSOR.test  avgt    3    22.431 ± 0.079  ms/op
4     JGSOR.test  avgt    3    14.291 ± 0.048  ms/op
8     JGSOR.test  avgt    3    10.316 ± 0.021  ms/op
16    JGSOR.test  avgt    3     9.360 ± 0.022  ms/op
32    JGSOR.test  avgt    3     9.196 ± 0.042  ms/op

We can see that if we undersample 16-fold, then the single-threaded overhead for profile counting is no worse than it is with no undersampling at all. We could, at least in theory, ship HotSpot with 32-fold undersampling as a default, and no one would ever notice, except that the poor scaling behaviour would be very much reduced.

However, there is a cost. C1 code size does increase because we have to step the random-number generator at every profiling site. It's not too bad, because the added code is just something like

   mov    ebx,0x41c64e6d
   imul   r14d,ebx
   add    r14d,0x3039
   cmp    r14d,0x1000000   // ProfileCaptureRatio
   jae    0x00007fffe099c197
   ... profiling code

but it's definitely bigger.

The POC is C1 only, x86 only, and I haven't done anything about profiling the interpreter. I'm sure it has bugs. It'll probably crash if you push it too hard. But it is very representative, I think, of how well a finished implementation would perform.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd.
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From kvn at openjdk.org Fri Feb 14 18:00:10 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Feb 2025 18:00:10 GMT Subject: RFR: 8350097: Make Compilation::current() and Compile::current() safer In-Reply-To: <9URkr_mh4y57sOCPWUAzBRVaN61BifQCX9yXWyssA_c=.2d97451c-34b9-4072-9fa7-ba4f7ba84692@github.com> References: <4ELV07PUQEFeOLgzqbV3OoGjHVny5paw0Gk0awuJ3h0=.99faedbd-4909-4d8e-93eb-75d5697e797f@github.com> <9URkr_mh4y57sOCPWUAzBRVaN61BifQCX9yXWyssA_c=.2d97451c-34b9-4072-9fa7-ba4f7ba84692@github.com> Message-ID: On Fri, 14 Feb 2025 15:55:43 GMT, Aleksey Shipilev wrote: >> src/hotspot/share/c1/c1_Compilation.hpp line 134: >> >>> 132: >>> 133: static Compilation* current() { >>> 134: DEBUG_ONLY(ciEnv::current()->check_compiler_data_c1_or_null();) >> >> So, can it be just some sort of: >> >> >> assert(CompilerThread::current()->compiler()->is_c1(), "sanity"); >> >> >> ...without any other changes in `ciEnv`? > > Context: I think `ciEnv` is pretty compiler-agnostic, and it would be better to avoid exposing the fact C1/C2 exist to that interface, even if only for asserts. Seems cleaner to check that we are calling `Compilation::current()` from C1 and `Compile::current()` from C2. I agree with Aleksey here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23635#discussion_r1956531575 From vladimir.kozlov at oracle.com Fri Feb 14 18:14:07 2025 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 14 Feb 2025 10:14:07 -0800 Subject: RFD: Subsampled profile counters in HotSpot In-Reply-To: <0bbafb6e-c9f9-4c16-a278-068b5082c3e2@littlepinkcloud.com> References: <0bbafb6e-c9f9-4c16-a278-068b5082c3e2@littlepinkcloud.com> Message-ID: Thank you, Andrew, for sharing this data from your experiment. This is indeed very promising. 
Thanks,
Vladimir K

On 2/14/25 10:01 AM, Andrew Haley wrote:
> This is JDK-8134940: TieredCompilation profiling can exhibit poor
> scalability.
>
> (Thanks to Igor Veresov for the inspiration and advice!)
>
> The poster child for awful profile scaling is SOR,
> https://cr.openjdk.org/~redestad/scratch/sor-jmh/
>
> Here is the TieredStopAtLevel=3 performance with just one thread:
>
> Benchmark   Mode  Cnt  Score   Error  Units
> JGSOR.test  avgt    3  9.144 ± 1.044  ms/op
>
> and with 16 hardware threads:
>
> JGSOR.test  avgt    3  1177.982 ± 5.108  ms/op
>
> So it's a 100-fold performance drop. This is a real problem I've
> seen in production deployments.
>
> I've been looking at the idea of incrementing profile counters less
> frequently, recording events in a pseudo-random subsampled way. For
> example, you could set the ProfileCaptureRatio=16 and then, at random,
> only 1/16th of counter updates would be recorded.
>
> In theory, those 1/16th of updates should be representative, but
> theory is not necessarily the same as practice. Here's where
> Statistics comes to our rescue, though. I am not a statistician, but I
> think the central theorem of statistics applies. It "describes the
> asymptotic behaviour of the empirical distribution function as the
> number of independent and identically distributed observations grows.
> Specifically, the empirical distribution function converges uniformly
> to the true distribution function almost surely." (It doesn't say how
> long that'll take, though.)
>
> So, as long as the random-number generator we use is completely
> uncorrelated with the process we're profiling, even a substantially
> undersampled set of profile counters should converge to the same
> ratios we'd have if not undersampling.
>
> The poster child for awful profile scaling is SOR,
> https://cr.openjdk.org/~redestad/scratch/sor-jmh/
>
> Here is the TieredStopAtLevel=3 performance with just one thread:
>
> Benchmark   Mode  Cnt  Score   Error  Units
> JGSOR.test  avgt    3  9.144 ± 1.044  ms/op
>
> and with 16 hardware threads:
>
> JGSOR.test  avgt    3  1177.982 ± 5.108  ms/op
>
> So it's a 100-fold performance drop.
>
> I have done a very rough proof-of-concept implementation of
> subsampling. It's at
> https://github.com/openjdk/jdk/pull/23643/files
> It's not fit for anything much except to demonstrate the feasibility
> of using this approach. While the code isn't great, I think that it
> does fairly represent the performance we could expect if we decided to
> go with this approach.
>
> These are the JMH results for SOR, 16 threads with various subsampling
> ratios, controlled by -XX:ProfileCaptureRatio=n:
>
> n     Benchmark   Mode  Cnt     Score     Error  Units
>
> 1     JGSOR.test  avgt    3  1177.982 ±   5.108  ms/op
> 2     JGSOR.test  avgt    3   622.435 ± 101.466  ms/op
> 4     JGSOR.test  avgt    3   310.496 ±  17.681  ms/op
> 8     JGSOR.test  avgt    3   170.867 ±   0.911  ms/op
> 16    JGSOR.test  avgt    3    98.210 ±   9.236  ms/op
> 32    JGSOR.test  avgt    3    58.137 ±   3.501  ms/op
> 64    JGSOR.test  avgt    3    35.384 ±   0.922  ms/op
> 128   JGSOR.test  avgt    3    22.076 ±   0.197  ms/op
> 256   JGSOR.test  avgt    3    15.459 ±   2.312  ms/op
> 1024  JGSOR.test  avgt    3    10.180 ±   0.426  ms/op
>
> With n=1, there is no undersampling at all, and we see the
> catastrophic slowdown which is the subject of this bug report. The
> performance improves rapidly, but not quite linearly, with increasing
> subsampling ratios, as you'd expect.
>
> /build/linux-x86_64-server-release/jdk/bin/java -jar ./build/linux-
> x86_64-server-release/images/test/micro/benchmarks.jar SOR -t 16 -f 1 -
> wi 3 -i 3 -r 1 -w 1 -jvmArgs ' -XX:TieredStopAtLevel=3 -
> XX:ProfileCaptureRatio=16'
>
> Surprisingly, the overhead for randomized subsampling isn't so great.
> Here's the speed of the same JGSOR.test with only 1 thread, with
> various subsampling ratios:
>
> 1     JGSOR.test  avgt    3     9.087 ± 0.041  ms/op (not undersampled)
> 2     JGSOR.test  avgt    3    22.431 ± 0.079  ms/op
> 4     JGSOR.test  avgt    3    14.291 ± 0.048  ms/op
> 8     JGSOR.test  avgt    3    10.316 ± 0.021  ms/op
> 16    JGSOR.test  avgt    3     9.360 ± 0.022  ms/op
> 32    JGSOR.test  avgt    3     9.196 ± 0.042  ms/op
>
> We can see that if we undersample 16-fold, then the single-threaded
> overhead for profile counting is no worse than it is with no
> undersampling at all. We could, at least in theory, ship HotSpot with
> 32-fold undersampling as a default, and no one would ever notice,
> except that the poor scaling behaviour would be very much reduced.
>
> However, there is a cost. C1 code size does increase because we have
> to step the random-number generator at every profiling site. It's not
> too bad, because the added code is just something like
>
>    mov    ebx,0x41c64e6d
>    imul   r14d,ebx
>    add    r14d,0x3039
>    cmp    r14d,0x1000000   // ProfileCaptureRatio
>    jae    0x00007fffe099c197
>    ... profiling code
>
> but it's definitely bigger.
>
> The POC is C1 only, x86 only, and I haven't done anything about
> profiling the interpreter. I'm sure it has bugs. It'll probably crash
> if you push it too hard. But it is very representative, I think, of how
> well a finished implementation would perform.
>
From qamai at openjdk.org Fri Feb 14 18:27:14 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 14 Feb 2025 18:27:14 GMT Subject: RFR: 8349139: C2: Div looses dependency on condition that guarantees divisor not null in counted loop [v2] In-Reply-To: References: Message-ID: <52OYoC5__FdcN8OLwVgdNlb6Fz_IFo8UyKy3GUp5DiM=.708f1ee8-dbbb-4abf-8de0-d94b3b1e2ef6@github.com> On Thu, 13 Feb 2025 16:57:30 GMT, Roland Westrelin wrote: >> The test crashes because of a division by zero. The `Div` node for >> that one is initially part of a counted loop. The control input of the >> node is cleared because the divisor is non zero.
This is because the >> divisor depends on the loop phi and the type of the loop phi is >> narrowed down when the counted loop is created. pre/main/post loops >> are created, unrolling happens, the main loop looses its backedge. The >> `Div` node can then float above the zero trip guard for the main >> loop. When the zero trip guard is not taken, there's no guarantee the >> divisor is non zero so the `Div` node should be pinned below it. >> >> I propose we revert the change I made with 8334724 which removed >> `PhaseIdealLoop::cast_incr_before_loop()`. The `CastII` that this >> method inserted was there to handle exactly this problem. It was added >> initially for a similar issue but with array loads. That problem with >> loads is handled some other way now and that's why I thought it was >> safe to proceed with the removal. >> >> The code in this patch is somewhat different from the one we had >> before for a couple reasons: >> >> 1- assert predicate code evolved and so previous logic can't be >> resurrected as it was. >> >> 2- the previous logic has a bug. >> >> Regarding 1-: during pre/main/post loop creation, we used to add the >> `CastII` and then to add assertion predicates (so assertion predicates >> depended on the `CastII`). Then when unrolling, when assertion >> predicates are updated, we would skip over the `CastII`. What I >> propose here is to add the `CastII` after assertion predicates are >> added. As a result, they don't depend on the `CastII` and there's no >> need for any extra logic when unrolling happens. This, however, >> doesn't work when the assertion predicates are added by RCE. In that >> case, I had to add logic to skip over the `CastII` (similar to what >> existed before I removed it). >> >> Regarding 2-: previous implementation for >> `PhaseIdealLoop::cast_incr_before_loop()` would add the `CastII` at >> the first loop `Phi` it encounters that's a use of the loop increment: >> it's usually the iv but not always. 
I tweaked the test case to show, >> this bug can actually cause a crash and changed the logic for >> `PhaseIdealLoop::cast_incr_before_loop()` accordingly. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - Merge branch 'master' into JDK-8349139 > - fix & test Hmmm, may be you are right. I think adding a comment at `PhiNode` saying that people must not rely on it being pinned at the `Region` for dependencies would be a wise move, I can't think of any reason for that besides value narrowing right now but being pinned is a property of `Phi` regardless and we should tell people not to rely on this behaviour. For this bug, I think a more general fix is to try to compare the type of the `Phi` with that of the input it is going to be replaced with. If the former is not wider than the latter then we add a `CastNode`, since the cast is only about value range, not strict dependency, we can use `CarryDependency` instead of `UnconditionalDependency`. Am I right? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23617#issuecomment-2659999245 From stuefe at openjdk.org Fri Feb 14 18:53:11 2025 From: stuefe at openjdk.org (Thomas Stuefe) Date: Fri, 14 Feb 2025 18:53:11 GMT Subject: RFR: 8350097: Make Compilation::current() and Compile::current() safer In-Reply-To: References: <4ELV07PUQEFeOLgzqbV3OoGjHVny5paw0Gk0awuJ3h0=.99faedbd-4909-4d8e-93eb-75d5697e797f@github.com> <9URkr_mh4y57sOCPWUAzBRVaN61BifQCX9yXWyssA_c=.2d97451c-34b9-4072-9fa7-ba4f7ba84692@github.com> Message-ID: On Fri, 14 Feb 2025 17:57:31 GMT, Vladimir Kozlov wrote: >> Context: I think `ciEnv` is pretty compiler-agnostic, and it would be better to avoid exposing the fact C1/C2 exist to that interface, even if only for asserts. Seems cleaner to check that we are calling `Compilation::current()` from C1 and `Compile::current()` from C2. > > I agree with Aleksey here. Yes, I like this too. Simpler. 
Okay, I try that. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23635#discussion_r1956588610 From vlivanov at openjdk.org Fri Feb 14 19:30:17 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 14 Feb 2025 19:30:17 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v2] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Wed, 12 Feb 2025 21:09:31 GMT, Dean Long wrote: >> When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. >> >> In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. >> >> Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. I need help testing those asserts for arm32, ppc, riscv, and s390. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > fix bounds checks Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23557#pullrequestreview-2618641095 From vlivanov at openjdk.org Fri Feb 14 19:30:18 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 14 Feb 2025 19:30:18 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v2] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> <7xJgm0ScXMp4iRaH7Sf5QfsrTv2jOV4078kPqn3aoCs=.63303086-b4bd-47c5-9bd5-e69e28f75f4c@github.com> Message-ID: On Thu, 13 Feb 2025 21:23:55 GMT, Dean Long wrote: >> As far as I can tell, it was never needed. If an invokedynamic or invokehandle adds an appendix, then it will show up in the callee, and will be reflected in the caller args size, so there is no mismatch. As far as the JVM is concerned, an invokedynamic/invokehandle looks like a call to a JVM-generated adapter. The only way for invokedynamic/invokehandle to cause an argument mismatch is if the JVM resolved the call-site to an adapter that was actually a MethodHandle linker. That is the exception I describe in the comment below. If we ever allowed the JVM to do that, then several other checks would also need to be fixed. >> For the record, this code used to call cur.is_method_handle_invoke(), which was also wrong, but at least it had a name closer to what we would want. Ideally, something like is_method_handle_linker_invoke() that checks for linkToVirtual, linkToStatic, linkToSpecial, and linkToInterface would have been better. >> The old comment about "arbitrary chains of calls" seems to be left over from an early JSR292 feature known as Ricochet Frames. > > For the curious, it is still possible create an arbitrarily long chain of linkTo calls, but only trusted code would be able to do that, so I'm not addressing this issue in this PR. Thanks for the clarifications! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23557#discussion_r1956626058 From rgiulietti at openjdk.org Fri Feb 14 21:29:16 2025 From: rgiulietti at openjdk.org (Raffaello Giulietti) Date: Fri, 14 Feb 2025 21:29:16 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: <9MAYs_8TznDWmkZ-e4vUh666VC8Ju0Xe-A0qAQOyLH0=.7f97f41c-be3e-4138-8ec8-254b0f610494@github.com> References: <9MAYs_8TznDWmkZ-e4vUh666VC8Ju0Xe-A0qAQOyLH0=.7f97f41c-be3e-4138-8ec8-254b0f610494@github.com> Message-ID: On Fri, 14 Feb 2025 16:26:13 GMT, Jasmine Karthikeyan wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Improve explanation of logic > > It seems we had to fix this before for `Integer.reverseBytes`: https://bugs.openjdk.org/browse/JDK-8305324

@jaskarth I think the implementation can be simplified like so, regardless of whether `src` < 2^P (P = 24 for `float`):

    ...
    t = src >>> 1
    t = ~t & src
    dst = (float) t
    ...

This ensures that the leading 1 maintains its position and that the bit immediately to its right (if any) is set to 0. The other bits further to the right are irrelevant. In this way, the subsequent conversion to `float` cannot "cross the bit boundary", as you express in the initial paragraph, while still maintaining the biased exponent intact.

WDYT?
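
The masking idea can be checked in scalar Java with a short sketch. The class and method names are hypothetical, and reading the exponent with `Math.getExponent` is only a stand-in for whatever the AVX2 code does with the float bits. The sketch is restricted to `src > 0`; it assumes zero and negative inputs are handled separately.

```java
public class LzcntViaFloatSketch {
    // Naive variant: convert directly and derive the count from the exponent.
    // Rounding during int-to-float conversion can bump the exponent.
    static int naive(int src) {
        return 31 - Math.getExponent((float) src);
    }

    // Masked variant: keep the leading 1, clear the bit to its right,
    // so rounding can no longer carry into the next power of two.
    static int masked(int src) {
        int t = src >>> 1;
        t = ~t & src;
        return 31 - Math.getExponent((float) t);
    }

    public static void main(String[] args) {
        int src = 0x01FFFFFF;
        System.out.println("expected: " + Integer.numberOfLeadingZeros(src));
        System.out.println("naive:    " + naive(src));
        System.out.println("masked:   " + masked(src));
    }
}
```

For `0x01FFFFFF` the naive variant rounds up to 2^25 and reports one leading zero too few, while the masked variant agrees with `Integer.numberOfLeadingZeros`.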
------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2660300425 From dlong at openjdk.org Fri Feb 14 22:41:13 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 14 Feb 2025 22:41:13 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v2] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Wed, 12 Feb 2025 21:09:31 GMT, Dean Long wrote: >> When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. >> >> In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. >> >> Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. I need help testing those asserts for arm32, ppc, riscv, and s390. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > fix bounds checks > I think you can make the assertion a little stricter like this [reinrich at 9c3c8a3](https://github.com/reinrich/jdk/commit/9c3c8a33a29b9ae6c4c703992b306dc0cbbcd2f0). Regarding this stricter version, why are you using is_bottom_frame instead of is_top_frame? The deoptimization code seems to name the most recent leaf frame "top". That sounds like what frame::top_ijava_frame_abi_size is for too. Thanks for the review, Vladimir. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23557#issuecomment-2660395298 PR Comment: https://git.openjdk.org/jdk/pull/23557#issuecomment-2660398727 From chagedorn at openjdk.org Fri Feb 14 23:13:12 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 14 Feb 2025 23:13:12 GMT Subject: RFR: 8349915: CTW: Lots of level 3 compiles are done at level 2 after JDK-8348570 [v2] In-Reply-To: References: Message-ID: <_xyEVieq-sMscSn6pUI4yvwDuRHVqs-6JF10iDCNkqY=.39a4115c-acce-4b25-850f-42de94aff0cf@github.com> On Wed, 12 Feb 2025 19:50:51 GMT, Aleksey Shipilev wrote: >> Noticed this in manual CTW runs after [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) that lots and lots of methods are compiled at level 2 instead of requested level 3: >> >> >> ... >> [97] javax.enterprise.deploy.shared.ActionType::getValue() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getOffset() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getEnumValueTable() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getStringTable() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getActionType(int) WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::toString() WARNING compilation level = 2, but not 3 >> [99] javax.enterprise.deploy.shared.DConfigBeanVersionType >> [98] javax.enterprise.deploy.shared.CommandType::toString() WARNING compilation level = 2, but not 3 >> [98] javax.enterprise.deploy.shared.CommandType::getOffset() WARNING compilation level = 2, but not 3 >> ... 
>> >> >> I narrowed it down to level downgrade in compilation policy here: >> https://github.com/openjdk/jdk/blob/ed17c55ea34b3b6009dab11d64f21e0b7af3d701/src/hotspot/share/compiler/compilationPolicy.cpp#L677 >> >> [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) enters here, because we mark all methods as having profiles to extend the CTW scope. So now `is_method_profiled(max_method_h)` is `true` and downgrade happens. There is already check for `!Arguments::is_compiler_only()` there, so I think we better exclude CTW from this downgrade as well. >> >> I looked at possibly making this kind of downgrade fatal in CTW runner, but the error propagation there is not simple. I filed [JDK-8349917](https://bugs.openjdk.org/browse/JDK-8349917) if anyone want to take a stab on it. >> >> I looked at other `set_comp_level()` uses in Hotspot, and this is the only place where it is called. So I presume we have caught all places where this downgrade can happen. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, eyeballing some manual CTW run results >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` passes > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Shortcut CTW tasks directly Sorry only seen this now. Looks good to me, too! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23589#pullrequestreview-2618985350 From kvn at openjdk.org Fri Feb 14 23:14:18 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Feb 2025 23:14:18 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. 
It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument I addressed most of the @xmas92 and @dean-long comments and am working on avoiding the `_v` suffix. Thank you, Dean, for the review. ------------- PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2618707275 PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2660443983 From kvn at openjdk.org Fri Feb 14 23:14:20 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Feb 2025 23:14:20 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 08:15:16 GMT, Axel Boldt-Christmas wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero VM build > > src/hotspot/share/code/codeBlob.hpp line 140: > >> 138: instance->print_value_on_nv(st); >> 139: } >> 140: }; > > I wonder why the base class is not abstract. AFAICT `print_value_on` is unreachable and `print_on` is only used by `DeoptimizationBlob::Vptr`, which also seems like a behavioural change, as before this patch calling `print_on` on a `DeoptimizationBlob` object would dispatch to `SingletonBlob::print_on`, not `CodeBlob::print_on`.
> > Suggestion: > > struct Vptr { > virtual void print_on(const CodeBlob* instance, outputStream* st) const = 0; > virtual void print_value_on(const CodeBlob* instance, outputStream* st) const = 0; > }; done > src/hotspot/share/code/codeBlob.hpp line 339: > >> 337: void print_value_on(outputStream* st) const; >> 338: >> 339: class Vptr : public CodeBlob::Vptr { > > I wonder if these should share the same type hierarchy as their container class. This would also solve the issue I noted in my other comment about not calling the correct `print_on`. > Suggestion: > > class Vptr : public RuntimeBlob::Vptr { Fixed > src/hotspot/share/code/codeBlob.hpp line 427: > >> 425: void print_value_on(outputStream* st) const; >> 426: >> 427: class Vptr : public CodeBlob::Vptr { > > Suggestion: > > class Vptr : public RuntimeBlob::Vptr { Fixed > src/hotspot/share/code/codeBlob.hpp line 467: > >> 465: void print_value_on(outputStream* st) const; >> 466: >> 467: class Vptr : public CodeBlob::Vptr { > > Suggestion: > > class Vptr : public RuntimeBlob::Vptr { Fixed > src/hotspot/share/code/codeBlob.hpp line 553: > >> 551: void print_value_on(outputStream* st) const; >> 552: >> 553: class Vptr : public CodeBlob::Vptr { > > This one specifically > Suggestion: > > class Vptr : public SingletonBlob::Vptr { fixed > src/hotspot/share/code/codeBlob.hpp line 679: > >> 677: void print_value_on(outputStream* st) const; >> 678: >> 679: class Vptr : public CodeBlob::Vptr { > > Suggestion: > > class Vptr : public RuntimeBlob::Vptr { fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956799673 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956801833 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956801994 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956802109 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956803039 PR Review Comment:
https://git.openjdk.org/jdk/pull/23533#discussion_r1956827486 From kvn at openjdk.org Fri Feb 14 23:14:22 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Feb 2025 23:14:22 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <_9qiqpCFRxCMY4nADw0lqrNuOZYIKUpeY_7FYyoQWC8=.78588553-bede-45b1-bf2d-5ad306b81e29@github.com> On Fri, 14 Feb 2025 00:08:35 GMT, Dean Long wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> rename SA argument > > src/hotspot/share/code/codeBlob.hpp line 669: > >> 667: >> 668: jobject receiver() { return _receiver; } >> 669: ByteSize frame_data_offset() { return _frame_data_offset; } > > `frame_data_offset()` seems to be unused. removed > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/c1/Runtime1.java line 65: > >> 63: public CodeBlob blobFor(int id) { >> 64: Address blobAddr = blobsField.getStaticFieldAddress().getAddressAt(id * VM.getVM().getAddressSize()); >> 65: return VM.getVM().getCodeCache().createCodeBlobWrapper(blobAddr); > > We don't need to change all the callers if we keep a 1-arg version of createCodeBlobWrapper(): > > public CodeBlob createCodeBlobWrapper(Address codeBlobAddr) { > return createCodeBlobWrapper(codeBlobAddr, codeBlobAddr); > } This is the only place where the arguments are the same. In the other two, the arguments are different.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956672379 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956667806 From kvn at openjdk.org Fri Feb 14 23:14:23 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Feb 2025 23:14:23 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <07aI9gwcVtc89Bte9DRQ6VwmCfhcBJJQlrXhxkRRgX0=.97d4a1cc-92a2-43dc-8516-2433eca67263@github.com> On Thu, 13 Feb 2025 19:27:19 GMT, Chris Plummer wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> rename SA argument > > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 97: > >> 95: // cbAddr - address of a code blob >> 96: // cbPC - address inside of a code blob >> 97: public CodeBlob createCodeBlobWrapper(Address cbAddr, Address cbPC) { > > Can you change findBlobUnsafe() above also? That's where the naming problem originated. After some thoughts I think `PC` is not usually used by us. I renamed `cbAddr` to `cbStart` and `cbPC`/`start` to `addr` in this whole file. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956664966 From dlong at openjdk.org Sat Feb 15 01:52:18 2025 From: dlong at openjdk.org (Dean Long) Date: Sat, 15 Feb 2025 01:52:18 GMT Subject: RFR: 8350097: Make Compilation::current() and Compile::current() safer In-Reply-To: <4ELV07PUQEFeOLgzqbV3OoGjHVny5paw0Gk0awuJ3h0=.99faedbd-4909-4d8e-93eb-75d5697e797f@github.com> References: <4ELV07PUQEFeOLgzqbV3OoGjHVny5paw0Gk0awuJ3h0=.99faedbd-4909-4d8e-93eb-75d5697e797f@github.com> Message-ID: <58vLpCESZNRppmEvNOWy252l_uFxiu9iwE7plEtDopo=.91b6eeef-f902-474b-bf10-8684425647a4@github.com> On Fri, 14 Feb 2025 14:48:34 GMT, Thomas Stuefe wrote: > Somewhat trivial. 
> > I recently hunted a bug for an hour until I realized that I had accessed ciEnv::compiler_data() as C2 `Compile` when, in fact, it was C1 `Compilation`. Stupid mistake, but an assert is easy to do and saves time. If we derive C2 Compile and C1 Compilation from a common superclass, then ciEnv::compiler_data() could return that superclass and self-identify using a type field or virtual function. When looking into how we handle failure messages and how complicated it is, I thought moving some fields into a common superclass would be useful. Eventually we could even look into making ciEnv the common superclass, because the lifetimes are almost identical. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23635#issuecomment-2660623476 From kvn at openjdk.org Sat Feb 15 02:08:20 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 15 Feb 2025 02:08:20 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 23:01:24 GMT, Dean Long wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> rename SA argument > > src/hotspot/share/runtime/vframe.inline.hpp line 178: > >> 176: INTPTR_FORMAT " not found or invalid at %d", >> 177: p2i(_frame.pc()), decode_offset); >> 178: nm()->print_on_v(&ss); > > I suggest removing _v suffix to reduce changes and match existing naming. Done. Testing now. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956985708 From kvn at openjdk.org Sat Feb 15 06:13:57 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 15 Feb 2025 06:13:57 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v9] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Address comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/61fdee68..89a383e5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=07-08 Stats: 115 lines in 12 files changed: 7 ins; 7 del; 101 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Sat Feb 15 06:29:14 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 15 Feb 2025 06:29:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v9] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: 
<2aYXBHyZE83suQFtY_POyft2gbRwwF_Xf_qajA62Pgw=.1fe1143c-33c5-4e78-b691-3f85f176c598@github.com> On Sat, 15 Feb 2025 06:13:57 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Address comments I removed the `_v` suffix from the `CodeBlob::print*_on(st)` methods to reduce the scope of VM changes, but I had to add an `_impl` suffix to these methods in the CodeBlob subclasses. I renamed `nmethod::print_on(st, msg)` to `print_on_with_msg(st, msg)` to avoid a naming conflict that C++ complains about; this caused a change in `dependencyContext.cpp`. I made the `CodeBlob::Vptr` class abstract as suggested. I added an empty `Vptr` class to `RuntimeBlob` because it is referenced in subclasses, and corrected the base classes of the `Vptr` classes in the subclasses to avoid the mistakes @xmas92 pointed out. I also renamed some arguments in the SA `CodeCache.java` as requested. Tier1-5 testing passed. Ready for a new round of reviews.
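
For readers following the thread, the overall pattern under review can be sketched roughly as below: the blob type itself keeps no virtual methods, and a kind field selects a stateless helper object that performs the virtual dispatch. This is a simplified illustration under assumed names (`Blob`, `BlobKind`, `BlobVptr`, `describe`), not the HotSpot code; using `std::is_polymorphic` for the static assert is likewise an assumption about how such a check could be written.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical, simplified stand-ins; these are not the real HotSpot classes.
enum class BlobKind : unsigned char { Runtime, Singleton, Deoptimization };

struct Blob;

// Stateless helper hierarchy that carries the virtual methods instead of Blob.
// The base is abstract, as suggested in the review.
struct BlobVptr {
  virtual ~BlobVptr() = default;
  virtual std::string describe(const Blob* instance) const = 0;
};

// The blob has no virtual methods, so its bytes hold no hidden vtable pointer
// that would need patching after the object is saved and restored.
struct Blob {
  BlobKind kind;
  int size;

  const BlobVptr* vptr() const;  // helper object selected by kind
  std::string describe() const { return vptr()->describe(this); }
};

static_assert(!std::is_polymorphic<Blob>::value,
              "Blob must not have virtual methods");

struct RuntimeVptr : BlobVptr {
  std::string describe(const Blob* b) const override {
    return "runtime blob, size " + std::to_string(b->size);
  }
};

struct SingletonVptr : RuntimeVptr {
  std::string describe(const Blob* b) const override {
    return "singleton " + RuntimeVptr::describe(b);
  }
};

// Extends SingletonVptr (not the root BlobVptr), so the singleton behaviour
// is inherited, mirroring the hierarchy of the container classes.
struct DeoptimizationVptr : SingletonVptr {};

inline const BlobVptr* Blob::vptr() const {
  static const RuntimeVptr runtime{};
  static const SingletonVptr singleton{};
  static const DeoptimizationVptr deopt{};
  static const BlobVptr* const table[] = { &runtime, &singleton, &deopt };
  return table[static_cast<std::size_t>(kind)];
}
```

In this sketch, `Blob{BlobKind::Deoptimization, 8}.describe()` goes through `SingletonVptr::describe`, which is the dispatch behaviour the review comments asked to preserve.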
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2660770028 From kvn at openjdk.org Sat Feb 15 06:34:56 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 15 Feb 2025 06:34:56 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v10] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Remove commented lines left by mistake ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/89a383e5..3fdf1c81 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=08-09 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From stuefe at openjdk.org Sat Feb 15 06:42:09 2025 From: stuefe at openjdk.org (Thomas Stuefe) Date: Sat, 15 Feb 2025 06:42:09 GMT Subject: RFR: 8350097: Make Compilation::current() and Compile::current() safer In-Reply-To: <58vLpCESZNRppmEvNOWy252l_uFxiu9iwE7plEtDopo=.91b6eeef-f902-474b-bf10-8684425647a4@github.com> References: 
<4ELV07PUQEFeOLgzqbV3OoGjHVny5paw0Gk0awuJ3h0=.99faedbd-4909-4d8e-93eb-75d5697e797f@github.com> <58vLpCESZNRppmEvNOWy252l_uFxiu9iwE7plEtDopo=.91b6eeef-f902-474b-bf10-8684425647a4@github.com> Message-ID: <7h7JAuQ2R7JssyJJ3noz55YwqYDk5r2zRdadPaGFkjk=.88ee79c2-4e07-43a0-ba20-3e5460d8ead7@github.com> On Sat, 15 Feb 2025 01:49:32 GMT, Dean Long wrote: > If we derive C2 Compile and C1 Compilation from a common superclass, then ciEnv::compiler_data() could return that superclass and self-identify using a type field or virtual function. When looking into how we handle failure messages and how complicated it is, I thought moving some fields into a common superclass would be useful. Eventually we could even look into making ciEnv the common superclass, because the lifetimes are almost identical. I thought about that, but Compile inherits from Phase, so any common superclass becomes one of Phase, too. I did not want to change the inheritance structure to deal with such a tiny problem. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23635#issuecomment-2660773790 From stuefe at openjdk.org Sat Feb 15 07:14:24 2025 From: stuefe at openjdk.org (Thomas Stuefe) Date: Sat, 15 Feb 2025 07:14:24 GMT Subject: RFR: 8350097: Make Compilation::current() and Compile::current() safer [v2] In-Reply-To: <4ELV07PUQEFeOLgzqbV3OoGjHVny5paw0Gk0awuJ3h0=.99faedbd-4909-4d8e-93eb-75d5697e797f@github.com> References: <4ELV07PUQEFeOLgzqbV3OoGjHVny5paw0Gk0awuJ3h0=.99faedbd-4909-4d8e-93eb-75d5697e797f@github.com> Message-ID: <4gkyBqTfGrrSMDXw96A3V9H-bLHV0D5VFrvmTzu6k3A=.c7094806-2113-4299-b5bf-1de73b9fb05a@github.com> > Somewhat trivial. > > I recently hunted a bug for an hour until I realized that I had accessed ciEnv::compiler_data() as C2 `Compile` when, in fact, it was C1 `Compilation`. Stupid mistake, but an assert is easy to do and saves time. 
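
As a rough illustration of the kind of check described above (not the actual patch), the sketch below stores an untyped pointer together with a tag and asserts on the tag in each typed accessor. `Env`, `C1State`, `C2State`, and `CompilerType` are hypothetical names invented for this example; they merely stand in for `ciEnv`, `Compilation`, and `Compile`.

```cpp
#include <cassert>

// Hypothetical stand-ins for the two unrelated compiler state classes.
struct C1State { int id; };
struct C2State { int id; };

enum class CompilerType { None, C1, C2 };

// Holds an untyped pointer plus a tag recording what it really points to.
class Env {
  void* _data = nullptr;
  CompilerType _type = CompilerType::None;

public:
  void set_data(C1State* s) { _data = s; _type = CompilerType::C1; }
  void set_data(C2State* s) { _data = s; _type = CompilerType::C2; }
  CompilerType type() const { return _type; }

  // Checked accessors: in a build with asserts enabled, a mismatch stops
  // here instead of silently reinterpreting the other compiler's object.
  C1State* c1_data() const {
    assert(_type == CompilerType::C1 && "compiler data is not C1 state");
    return static_cast<C1State*>(_data);
  }
  C2State* c2_data() const {
    assert(_type == CompilerType::C2 && "compiler data is not C2 state");
    return static_cast<C2State*>(_data);
  }
};
```

With this shape, calling `c2_data()` on an `Env` that was set up with a `C1State` fails the assert at the point of the wrong access rather than much later.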
Thomas Stuefe has updated the pull request incrementally with two additional commits since the last revision: - redo - Revert "start" This reverts commit e370e14abf2ee25019ed13cde9edfa24047d982d. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23635/files - new: https://git.openjdk.org/jdk/pull/23635/files/a103b455..b4b1bbd3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23635&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23635&range=00-01 Stats: 52 lines in 6 files changed: 21 ins; 23 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/23635.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23635/head:pull/23635 PR: https://git.openjdk.org/jdk/pull/23635 From shade at openjdk.org Sat Feb 15 07:25:15 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Sat, 15 Feb 2025 07:25:15 GMT Subject: RFR: 8349915: CTW: Lots of level 3 compiles are done at level 2 after JDK-8348570 [v2] In-Reply-To: References: Message-ID: <5QEocvkUTi1ASl92dt95LqBhB-PLAAPDp28KWnzXtdg=.b637ad48-aca0-4248-8715-3b711f4acc5f@github.com> On Wed, 12 Feb 2025 19:50:51 GMT, Aleksey Shipilev wrote: >> Noticed this in manual CTW runs after [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) that lots and lots of methods are compiled at level 2 instead of requested level 3: >> >> >> ... 
>> [97] javax.enterprise.deploy.shared.ActionType::getValue() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getOffset() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getEnumValueTable() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getStringTable() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getActionType(int) WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::toString() WARNING compilation level = 2, but not 3 >> [99] javax.enterprise.deploy.shared.DConfigBeanVersionType >> [98] javax.enterprise.deploy.shared.CommandType::toString() WARNING compilation level = 2, but not 3 >> [98] javax.enterprise.deploy.shared.CommandType::getOffset() WARNING compilation level = 2, but not 3 >> ... >> >> >> I narrowed it down to level downgrade in compilation policy here: >> https://github.com/openjdk/jdk/blob/ed17c55ea34b3b6009dab11d64f21e0b7af3d701/src/hotspot/share/compiler/compilationPolicy.cpp#L677 >> >> [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) enters here, because we mark all methods as having profiles to extend the CTW scope. So now `is_method_profiled(max_method_h)` is `true` and downgrade happens. There is already check for `!Arguments::is_compiler_only()` there, so I think we better exclude CTW from this downgrade as well. >> >> I looked at possibly making this kind of downgrade fatal in CTW runner, but the error propagation there is not simple. I filed [JDK-8349917](https://bugs.openjdk.org/browse/JDK-8349917) if anyone want to take a stab on it. >> >> I looked at other `set_comp_level()` uses in Hotspot, and this is the only place where it is called. So I presume we have caught all places where this downgrade can happen. 
>> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, eyeballing some manual CTW run results >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` passes > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Shortcut CTW tasks directly Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23589#issuecomment-2660788723 From shade at openjdk.org Sat Feb 15 07:25:16 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Sat, 15 Feb 2025 07:25:16 GMT Subject: Integrated: 8349915: CTW: Lots of level 3 compiles are done at level 2 after JDK-8348570 In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 14:47:32 GMT, Aleksey Shipilev wrote: > Noticed this in manual CTW runs after [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) that lots and lots of methods are compiled at level 2 instead of requested level 3: > > > ... > [97] javax.enterprise.deploy.shared.ActionType::getValue() WARNING compilation level = 2, but not 3 > [97] javax.enterprise.deploy.shared.ActionType::getOffset() WARNING compilation level = 2, but not 3 > [97] javax.enterprise.deploy.shared.ActionType::getEnumValueTable() WARNING compilation level = 2, but not 3 > [97] javax.enterprise.deploy.shared.ActionType::getStringTable() WARNING compilation level = 2, but not 3 > [97] javax.enterprise.deploy.shared.ActionType::getActionType(int) WARNING compilation level = 2, but not 3 > [97] javax.enterprise.deploy.shared.ActionType::toString() WARNING compilation level = 2, but not 3 > [99] javax.enterprise.deploy.shared.DConfigBeanVersionType > [98] javax.enterprise.deploy.shared.CommandType::toString() WARNING compilation level = 2, but not 3 > [98] javax.enterprise.deploy.shared.CommandType::getOffset() WARNING compilation level = 2, but not 3 > ... 
> > > I narrowed it down to level downgrade in compilation policy here: > https://github.com/openjdk/jdk/blob/ed17c55ea34b3b6009dab11d64f21e0b7af3d701/src/hotspot/share/compiler/compilationPolicy.cpp#L677 > > [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) enters here, because we mark all methods as having profiles to extend the CTW scope. So now `is_method_profiled(max_method_h)` is `true` and downgrade happens. There is already check for `!Arguments::is_compiler_only()` there, so I think we better exclude CTW from this downgrade as well. > > I looked at possibly making this kind of downgrade fatal in CTW runner, but the error propagation there is not simple. I filed [JDK-8349917](https://bugs.openjdk.org/browse/JDK-8349917) if anyone want to take a stab on it. > > I looked at other `set_comp_level()` uses in Hotspot, and this is the only place where it is called. So I presume we have caught all places where this downgrade can happen. > > Additional testing: > - [x] Linux x86_64 server fastdebug, eyeballing some manual CTW run results > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` passes This pull request has now been integrated. Changeset: 62345364 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/623453647a8a387b2d8d375cb18b33666abc16ee Stats: 7 lines in 2 files changed: 7 ins; 0 del; 0 mod 8349915: CTW: Lots of level 3 compiles are done at level 2 after JDK-8348570 Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/23589 From rgiulietti at openjdk.org Sun Feb 16 09:09:17 2025 From: rgiulietti at openjdk.org (Raffaello Giulietti) Date: Sun, 16 Feb 2025 09:09:17 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:45:34 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. 
Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Improve explanation of logic Here's the pseudo-code for an implementation with 13 vector instructions. Let `fp` denote `float` or `double`. Correspondingly, let `P` = 24, 53 (precision); `L` = 5, 6; `W` = 2^`L` (lane width). The code below is pseudo Java and describes `W`-bit lane operations. Note that each line corresponds to one vector instruction. Further, there's no need for `xtmp3`.

// Convert src to floating-point.
// First ensure that the bit to the right of the leading 1, if any, is 0.
dst = src >>> 1
dst = ~dst & src
// If available, prefer a conversion instruction that interprets dst as unsigned.
// Otherwise, a correction is needed later (see further down the code).
dst = fpToRawBits((fp) dst)

// Set xtmp1 = -1 (all one-bits) for later use
xtmp1 = -1

// Extract the biased exponent
xtmp2 = xtmp1 >>> P
dst = dst >>> (P - 1)
dst = xtmp2 & dst

// Compute the exponent
// Set xtmp2 = BIAS
xtmp2 = xtmp1 >>> (P + 1)
dst = dst - xtmp2

// Set xtmp2 = W - 1
xtmp2 = xtmp1 >>> (W - L)

// Adjust for special cases.
// We have: src == 0 iff dst < 0
// When src == 0, we force the exponent to -1
dst = dst >= 0 ? dst : xtmp1 // blend
// When src < 0, we force the exponent to W - 1.
// This is only needed if the conversion to floating-point above interprets its argument as signed.
dst = src >= 0 ? dst : xtmp2 // blend

// final result
dst = xtmp2 - dst

------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2661332497 From alanb at openjdk.org Sun Feb 16 15:54:13 2025 From: alanb at openjdk.org (Alan Bateman) Date: Sun, 16 Feb 2025 15:54:13 GMT Subject: RFR: 8349915: CTW: Lots of level 3 compiles are done at level 2 after JDK-8348570 [v2] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 19:50:51 GMT, Aleksey Shipilev wrote: >> Noticed this in manual CTW runs after [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) that lots and lots of methods are compiled at level 2 instead of requested level 3: >> >> >> ...
>> [97] javax.enterprise.deploy.shared.ActionType::getValue() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getOffset() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getEnumValueTable() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getStringTable() WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::getActionType(int) WARNING compilation level = 2, but not 3 >> [97] javax.enterprise.deploy.shared.ActionType::toString() WARNING compilation level = 2, but not 3 >> [99] javax.enterprise.deploy.shared.DConfigBeanVersionType >> [98] javax.enterprise.deploy.shared.CommandType::toString() WARNING compilation level = 2, but not 3 >> [98] javax.enterprise.deploy.shared.CommandType::getOffset() WARNING compilation level = 2, but not 3 >> ... >> >> >> I narrowed it down to level downgrade in compilation policy here: >> https://github.com/openjdk/jdk/blob/ed17c55ea34b3b6009dab11d64f21e0b7af3d701/src/hotspot/share/compiler/compilationPolicy.cpp#L677 >> >> [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) enters here, because we mark all methods as having profiles to extend the CTW scope. So now `is_method_profiled(max_method_h)` is `true` and downgrade happens. There is already check for `!Arguments::is_compiler_only()` there, so I think we better exclude CTW from this downgrade as well. >> >> I looked at possibly making this kind of downgrade fatal in CTW runner, but the error propagation there is not simple. I filed [JDK-8349917](https://bugs.openjdk.org/browse/JDK-8349917) if anyone want to take a stab on it. >> >> I looked at other `set_comp_level()` uses in Hotspot, and this is the only place where it is called. So I presume we have caught all places where this downgrade can happen. 
>> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, eyeballing some manual CTW run results >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` passes > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Shortcut CTW tasks directly compiler/tiered/Level2RecompilationTest.java is now failing, tracked by [JDK-8350159](https://bugs.openjdk.org/browse/JDK-8350159). ------------- PR Comment: https://git.openjdk.org/jdk/pull/23589#issuecomment-2661494675 From gcao at openjdk.org Mon Feb 17 00:33:11 2025 From: gcao at openjdk.org (Gui Cao) Date: Mon, 17 Feb 2025 00:33:11 GMT Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic [v3] In-Reply-To: References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> Message-ID: On Thu, 13 Feb 2025 11:12:56 GMT, Gui Cao wrote: >> Follow this patch https://github.com/openjdk/jdk/pull/22491, RISC-V implementation for Class.isInstance intrinsic. >> >> >> ### JMH numbers (tested on milkv megrez with hotspot client build): >> >> #### before this patch: >> >> Benchmark Mode Cnt Score Error Units >> SecondarySupersLookup.testNegative00 avgt 15 48.589 ± 0.981 ns/op >> SecondarySupersLookup.testNegative01 avgt 15 48.577 ± 0.297 ns/op >> SecondarySupersLookup.testNegative02 avgt 15 48.760 ± 0.740 ns/op >> SecondarySupersLookup.testNegative03 avgt 15 48.442 ± 0.029 ns/op >> SecondarySupersLookup.testNegative04 avgt 15 48.453 ± 0.095 ns/op >> SecondarySupersLookup.testNegative05 avgt 15 48.435 ± 0.025 ns/op >> SecondarySupersLookup.testNegative06 avgt 15 48.540 ± 0.476 ns/op >> SecondarySupersLookup.testNegative07 avgt 15 48.452 ± 0.032 ns/op >> SecondarySupersLookup.testNegative08 avgt 15 48.466 ± 0.034 ns/op >> SecondarySupersLookup.testNegative09 avgt 15 48.478 ± 0.132 ns/op >> SecondarySupersLookup.testNegative10 avgt 15 48.435 ±
0.032 ns/op >> SecondarySupersLookup.testNegative16 avgt 15 48.440 ± 0.027 ns/op >> SecondarySupersLookup.testNegative20 avgt 15 47.977 ± 0.989 ns/op >> SecondarySupersLookup.testNegative30 avgt 15 48.655 ± 0.487 ns/op >> SecondarySupersLookup.testNegative32 avgt 15 48.566 ± 0.251 ns/op >> SecondarySupersLookup.testNegative40 avgt 15 48.513 ± 0.196 ns/op >> SecondarySupersLookup.testNegative50 avgt 15 48.454 ± 0.075 ns/op >> SecondarySupersLookup.testNegative55 avgt 15 71.670 ± 1.632 ns/op >> SecondarySupersLookup.testNegative56 avgt 15 70.923 ± 1.679 ns/op >> SecondarySupersLookup.testNegative57 avgt 15 70.140 ± 0.048 ns/op >> SecondarySupersLookup.testNegative58 avgt 15 70.473 ± 0.726 ns/op >> SecondarySupersLookup.testNegative59 avgt 15 70.127 ± 0.022 ns/op >> SecondarySupersLookup.testNegative60 avgt 15 82.525 ± 1.178 ns/op >> SecondarySupersLookup.testNegative61 avgt 15 81.647 ± 0.758 ns/op >> SecondarySupersLookup.testNegative62 avgt 15 82.347 ± 1.943 ns/op >> SecondarySupersLookup.testNegative63 avgt 15 129.188 ± 1.550 ns/op >> SecondarySupersLookup.testNegative64 avgt 15 130.274 ± 1.668 ns/op >> SecondarySupersLookup.testPositive01 avgt 15 63.390 ± 0.222 ns/op >> SecondarySupersLookup.testPositive02 avgt 15 63.435 ± 0.027 ns/op >> SecondarySupersLookup.testPositive03 avgt 15 63.... > > Gui Cao has updated the pull request incrementally with one additional commit since the last revision: > > Update for RealFYang's comment Thanks all for the review.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23551#issuecomment-2661709009 From duke at openjdk.org Mon Feb 17 00:33:11 2025 From: duke at openjdk.org (duke) Date: Mon, 17 Feb 2025 00:33:11 GMT Subject: RFR: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic [v3] In-Reply-To: References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> Message-ID: On Thu, 13 Feb 2025 11:12:56 GMT, Gui Cao wrote: >> Follow this patch https://github.com/openjdk/jdk/pull/22491, RISC-V implementation for Class.isInstance intrinsic. >> >> >> ### JMH numbers (tested on milkv megrez with hotspot client build): >> >> #### before this patch: >> >> Benchmark Mode Cnt Score Error Units >> SecondarySupersLookup.testNegative00 avgt 15 48.589 ? 0.981 ns/op >> SecondarySupersLookup.testNegative01 avgt 15 48.577 ? 0.297 ns/op >> SecondarySupersLookup.testNegative02 avgt 15 48.760 ? 0.740 ns/op >> SecondarySupersLookup.testNegative03 avgt 15 48.442 ? 0.029 ns/op >> SecondarySupersLookup.testNegative04 avgt 15 48.453 ? 0.095 ns/op >> SecondarySupersLookup.testNegative05 avgt 15 48.435 ? 0.025 ns/op >> SecondarySupersLookup.testNegative06 avgt 15 48.540 ? 0.476 ns/op >> SecondarySupersLookup.testNegative07 avgt 15 48.452 ? 0.032 ns/op >> SecondarySupersLookup.testNegative08 avgt 15 48.466 ? 0.034 ns/op >> SecondarySupersLookup.testNegative09 avgt 15 48.478 ? 0.132 ns/op >> SecondarySupersLookup.testNegative10 avgt 15 48.435 ? 0.032 ns/op >> SecondarySupersLookup.testNegative16 avgt 15 48.440 ? 0.027 ns/op >> SecondarySupersLookup.testNegative20 avgt 15 47.977 ? 0.989 ns/op >> SecondarySupersLookup.testNegative30 avgt 15 48.655 ? 0.487 ns/op >> SecondarySupersLookup.testNegative32 avgt 15 48.566 ? 0.251 ns/op >> SecondarySupersLookup.testNegative40 avgt 15 48.513 ? 0.196 ns/op >> SecondarySupersLookup.testNegative50 avgt 15 48.454 ? 0.075 ns/op >> SecondarySupersLookup.testNegative55 avgt 15 71.670 ? 
1.632 ns/op >> SecondarySupersLookup.testNegative56 avgt 15 70.923 ? 1.679 ns/op >> SecondarySupersLookup.testNegative57 avgt 15 70.140 ? 0.048 ns/op >> SecondarySupersLookup.testNegative58 avgt 15 70.473 ? 0.726 ns/op >> SecondarySupersLookup.testNegative59 avgt 15 70.127 ? 0.022 ns/op >> SecondarySupersLookup.testNegative60 avgt 15 82.525 ? 1.178 ns/op >> SecondarySupersLookup.testNegative61 avgt 15 81.647 ? 0.758 ns/op >> SecondarySupersLookup.testNegative62 avgt 15 82.347 ? 1.943 ns/op >> SecondarySupersLookup.testNegative63 avgt 15 129.188 ? 1.550 ns/op >> SecondarySupersLookup.testNegative64 avgt 15 130.274 ? 1.668 ns/op >> SecondarySupersLookup.testPositive01 avgt 15 63.390 ? 0.222 ns/op >> SecondarySupersLookup.testPositive02 avgt 15 63.435 ? 0.027 ns/op >> SecondarySupersLookup.testPositive03 avgt 15 63.... > > Gui Cao has updated the pull request incrementally with one additional commit since the last revision: > > Update for RealFYang's comment @zifeihan Your change (at version dd66411637793e2640ee5b954296b85166911f26) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23551#issuecomment-2661709975 From jkarthikeyan at openjdk.org Mon Feb 17 03:27:01 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 17 Feb 2025 03:27:01 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v4] In-Reply-To: References: Message-ID: > Hi all, > This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. 
As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. > > This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. > > Reviews would be appreciated! Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Comments from review, add exhaustive test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23579/files - new: https://git.openjdk.org/jdk/pull/23579/files/36228aea..27ca15c5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23579&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23579&range=02-03 Stats: 138 lines in 2 files changed: 128 ins; 3 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/23579.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23579/head:pull/23579 PR: https://git.openjdk.org/jdk/pull/23579 From jkarthikeyan at openjdk.org Mon Feb 17 03:44:11 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 17 Feb 2025 03:44:11 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: References: Message-ID: <6yD3oEDRMPClNVVkEi64IAbnT4fOiMbgCjx6xWXU3bk=.1cb181b8-619c-47bd-91cf-2a230566442f@github.com> On Sun, 16 Feb 2025 09:06:20 GMT, Raffaello Giulietti wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Improve explanation of logic > > Here's the pseudo-code for an implementation with 13 vector instructions. > > Let `fp` denote `float` or `double`. > Correspondingly, let `P` = 24, 53 (precision); `L` = 5, 6; `W` = 2^`L` (lane width). 
>
> The code below is pseudo Java and describes `W`-bit lane operations.
> Note that each line corresponds to one vector instruction.
> Further, there's no need for `xtmp3`.
>
>
>     // Convert src to floating-point.
>     // First ensure that the bit to the right of the leading 1, if any, is 0.
>     dst = src >>> 1
>     dst = ~dst & src
>     // If available, prefer a conversion instruction that interprets dst as unsigned.
>     // Otherwise, a correction is needed later (see further down the code).
>     dst = fpToRawBits((fp) dst)
>
>     // Set xtmp1 = -1 (all one-bits) for later use
>     xtmp1 = -1
>
>     // Extract the biased exponent
>     xtmp2 = xtmp1 >>> P
>     dst = dst >>> (P - 1)
>     dst = xtmp2 & dst
>
>     // Compute the exponent
>     // Set xtmp2 = BIAS
>     xtmp2 = xtmp1 >>> (P + 1)
>     dst = dst - xtmp2
>
>     // Set xtmp2 = W - 1
>     xtmp2 = xtmp1 >>> (W - L)
>
>     // Adjust for special cases.
>
>     // We have: src == 0 iff dst < 0
>     // When src == 0, we force the exponent to -1
>     dst = dst >= 0 ? dst : xtmp1 // blend
>
>     // When src < 0, we force the exponent to W - 1.
>     // This is only needed if the conversion to floating-point above interprets its argument as signed.
>     dst = src >= 0 ? dst : xtmp2 // blend
>
>     // final result
>     dst = xtmp2 - dst

@rgiulietti Shifting by 1 instead of 24 is a really good idea; it makes showing the validity a lot simpler, as you mention. I've applied the suggestion in the latest commit. The updated instruction sequence is also very interesting, and I'd like to take a look at it in a followup RFE. I was planning on taking a closer look at the long intrinsic after this patch, since it doesn't use the floating point trick that int does and I was very curious to see what the performance would be like with it.

@TobiHartmann I've pushed an adapted version of your test that checks for `numberOfLeadingZeros`/`numberOfTrailingZeros` correctness for int and long. Let me know what you think!
-------------

PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2661896820

From jkarthikeyan at openjdk.org  Mon Feb 17 05:00:52 2025
From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan)
Date: Mon, 17 Feb 2025 05:00:52 GMT
Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v5]
In-Reply-To: 
References: 
Message-ID: <9Q3s1n66IsWqU4M-jB_Hb4efv6PsTAiNCgAB_6bsUIo=.b46ad63d-f612-48cb-9c5e-88029e43942e@github.com>

> Hi all,
> This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine:
>
>
>                                                  Baseline                  Patch
> Benchmark                 (SIZE)  Mode  Cnt    Score    Error  Units   Score   Error  Units  Improvement
> VectorSubword.intToByte     1024  avgt   12  200.049 ± 19.787  ns/op  56.228 ± 3.535  ns/op  (3.56x)
> VectorSubword.intToShort    1024  avgt   12  179.826 ±  1.539  ns/op  43.332 ± 1.166  ns/op  (4.15x)
> VectorSubword.shortToByte   1024  avgt   12  245.580 ±  6.150  ns/op  29.757 ± 1.055  ns/op  (8.25x)
>
>
> I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated!
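[Editorial sketch] To make the description above concrete, here is a minimal Java sketch of narrowing-cast loops of the kind the benchmark names (`intToByte`, `intToShort`, `shortToByte`) suggest. The class name and method bodies are illustrative assumptions, not the benchmark's actual source, and whether C2 vectorizes them depends on the JDK build and the CPU.

```java
import java.util.Arrays;

// Hypothetical loop shapes; only the method names echo the quoted benchmark.
final class SubwordCastLoops {
    // int -> byte: each element keeps only its low 8 bits.
    static void intToByte(int[] src, byte[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
    }

    // int -> short: each element keeps only its low 16 bits.
    static void intToShort(int[] src, short[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (short) src[i];
        }
    }

    // short -> byte: each element keeps only its low 8 bits.
    static void shortToByte(short[] src, byte[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
    }

    public static void main(String[] args) {
        int[] ints = {0x1FF, -1, 128, 70000};
        byte[] bytes = new byte[ints.length];
        short[] shorts = new short[ints.length];
        intToByte(ints, bytes);
        intToShort(ints, shorts);
        System.out.println(Arrays.toString(bytes));
        System.out.println(Arrays.toString(shorts));
    }
}
```

The load and the store in each loop use different element sizes, which is the situation the description says previously made superword discard the packs.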
Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision:

  Address comments from review, refactor test

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/23413/files
  - new: https://git.openjdk.org/jdk/pull/23413/files/6daa8ace..8920454d

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=23413&range=04
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23413&range=03-04

  Stats: 355 lines in 4 files changed: 89 ins; 263 del; 3 mod
  Patch: https://git.openjdk.org/jdk/pull/23413.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23413/head:pull/23413

PR: https://git.openjdk.org/jdk/pull/23413

From jkarthikeyan at openjdk.org  Mon Feb 17 05:13:07 2025
From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan)
Date: Mon, 17 Feb 2025 05:13:07 GMT
Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v5]
In-Reply-To: 
References: 
Message-ID: 

> Hi all,
> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats in order to calculate the leading zeros based on the exponent part of the float. Unfortunately, floats can only represent every integer exactly up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up.
>
> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to `TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with.
>
> Reviews would be appreciated!
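[Editorial sketch] The rounding problem described above can be reproduced in scalar Java. The sketch below is an illustration only, not the HotSpot intrinsic: `nlzViaFloat` is a hypothetical helper that reads the leading-zero count off the float exponent, and `nlzViaFloatMasked` applies the shift-by-1 masking idea discussed elsewhere in this thread before converting. Both helpers assume a positive input.

```java
// Hypothetical scalar illustration of the float-exponent trick and its pitfall.
final class LeadingZerosRounding {
    // Naive leading-zero count for x > 0, read from the exponent of (float) x.
    static int nlzViaFloat(int x) {
        return 31 - Math.getExponent((float) x);
    }

    // Same idea, but first clear every bit that sits directly below a set bit.
    // The value then has the form 10xx...x, so rounding to 24 significant bits
    // cannot carry it up to the next power of two.
    static int nlzViaFloatMasked(int x) {
        return nlzViaFloat(x & ~(x >>> 1));
    }

    public static void main(String[] args) {
        int a = 0x01FFFFFF;
        int b = 0x02000000;
        System.out.println("same float: " + ((float) a == (float) b));
        System.out.println("expected:   " + Integer.numberOfLeadingZeros(a));
        System.out.println("naive:      " + nlzViaFloat(a));
        System.out.println("masked:     " + nlzViaFloatMasked(a));
    }
}
```

For `0x01FFFFFF` the naive helper is off by one because the conversion rounds up to 2^25, while the masked helper agrees with `Integer.numberOfLeadingZeros`.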
Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Merge branch 'master' into fix-8349637 - Comments from review, add exhaustive test - Improve explanation of logic - Comments from code review - Fix CountLeadingZerosV miscompile on AVX2 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23579/files - new: https://git.openjdk.org/jdk/pull/23579/files/27ca15c5..e8820bcb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23579&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23579&range=03-04 Stats: 29960 lines in 1108 files changed: 14599 ins; 9294 del; 6067 mod Patch: https://git.openjdk.org/jdk/pull/23579.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23579/head:pull/23579 PR: https://git.openjdk.org/jdk/pull/23579 From jkarthikeyan at openjdk.org Mon Feb 17 05:14:00 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 17 Feb 2025 05:14:00 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v6] In-Reply-To: References: Message-ID: > Hi all, > This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. 
I have attached a JMH benchmark and got these results on my Zen 3 machine:
>
>
>                                                  Baseline                  Patch
> Benchmark                 (SIZE)  Mode  Cnt    Score    Error  Units   Score   Error  Units  Improvement
> VectorSubword.intToByte     1024  avgt   12  200.049 ± 19.787  ns/op  56.228 ± 3.535  ns/op  (3.56x)
> VectorSubword.intToShort    1024  avgt   12  179.826 ±  1.539  ns/op  43.332 ± 1.166  ns/op  (4.15x)
> VectorSubword.shortToByte   1024  avgt   12  245.580 ±  6.150  ns/op  29.757 ± 1.055  ns/op  (8.25x)
>
>
> I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated!

Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision:

 - Merge branch 'master' into vectorize-subword
 - Address comments from review, refactor test
 - Add new conversions to benchmark
 - Fix some tests that now vectorize
 - Implement widening and address comments from review
 - Subword vectorization

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/23413/files
  - new: https://git.openjdk.org/jdk/pull/23413/files/8920454d..b02408f7

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=23413&range=05
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23413&range=04-05

  Stats: 29959 lines in 1108 files changed: 14599 ins; 9294 del; 6066 mod
  Patch: https://git.openjdk.org/jdk/pull/23413.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23413/head:pull/23413

PR: https://git.openjdk.org/jdk/pull/23413

From aboldtch at openjdk.org  Mon Feb 17 06:41:18 2025
From: aboldtch at openjdk.org (Axel Boldt-Christmas)
Date: Mon, 17 Feb 2025 06:41:18 GMT
Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v10]
In-Reply-To: 
References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com>
Message-ID: 

On Sat, 15 Feb 2025 06:34:56 GMT, Vladimir Kozlov wrote:

>> Remove
virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove commented lines left by mistake Not looked at the SA changes. lgtm. src/hotspot/share/code/codeBlob.hpp line 308: > 306: > 307: class Vptr : public CodeBlob::Vptr { > 308: }; Was this needed for some compiler? Or is it to be more explicit about the type hierarchy? ------------- Marked as reviewed by aboldtch (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2620128040 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1957678232 From duke at openjdk.org Mon Feb 17 06:44:11 2025 From: duke at openjdk.org (Nicole Xu) Date: Mon, 17 Feb 2025 06:44:11 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: <52HO_iL9asn1huCdJj82R1AwF1w8ON9HZetrdc9rQyQ=.28e137e0-a7f7-4839-a3e7-eda4f8a6c4f5@github.com> References: <52HO_iL9asn1huCdJj82R1AwF1w8ON9HZetrdc9rQyQ=.28e137e0-a7f7-4839-a3e7-eda4f8a6c4f5@github.com> Message-ID: On Thu, 13 Feb 2025 12:09:43 GMT, Jatin Bhateja wrote: >> test/micro/org/openjdk/bench/jdk/incubator/vector/MaskedLogicOpts.java line 122: >> >>> 120: @Setup(Level.Invocation) >>> 121: public void init_per_invoc() { >>> 122: int512_arr_idx = (int512_arr_idx + 16) & (ARRAYLEN-1); >> >> Benchmark assumes that ARRAYLEN is a POT value, thus it will also be good to use the modulous operator for rounding here, it will be expensive but will not impact the performance of the 
Benchmarking kernels. > > Please try with following command line > `java -jar target/benchmarks.jar -f 1 -i 2 -wi 1 -w 30 -p ARRAYLEN=30 MaskedLogic` Thanks for pointing that out. Typically, ARRAYLEN is almost always a POT value, which is also assumed by many other benchmarks. Are we realistically going to test with an ARRAYLEN of 30? I think the POT assumption is reasonable for our purposes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22963#discussion_r1957691283 From duke at openjdk.org Mon Feb 17 06:44:12 2025 From: duke at openjdk.org (Nicole Xu) Date: Mon, 17 Feb 2025 06:44:12 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 11:51:53 GMT, Jatin Bhateja wrote: >> Suite MaskedLogicOpts.maskedLogicOperationsLong512() failed on both x86 and AArch64 with the following error: >> >> >> java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249 >> >> >> The variable `long256_arr_idx` is misused when indexing 'LongVector l2, l3, l4, l5' in function `maskedLogicOperationsLongKernel()`. 'long256_arr_idx' increases by 4 every time the benchmark runs and ensures the incremented value remains within the bounds of the array. However, for `LongVector.SPECIES_512`, it loads 8 numbers from the array each time the benchmark runs, resulting in an out-of-range indexing issue. >> >> Hence, we revised the index variables from `long256_arr_idx` to `long512_arr_idx`, which has a stride of 8, to ensure that the loaded vector is inside of the array boundary for all vector species. This is also consistent with other kernel functions. >> >> Additionally, some defined but unused variables have been removed. 
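[Editorial sketch] The stride mismatch described above can be modelled without the Vector API. The sketch below is a simplified model, not the benchmark's code: `ARRAYLEN = 256` is an assumption (a power of two that is consistent with the reported "length 249", since 256 - 8 + 1 = 249 valid start positions exist for an 8-lane load), and the method names are hypothetical.

```java
// Hypothetical model of a start index that advances by `stride` while each
// access reads `lanes` consecutive elements.
final class StrideBoundsSketch {
    static final int ARRAYLEN = 256; // assumption: power-of-two array length

    // Advance the start index by `stride` and wrap it with a power-of-two mask.
    static int next(int idx, int stride) {
        return (idx + stride) & (ARRAYLEN - 1);
    }

    // Returns the first visited start index at which reading `lanes` elements
    // would run past the end of the array, or -1 if every visited index fits.
    static int firstOverflow(int stride, int lanes) {
        int idx = 0;
        for (int step = 0; step < ARRAYLEN / stride; step++) {
            if (idx + lanes > ARRAYLEN) {
                return idx;
            }
            idx = next(idx, stride);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stride of 4 (256-bit long index) combined with an 8-lane (512-bit) load.
        System.out.println(firstOverflow(4, 8));
        // Stride of 8 matching the 8-lane load.
        System.out.println(firstOverflow(8, 8));
    }
}
```

Under these assumptions the mismatched stride first fails at start index 252, which matches the index in the reported exception, while a stride equal to the lane count never leaves the array.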
> > test/micro/org/openjdk/bench/jdk/incubator/vector/MaskedLogicOpts.java line 126: > >> 124: } >> 125: >> 126: @CompilerControl(CompilerControl.Mode.INLINE) > > By making the index hop over 16 ints or 8 longs we may leave gaps in between for 128-bit and 256-bit species, this will unnecessarily include the noise due to cache misses or (on some targets) prefetching additional cache lines which are not usable, thereby impacting the crispness of microbenchmark. Thanks, @jatin-bhateja, for your thorough review. Yes, you're right, the current design does introduce gaps when accessing the data. And other cases in this suite are also designed in a similar way. I am wondering if you're interested in refactoring the code to address this more comprehensively. The primary goal of this pull request is to address an out-of-bounds issue that was blocking our tests. We aimed to ensure all test cases pass and become unblocked. Additionally, our available hardware resources limit our ability to rigorously test a wide range of scenarios at this time. Therefore, we opted to maintain consistency with the existing logic in other cases within the suite and simply fix the crash issue. This approach allows us to unblock the current tests and keep things moving for now. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22963#discussion_r1957668625 From thartmann at openjdk.org Mon Feb 17 07:01:24 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 17 Feb 2025 07:01:24 GMT Subject: RFR: 8347499: C2: Make `PhaseIdealLoop` eliminate more redundant safepoints in loops [v2] In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 09:00:38 GMT, Qizheng Xing wrote: >> In `PhaseIdealLoop`, `IdealLoopTree::check_safepts` method checks if any call that is guaranteed to have a safepoint dominates the tail of the loop. In the previous implementation, `check_safepts` would stop if it found a local non-call safepoint. 
At this time, if there was a call before the safepoint in the dom-path, this safepoint would not be eliminated. >> >> loop-safepoint >> >> This patch changes the behavior of `check_safepts` to not stop when it finds a non-local safepoint. This makes simple loops with one method call ~3.8% faster (on aarch64). >> >> >> Benchmark Mode Cnt Score Error Units >> LoopSafepoint.loopVar avgt 15 208296.259 ? 1350.409 ns/op # baseline >> LoopSafepoint.loopVar avgt 15 200692.874 ? 616.770 ns/op # this patch >> >> >> Testing: tier1-2 on x86_64 and aarch64. > > Qizheng Xing has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into enhance-loop-safepoint-elim > - Add IR test and microbench. > - Make `PhaseIdealLoop` eliminate more redundant safepoints in loops. Thanks, testing looks all clean but I leave this to someone else for review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23057#issuecomment-2662215357 From dfenacci at openjdk.org Mon Feb 17 08:30:13 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Mon, 17 Feb 2025 08:30:13 GMT Subject: RFR: 8339889: Several compiler tests ignore vm flags and not marked as flagless [v2] In-Reply-To: References: <5acZ_FmW23VeDgOFMiEuUa60TLxaOcC3wWZVwHFh8EU=.95188fc9-2f54-47af-a91c-4855db76f399@github.com> Message-ID: <7w3Xg7S_9ruBGIBd6sZa_9byn11QMPKXFpVPziooI5U=.3a957c35-2a04-4b2c-becb-24b4b0fe9175@github.com> On Tue, 11 Feb 2025 22:43:35 GMT, Leonid Mesnik wrote: >> Tests >> compiler/c2/TestReduceAllocationAndHeapDump.java >> compiler/calls/NativeCalls.java >> compiler/debug/TestStress.java >> compiler/inlining/TestDuplicatedLateInliningOutput.java >> ignore vm flags using limited process builder and not marked as flagless. 
>>
>> Please note that test
>> compiler/inlining/TestDuplicatedLateInliningOutput.java
>> is failing with some VM flags. See
>> https://bugs.openjdk.org/browse/JDK-8348214
>>
>> I haven't excluded the test, since it fails only with certain uncommon flags.
>
> Leonid Mesnik has updated the pull request incrementally with one additional commit since the last revision:
>
>   test updated as suggested.

Thanks for "cleaning this up" @lmesnik.

I just ran a quick grep on `test/hotspot/jtreg/compiler` and noticed that there are a few more tests that use `ProcessTools.createLimitedTestJavaProcessBuilder` but don't have `vm.flagless` and don't seem to be covered by other JBS issues (e.g. `compiler/codecache/CheckLargePages.java`, `compiler/onSpinWait/TestOnSpinWaitAArch64DefaultFlags.java`, `compiler/jvmci/TestUncaughtErrorInCompileMethod.java` or `compiler/jvmci/compilerToVM/GetFlagValueTest.java`). Their main method runs in a new VM (`@run main/othervm`) but then they run other processes with `ProcessTools.createLimitedTestJavaProcessBuilder`. As I understand it, VM flags would only affect the main method (which supposedly is not what is being tested). So, I was wondering if it made sense to mark them flagless as well anyway.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23224#issuecomment-2662378337 From rgiulietti at openjdk.org Mon Feb 17 08:52:11 2025 From: rgiulietti at openjdk.org (Raffaello Giulietti) Date: Mon, 17 Feb 2025 08:52:11 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: <6yD3oEDRMPClNVVkEi64IAbnT4fOiMbgCjx6xWXU3bk=.1cb181b8-619c-47bd-91cf-2a230566442f@github.com> References: <6yD3oEDRMPClNVVkEi64IAbnT4fOiMbgCjx6xWXU3bk=.1cb181b8-619c-47bd-91cf-2a230566442f@github.com> Message-ID: <0kVvHD9uw_o_Rvs-72WrQr5MBfGb9ObZ9Bf1EhslZpM=.2449dca6-ff4b-4498-bccf-c2819c700ee5@github.com> On Mon, 17 Feb 2025 03:41:11 GMT, Jasmine Karthikeyan wrote: >> Here's the pseudo-code for an implementation with 13 vector instructions. >> >> Let `fp` denote `float` or `double`. >> Correspondingly, let `P` = 24, 53 (precision); `L` = 5, 6; `W` = 2^`L` (lane width). >> >> The code below is pseudo Java and describes `W`-bit lane operations. >> Note that each line corresponds to one vector instruction. >> Further, there's no need for `xtmp3`. >> >> >> // Convert src to floating-point. >> // First ensure that the bit to the right of the leading 1, if any, is 0. >> dst = src >>> 1 >> dst = ~dst & src >> // If available, prefer a conversion instruction that interprets dst as unsigned. >> // Otherwise, a correction is needed later (see further down the code). >> dst = fpToRawBits((fp) dst) >> >> // Set xtmp1 = -1 (all one-bits) for later use >> xtmp1 = -1 >> >> // Extract the biased exponent >> xtmp2 = xtmp1 >>> P >> dst = dst >>> (P - 1) >> dst = xtmp2 & dst >> >> // Compute the exponent >> // Set xtmp2 = BIAS >> xtmp2 = xtmp1 >>> (P + 1) >> dst = dst - xtmp2 >> >> // Set xtmp2 = W - 1 >> xtmp2 = xtmp1 >>> (W - L) >> >> // Adjust for special cases. >> >> // We have: src == 0 iff dst < 0 >> // When src == 0, we force the exponent to -1 >> dst = dst >= 0 ? dst : xtmp1 // blend >> >> // When src < 0, we force the exponent to W - 1. 
>> // This is only needed if the conversion to floating-point above interprets its argument as signed. >> dst = src >= 0 ? dst : xtmp2 // blend >> >> // final result >> dst = xtmp2 - dst > > @rgiulietti Shifting by 1 instead of 24 is a really good idea, it makes showing the validity a lot more simple as you mention. I've applied the suggestion in the latest commit. The updated instruction sequence is also very interesting, I'd like to take a look at it in a followup RFE. I was planning on taking a closer look at the long intrinsic after this patch, since it doesn't use the floating point trick that int does and I was very curious to see what the performance would be like with it. > > @TobiHartmann I've pushed an adapted version of your test that checks for `numberOfLeadingZeros`/`numberOfTrailingZeros` correctness for int and long. Let me know what you think! @jaskarth A real Java implementation of the pseudo-code above has been successfully tested on the whole range of `int` by comparing the outcomes with the standard `Integer.numberOfLeadingZeros()`. 
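[Editorial sketch] For readers who want to experiment, the quoted pseudo-code can be transliterated to scalar Java for the `float` case (`P` = 24, `L` = 5, `W` = 32). This is not the implementation the author tested and not vector code: the class name is hypothetical, each statement stands in for one lane of one vector instruction, `fpToRawBits` is assumed to correspond to `Float.floatToRawIntBits`, and the Java `(float)` cast is a signed conversion, so the final correction for negative inputs is required.

```java
// Hypothetical scalar transliteration of the quoted pseudo-code (float case).
final class NlzPseudoCodePort {
    static final int P = 24;     // float precision
    static final int L = 5;      // log2 of the lane width
    static final int W = 1 << L; // lane width in bits

    static int nlz(int src) {
        // Clear the bit to the right of the leading 1, then convert.
        int dst = src >>> 1;
        dst = ~dst & src;
        dst = Float.floatToRawIntBits((float) dst);

        int xtmp1 = -1;                  // all one-bits

        // Extract the biased exponent.
        int xtmp2 = xtmp1 >>> P;         // 0xFF
        dst = dst >>> (P - 1);
        dst = xtmp2 & dst;

        // Remove the bias.
        xtmp2 = xtmp1 >>> (P + 1);       // 127
        dst = dst - xtmp2;

        xtmp2 = xtmp1 >>> (W - L);       // W - 1 = 31

        // src == 0 yields a negative exponent here; force it to -1.
        dst = dst >= 0 ? dst : xtmp1;
        // Signed conversion: force the exponent to W - 1 for negative inputs.
        dst = src >= 0 ? dst : xtmp2;

        return xtmp2 - dst;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 0x01FFFFFF, 0x02000000, Integer.MAX_VALUE, -1};
        for (int x : samples) {
            System.out.println(Integer.toHexString(x) + " -> " + nlz(x)
                    + " (expected " + Integer.numberOfLeadingZeros(x) + ")");
        }
    }
}
```

A loop over all `int` values comparing `nlz` with `Integer.numberOfLeadingZeros` would mirror the exhaustive check mentioned in the message above.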
------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2662435302 From bkilambi at openjdk.org Mon Feb 17 09:11:10 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 17 Feb 2025 09:11:10 GMT Subject: RFR: 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations In-Reply-To: <4MpRHYuBSykPsmf5fBiM19eDqaecdNlJCr85x60XIRI=.37c7cdba-5baf-462f-8e43-48b3677a22b1@github.com> References: <4MpRHYuBSykPsmf5fBiM19eDqaecdNlJCr85x60XIRI=.37c7cdba-5baf-462f-8e43-48b3677a22b1@github.com> Message-ID: <_n3qzv9qwTORi2YwTduo58WYHpjB-CX__3cTHhPWxag=.06e48dfd-21e5-46d2-9ee6-57d990f1830a@github.com> On Fri, 14 Feb 2025 06:23:49 GMT, Xiaohong Gong wrote: >> src/hotspot/cpu/aarch64/aarch64_vector.ad line 1574: >> >>> 1572: instruct vsqadd_masked(vReg dst_src1, vReg src2, pRegGov pg) %{ >>> 1573: predicate(UseSVE == 2 && !n->as_SaturatingVector()->is_unsigned()); >>> 1574: match(Set dst_src1 (SaturatingAddV (Binary dst_src1 src2) pg)); >> >> for the masked match rules, should we also add `USE_DEF` effect for `dst_src1` to indicate that this register is both read and written to destructively ? I see that other similarly defined match rules in the ad file do not have this effect defined but I am wondering if this should be done? > > Hi @Bhavana-Kilambi , thanks for looking at this PR! And yes, the `dst_src1` should be `USE_DEF` actually, but I think it's safe not adding the effect here manually. The compiler adlc will add the use-def information for each operands when parsing each match rule. You may look at the code details from https://github.com/openjdk/jdk/blob/master/src/hotspot/share/adlc/formssel.cpp#L939 . 
Thanks @XiaohongGong , got it :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23608#discussion_r1957863089 From bkilambi at openjdk.org Mon Feb 17 09:24:24 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 17 Feb 2025 09:24:24 GMT Subject: RFR: 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 01:47:10 GMT, Xiaohong Gong wrote: > Since PR [1] has added several new vector operations in VectorAPI and the X86 backend implementation for them, this patch adds the AArch64 backend part for NEON/SVE architectures. > > The performance of Vector API relative JMH micro benchmarks can improve about 70x ~ 95x on a NVIDIA Grace CPU, which is a 128-bit vector length sve2 architecture, with different UseSVE options. Here is the gain details: > > > Benchmark (size) Mode Cnt -XX:UseSVE=0 -XX:UseSVE=1 -XX:UseSVE=2 > ByteMaxVector.SADD 1024 thrpt 30 80.69x 79.70x 80.534x > ByteMaxVector.SADDMasked 1024 thrpt 30 84.08x 85.72x 85.901x > ByteMaxVector.SSUB 1024 thrpt 30 80.46x 80.27x 81.063x > ByteMaxVector.SSUBMasked 1024 thrpt 30 83.96x 85.26x 85.887x > ByteMaxVector.SUADD 1024 thrpt 30 80.43x 80.36x 81.761x > ByteMaxVector.SUADDMasked 1024 thrpt 30 83.40x 84.62x 85.199x > ByteMaxVector.SUSUB 1024 thrpt 30 79.93x 79.22x 79.714x > ByteMaxVector.SUSUBMasked 1024 thrpt 30 82.93x 85.02x 84.726x > ByteMaxVector.UMAX 1024 thrpt 30 78.73x 77.39x 78.220x > ByteMaxVector.UMAXMasked 1024 thrpt 30 82.62x 84.77x 85.531x > ByteMaxVector.UMIN 1024 thrpt 30 79.04x 77.80x 78.471x > ByteMaxVector.UMINMasked 1024 thrpt 30 83.11x 84.86x 86.126x > IntMaxVector.SADD 1024 thrpt 30 83.11x 83.07x 83.183x > IntMaxVector.SADDMasked 1024 thrpt 30 90.67x 91.80x 93.162x > IntMaxVector.SSUB 1024 thrpt 30 83.37x 82.82x 83.317x > IntMaxVector.SSUBMasked 1024 thrpt 30 90.85x 92.87x 94.201x > IntMaxVector.SUADD 1024 thrpt 30 82.76x 81.78x 82.679x > IntMaxVector.SUADDMasked 1024 thrpt 
30  90.49x  91.93x  93.155x
> IntMaxVector.SUSUB        1024  thrpt  30  82.92x  82.34x  82.525x
> IntMaxVector.SUSUBMasked  1024  thrpt  30  90.60x  92.12x  92.951x
> IntMaxVector.UMAX         1024  thrpt  30  82.40x  81.85x  82.242x
> IntMaxVector.UMAXMasked   1024  thrpt  30  90.30x  92.10x  92.587x
> IntMaxVector.UMIN         1024  thrpt  30  82.84x  81.43x  82.801x
> IntMaxVector.UMINMasked   1024  thrpt  30  90.43x  91.49x  92.678x
> LongMaxVector.SADD        1024  thrpt  30  82.01x  81.74x  82.153x
> LongMaxVector...

Hi @XiaohongGong, I have tested jdk:tier 1-3, langtools:tier 1, hotspot:hotspot_all on a couple of Arm machines - N1 (128-bit, SVE == 0) and Graviton 3 (V1, 256-bit, SVE == 1) and all the JTREG tests pass on them. Did not run them on Grace as I thought you must have done that already. If not, please do let me know and I can run a pass on Grace as well :)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23608#issuecomment-2662512350

From syan at openjdk.org  Mon Feb 17 09:33:43 2025
From: syan at openjdk.org (SendaoYan)
Date: Mon, 17 Feb 2025 09:33:43 GMT
Subject: RFR: 8350178: Incorrect comment after JDK-8345580
Message-ID: 

Hi all,

In [JDK-8345580](https://bugs.openjdk.org/browse/JDK-8345580), the const modifier for variable Node::_idx was removed, but the comment describing Node::_idx as constant was left unchanged. The related description should be updated. This change only touches comments, so there is no risk.
------------- Commit messages: - 8350178: Incorrect comment after JDK-8345580 Changes: https://git.openjdk.org/jdk/pull/23659/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23659&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350178 Stats: 4 lines in 1 file changed: 0 ins; 2 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23659.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23659/head:pull/23659 PR: https://git.openjdk.org/jdk/pull/23659 From xgong at openjdk.org Mon Feb 17 09:57:09 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 17 Feb 2025 09:57:09 GMT Subject: RFR: 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 09:21:45 GMT, Bhavana Kilambi wrote: > Hi @XiaohongGong , I have tested jdk:tier 1-3, langtools:tier 1, hotspot:hotspot_all on a couple of Arm machines - N1 (128-bit, SVE == 0) and Graviton 3 (V1 , 256-bit, SVE == 1) and all the JTREG tests pass on them. Did not run them on Grace as I thought you must have done that already. If not, please do let me know and I can run a pass on Grace as well :) Thanks for your testing @Bhavana-Kilambi ! It's really great. Yes, I've passed all the tests on Grace with different SVE options as well.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23608#issuecomment-2662596314 From epeter at openjdk.org Mon Feb 17 10:19:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 17 Feb 2025 10:19:12 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: <6yD3oEDRMPClNVVkEi64IAbnT4fOiMbgCjx6xWXU3bk=.1cb181b8-619c-47bd-91cf-2a230566442f@github.com> References: <6yD3oEDRMPClNVVkEi64IAbnT4fOiMbgCjx6xWXU3bk=.1cb181b8-619c-47bd-91cf-2a230566442f@github.com> Message-ID: <6jaHfsQv-Ltq5s1jp4Ny04bUfC5i-rYNqHPFJ7Eiuk0=.9e07ca80-c8cf-4211-90cb-fb29b6f33638@github.com> On Mon, 17 Feb 2025 03:41:11 GMT, Jasmine Karthikeyan wrote: >> Here's the pseudo-code for an implementation with 13 vector instructions. >> >> Let `fp` denote `float` or `double`. >> Correspondingly, let `P` = 24, 53 (precision); `L` = 5, 6; `W` = 2^`L` (lane width). >> >> The code below is pseudo Java and describes `W`-bit lane operations. >> Note that each line corresponds to one vector instruction. >> Further, there's no need for `xtmp3`. >> >> >> // Convert src to floating-point. >> // First ensure that the bit to the right of the leading 1, if any, is 0. >> dst = src >>> 1 >> dst = ~dst & src >> // If available, prefer a conversion instruction that interprets dst as unsigned. >> // Otherwise, a correction is needed later (see further down the code). >> dst = fpToRawBits((fp) dst) >> >> // Set xtmp1 = -1 (all one-bits) for later use >> xtmp1 = -1 >> >> // Extract the biased exponent >> xtmp2 = xtmp1 >>> P >> dst = dst >>> (P - 1) >> dst = xtmp2 & dst >> >> // Compute the exponent >> // Set xtmp2 = BIAS >> xtmp2 = xtmp1 >>> (P + 1) >> dst = dst - xtmp2 >> >> // Set xtmp2 = W - 1 >> xtmp2 = xtmp1 >>> (W - L) >> >> // Adjust for special cases. >> >> // We have: src == 0 iff dst < 0 >> // When src == 0, we force the exponent to -1 >> dst = dst >= 0 ? dst : xtmp1 // blend >> >> // When src < 0, we force the exponent to W - 1. 
>> // This is only needed if the conversion to floating-point above interprets its argument as signed. >> dst = src >= 0 ? dst : xtmp2 // blend >> >> // final result >> dst = xtmp2 - dst > > @rgiulietti Shifting by 1 instead of 24 is a really good idea, it makes showing the validity a lot more simple as you mention. I've applied the suggestion in the latest commit. The updated instruction sequence is also very interesting, I'd like to take a look at it in a followup RFE. I was planning on taking a closer look at the long intrinsic after this patch, since it doesn't use the floating point trick that int does and I was very curious to see what the performance would be like with it. > > @TobiHartmann I've pushed an adapted version of your test that checks for `numberOfLeadingZeros`/`numberOfTrailingZeros` correctness for int and long. Let me know what you think! @jaskarth I filed the **truncation** issue here: https://bugs.openjdk.org/browse/JDK-8350177 ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2662654191 From chagedorn at openjdk.org Mon Feb 17 10:31:11 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 17 Feb 2025 10:31:11 GMT Subject: RFR: 8350178: Incorrect comment after JDK-8345580 In-Reply-To: References: Message-ID: <6G7YfdXgPUX6kzfsyFZs3cdB34KMf1naiYP9zYec1MM=.77cca655-f5c5-47e1-87b5-12baf19c6d08@github.com> On Mon, 17 Feb 2025 09:28:38 GMT, SendaoYan wrote: > Hi all, > In [JDK-8345580](https://bugs.openjdk.org/browse/JDK-8345580), the const modifier for variable Node::_idx has been removed, but the constant description for variable Node::_idx leave unchange. The related description should be updated. > > Only touch the comments, no risk. Otherwise, looks good and trivial src/hotspot/share/opto/node.hpp line 340: > 338: > 339: public: > 340: // Each Node is assigned a unique small/dense number. 
This number is used While touching the comments anyways: Suggestion: // Each Node is assigned a unique small/dense number. This number is used ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23659#pullrequestreview-2620644322 PR Review Comment: https://git.openjdk.org/jdk/pull/23659#discussion_r1957988176 From epeter at openjdk.org Mon Feb 17 10:46:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 17 Feb 2025 10:46:12 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 07:52:30 GMT, Nicole Xu wrote: >> Oh, the OCA-verify is still stuck. I'm sorry about that ? >> I pinged my manager @TobiHartmann , he will reach out to see what's the issue. > > Hi @eme64, do you see any risks here? Would you please help to review the patch? Thanks. @xyyNicole @jatin-bhateja I think it is reasonable to just fix the benchmark so that it still has the same behaviour, just without the out-of-bounds exception. @jatin-bhateja you originally wrote the benchmark, and it could make sense if you fixed it up to what it should be more ideally. @xyyNicole I propose that we file a follow-up RFE to fix the benchmark, and just mention that issue in the benchmark. What do you think? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2662722979 From syan at openjdk.org Mon Feb 17 10:52:00 2025 From: syan at openjdk.org (SendaoYan) Date: Mon, 17 Feb 2025 10:52:00 GMT Subject: RFR: 8350178: Incorrect comment after JDK-8345580 [v2] In-Reply-To: References: Message-ID: > Hi all, > In [JDK-8345580](https://bugs.openjdk.org/browse/JDK-8345580), the const modifier for variable Node::_idx has been removed, but the constant description for variable Node::_idx leave unchange. The related description should be updated. > > Only touch the comments, no risk. 
SendaoYan has updated the pull request incrementally with one additional commit since the last revision: Remove extra a whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23659/files - new: https://git.openjdk.org/jdk/pull/23659/files/0f65a393..78fd3da5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23659&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23659&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23659.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23659/head:pull/23659 PR: https://git.openjdk.org/jdk/pull/23659 From syan at openjdk.org Mon Feb 17 10:52:00 2025 From: syan at openjdk.org (SendaoYan) Date: Mon, 17 Feb 2025 10:52:00 GMT Subject: RFR: 8350178: Incorrect comment after JDK-8345580 [v2] In-Reply-To: <6G7YfdXgPUX6kzfsyFZs3cdB34KMf1naiYP9zYec1MM=.77cca655-f5c5-47e1-87b5-12baf19c6d08@github.com> References: <6G7YfdXgPUX6kzfsyFZs3cdB34KMf1naiYP9zYec1MM=.77cca655-f5c5-47e1-87b5-12baf19c6d08@github.com> Message-ID: On Mon, 17 Feb 2025 10:28:14 GMT, Christian Hagedorn wrote: >> SendaoYan has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove extra a whitespace > > src/hotspot/share/opto/node.hpp line 340: > >> 338: >> 339: public: >> 340: // Each Node is assigned a unique small/dense number. This number is used > > While touching the comments anyways: > Suggestion: > > // Each Node is assigned a unique small/dense number. This number is used Thanks. The extra whitespace has been removed. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23659#discussion_r1958018929 From fyang at openjdk.org Mon Feb 17 11:26:11 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 17 Feb 2025 11:26:11 GMT Subject: RFR: 8349908: RISC-V: C2 SelectFromTwoVector In-Reply-To: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> References: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> Message-ID: On Thu, 13 Feb 2025 14:20:40 GMT, Hamlin Li wrote: > Hi, > Can you help to review the patch? > This optimization is mainly for the vector API. > > Thanks > > ## Test > > ### jtreg > test/jdk/jdk/incubator/vector/ > > ### Performance > run on bananapi > > master vs patch > > Benchmark | (size) | Mode | Cnt | Score -master | Error - master | Score - patch | Error - patch | Units | Improvement (master / patch) > -- | -- | -- | -- | -- | -- | -- | -- | -- | -- > SelectFromBenchmark.selectFromByteVector | 1024 | avgt | 10 | 26422.495 | 674.565 | 721.604 | 1.036 | ns/op | 36.616 > SelectFromBenchmark.selectFromByteVector | 2048 | avgt | 10 | 53964.411 | 1751.618 | 1385.24 | 0.956 | ns/op | 38.957 > SelectFromBenchmark.selectFromDoubleVector | 1024 | avgt | 10 | 218430.616 | 1369.409 | 7739.774 | 14.408 | ns/op | 28.222 > SelectFromBenchmark.selectFromDoubleVector | 2048 | avgt | 10 | 387889.456 | 7889.791 | 16197.77 | 66.775 | ns/op | 23.947 > SelectFromBenchmark.selectFromFloatVector | 1024 | avgt | 10 | 103483.717 | 492.525 | 3580.358 | 29.127 | ns/op | 28.903 > SelectFromBenchmark.selectFromFloatVector | 2048 | avgt | 10 | 226125.02 | 3118.836 | 7797.025 | 4.346 | ns/op | 29.001 > SelectFromBenchmark.selectFromIntVector | 1024 | avgt | 10 | 97007.999 | 2607.711 | 2898.38 | 0.833 | ns/op | 33.47 > SelectFromBenchmark.selectFromIntVector | 2048 | avgt | 10 | 222303.308 | 3096.615 | 6398.214 | 30.345 | ns/op | 34.745 > SelectFromBenchmark.selectFromLongVector | 1024 | avgt | 10 | 
245033.436 | 1652.527 | 6307.773 | 24.597 | ns/op | 38.846 > SelectFromBenchmark.selectFromLongVector | 2048 | avgt | 10 | 438503.547 | 5972.265 | 17215.996 | 167.442 | ns/op | 25.471 > SelectFromBenchmark.selectFromShortVector | 1024 | avgt | 10 | 53632.502 | 2159.433 | 1418.215 | 2.953 | ns/op | 37.817 > SelectFromBenchmark.selectFromShortVector | 2048 | avgt | 10 | 111764.327 | 1220.509 | 3061.386 | 14.716 | ns/op | 36.508 > > Seems fine. I have two minor questions. Thanks. src/hotspot/cpu/riscv/riscv_v.ad line 103: > 101: break; > 102: case Op_SelectFromTwoVector: > 103: return true; Seems not necessary to add one more case here as the default will return true for this case. src/hotspot/cpu/riscv/riscv_v.ad line 4447: > 4445: __ vsetvli_helper(bt, Matcher::vector_length(this)); > 4446: __ vrgather_vv(as_VectorRegister($dst$$reg), as_VectorRegister($src1$$reg), > 4447: as_VectorRegister($index$$reg)); I suppose the indices here comes with the same element width as the two vector sources? The spec says `vrgather.vv` uses same SEW/LMUL for both the data and indices. ------------- PR Review: https://git.openjdk.org/jdk/pull/23614#pullrequestreview-2620754379 PR Review Comment: https://git.openjdk.org/jdk/pull/23614#discussion_r1958053692 PR Review Comment: https://git.openjdk.org/jdk/pull/23614#discussion_r1958066222 From rrich at openjdk.org Mon Feb 17 11:30:13 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 17 Feb 2025 11:30:13 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v2] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Fri, 14 Feb 2025 22:38:23 GMT, Dean Long wrote: > > I think you can make the assertion a little stricter like this [reinrich at 9c3c8a3](https://github.com/reinrich/jdk/commit/9c3c8a33a29b9ae6c4c703992b306dc0cbbcd2f0). 
> > Regarding this stricter version, why are you using is_bottom_frame instead of is_top_frame? The deoptimization code seems to name the most recent leaf frame "top". That sounds like what frame::top_ijava_frame_abi_size is for too. Correct, the top frame has a frame::top_ijava_frame_abi but the assertion is about the abi section in the current frame's caller and the the bottom frame's caller also has a top_ijava_frame_abi because i2c doesn't modify it. Continue reading if you're interested in more details... As said the i2c adapter does *not* trimm the caller frame as the interpreter would, replacing its large `top_ijava_frame_abi` with a smaller `parent_ijava_frame_abi`. Example: compiled frame DEOPTEE is replaced with 3 interpreted frames Stack before deoptimization | | | Interpreted CALLER | | of DEOPTEE frame | | | +------------------------+ | | | top_ijava_frame_abi | | | +========================+ | | | Compiled | | DEOPTEE | | | +------------------------+ | java_abi | +========================+ Stack when assertion is checked (i.e. after DEOPTEE was replaced by corresponding inter. frames) | | | Interpreted CALLER | | of DEOPTEE frame | | | +------------------------+ | | | top_ijava_frame_abi | <- i2c keeps large abi | | +========================+ | | <- bottom frame | Interpreted Frame 0 | | corresp. to DEOPTEE | | | +------------------------+ | parent_ijava_frame_abi | +========================+ | | | Interpreted Frame 1 | | (inlined by DEOPTEE) | | | +------------------------+ | parent_ijava_frame_abi | +========================+ | | <- top frame | Interpreted Frame 2 | | (inlined by DEOPTEE) | | | +------------------------+ | | | top_ijava_frame_abi | | | +========================+ Notes: (refering to the frame sections rather than the C++ types) - top_ijava_frame_abi complies to the native abi (modelled by frame::native_abi_reg_args). This is needed for VM calls. - parent_ijava_frame_abi is equal to frame::java_abi. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23557#issuecomment-2662835374 From gcao at openjdk.org Mon Feb 17 11:38:22 2025 From: gcao at openjdk.org (Gui Cao) Date: Mon, 17 Feb 2025 11:38:22 GMT Subject: Integrated: 8349764: RISC-V: C1: Improve Class.isInstance intrinsic In-Reply-To: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> References: <4oU1EPodlsbgVZE6GstRU_xugwAaKT_g-geBascyyTg=.586479be-ffdb-41e0-a4e2-7d28a1804757@github.com> Message-ID: On Tue, 11 Feb 2025 03:15:47 GMT, Gui Cao wrote: > Follow this patch https://github.com/openjdk/jdk/pull/22491, RISC-V implementation for Class.isInstance intrinsic. > > > ### JMH numbers (tested on milkv megrez with hotspot client build): > > #### before this patch: > > Benchmark Mode Cnt Score Error Units > SecondarySupersLookup.testNegative00 avgt 15 48.589 ? 0.981 ns/op > SecondarySupersLookup.testNegative01 avgt 15 48.577 ? 0.297 ns/op > SecondarySupersLookup.testNegative02 avgt 15 48.760 ? 0.740 ns/op > SecondarySupersLookup.testNegative03 avgt 15 48.442 ? 0.029 ns/op > SecondarySupersLookup.testNegative04 avgt 15 48.453 ? 0.095 ns/op > SecondarySupersLookup.testNegative05 avgt 15 48.435 ? 0.025 ns/op > SecondarySupersLookup.testNegative06 avgt 15 48.540 ? 0.476 ns/op > SecondarySupersLookup.testNegative07 avgt 15 48.452 ? 0.032 ns/op > SecondarySupersLookup.testNegative08 avgt 15 48.466 ? 0.034 ns/op > SecondarySupersLookup.testNegative09 avgt 15 48.478 ? 0.132 ns/op > SecondarySupersLookup.testNegative10 avgt 15 48.435 ? 0.032 ns/op > SecondarySupersLookup.testNegative16 avgt 15 48.440 ? 0.027 ns/op > SecondarySupersLookup.testNegative20 avgt 15 47.977 ? 0.989 ns/op > SecondarySupersLookup.testNegative30 avgt 15 48.655 ? 0.487 ns/op > SecondarySupersLookup.testNegative32 avgt 15 48.566 ? 0.251 ns/op > SecondarySupersLookup.testNegative40 avgt 15 48.513 ? 0.196 ns/op > SecondarySupersLookup.testNegative50 avgt 15 48.454 ? 
0.075 ns/op > SecondarySupersLookup.testNegative55 avgt 15 71.670 ? 1.632 ns/op > SecondarySupersLookup.testNegative56 avgt 15 70.923 ? 1.679 ns/op > SecondarySupersLookup.testNegative57 avgt 15 70.140 ? 0.048 ns/op > SecondarySupersLookup.testNegative58 avgt 15 70.473 ? 0.726 ns/op > SecondarySupersLookup.testNegative59 avgt 15 70.127 ? 0.022 ns/op > SecondarySupersLookup.testNegative60 avgt 15 82.525 ? 1.178 ns/op > SecondarySupersLookup.testNegative61 avgt 15 81.647 ? 0.758 ns/op > SecondarySupersLookup.testNegative62 avgt 15 82.347 ? 1.943 ns/op > SecondarySupersLookup.testNegative63 avgt 15 129.188 ? 1.550 ns/op > SecondarySupersLookup.testNegative64 avgt 15 130.274 ? 1.668 ns/op > SecondarySupersLookup.testPositive01 avgt 15 63.390 ? 0.222 ns/op > SecondarySupersLookup.testPositive02 avgt 15 63.435 ? 0.027 ns/op > SecondarySupersLookup.testPositive03 avgt 15 63.469 ? 0.080 ns/op > SecondarySupersLookup.testPositive04 avgt 15 63.896 ... This pull request has now been integrated. Changeset: b3a4026c Author: Gui Cao Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/b3a4026c65eb049eb4f3a3cbf52c9f0c9979a256 Stats: 47 lines in 2 files changed: 46 ins; 0 del; 1 mod 8349764: RISC-V: C1: Improve Class.isInstance intrinsic Reviewed-by: fyang, mli ------------- PR: https://git.openjdk.org/jdk/pull/23551 From dfenacci at openjdk.org Mon Feb 17 11:58:11 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Mon, 17 Feb 2025 11:58:11 GMT Subject: RFR: 8348645: IGV: visualize live ranges In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 09:59:48 GMT, Roberto Casta?eda Lozano wrote: > This changeset extends IGV with live range visualization. It introduces live ranges as first-class IGV entities and displays them along with the control-flow graph in the CFG view. Visualizing liveness information should hopefully make C2's register allocator easier to understand, diagnose, debug, and enhance. 
> > Live ranges are visible in C2 phases where liveness information is available, that is, phases `Initial liveness` to `Fix up spills` at IGV print level 4 or greater. For example, running a debug build of the JVM as follows: > > > java -Xbatch -XX:CompileCommand=IGVPrintLevel,java.util.HashMap::newNode,4 > > > produces the following visualization for the `Initial spilling` phase: > > ![initial-spilling](https://github.com/user-attachments/assets/1ecf74f5-92a8-4866-b1ec-2323bb0c428e) > > Live ranges are first-class IGV entities, meaning that the user can: > > - search, select, and extract them; > > ![search-extract](https://github.com/user-attachments/assets/8e0dfa59-457f-49cb-b2b5-1d202301c79d) > > - examine their properties in the `Properties` window or via tooltips; > > ![properties](https://github.com/user-attachments/assets/68d2d23b-b986-4d2e-835c-b661bce0de23) > > - navigate to related IGV entities via a pop-up menu; and > > ![popup](https://github.com/user-attachments/assets/21de2fef-d36a-42d5-b828-2696d87a18ea) > > - program filters that act om them according to their properties. > > ![filters](https://github.com/user-attachments/assets/e993b067-d0b8-452c-a885-c4e601e31e1c) > > Live ranges are connected to nodes by a use-def relation: a node can define zero or one live ranges, and use multiple live ranges; a live range can be defined and used by multiple nodes. Consequently, a live range in IGV is visible if and only if all its related nodes are visible (fully or semi-transparently). Generally, the start and end of a live range are vertically aligned with the nodes that first define and last use the live range. To reflect accurately the semantics of Phi nodes w.r.t. liveness, the visualization treats live ranges related by Phi nodes specially: live ranges used by a Phi node end at the bottom of the corresponding predecessor basic blocks, whereas live ranges defined by a Phi node start at the top of the node's basic block. 
The following screenshot shows an example of a Phi node (`48 Phi`) joining live ranges `L8` and `L13` into `L15`: > > ![phi](https://github.com/user-attachments/assets/0ef8aa1d-523d-4391-982e-6b74c2016a3c) > > The changeset extends the IGV graph printing logic in HotSpot t... @robcasloz thanks a lot for this amazing improvement! Just a quick one: I noticed that, with your `java -Xbatch -XX:CompileCommand=IGVPrintLevel,java.util.HashMap::newNode,4` example with the `Initial spilling` phase selected, if I click on the live range icon image to make it disappear, I get a `NullPointerException` image (no further details apart from the exception) ------------- PR Review: https://git.openjdk.org/jdk/pull/23558#pullrequestreview-2620847923 From chagedorn at openjdk.org Mon Feb 17 11:58:11 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 17 Feb 2025 11:58:11 GMT Subject: RFR: 8350178: Incorrect comment after JDK-8345580 [v2] In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 10:52:00 GMT, SendaoYan wrote: >> Hi all, >> In [JDK-8345580](https://bugs.openjdk.org/browse/JDK-8345580), the const modifier for variable Node::_idx has been removed, but the constant description for variable Node::_idx leave unchange. The related description should be updated. >> >> Only touch the comments, no risk. > > SendaoYan has updated the pull request incrementally with one additional commit since the last revision: > > Remove extra a whitespace Looks good thanks. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23659#pullrequestreview-2620849430 From epeter at openjdk.org Mon Feb 17 12:05:16 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 17 Feb 2025 12:05:16 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 07:18:52 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. >> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: >> >> >> ((base + index << 8) + 256) & 255 >> => MulNode (this PR) >> (base + index << 8) & 255 >> => MulNode (PR #6697) >> base & 255 (loop invariant) >> >> >> Implementation notes: >> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. >> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ >> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > incorporate @eme64's comment suggestions @mernst-github Tests are passing. Nice work, and thanks for sticking with us! Approved. ------------- Marked as reviewed by epeter (Reviewer). 
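The reduction quoted in this review can be checked with a minimal sketch, assuming the expression in the PR description is meant as `base + ((index + 1) << 8)` (in Java, `+` binds tighter than `<<`, so this parenthesization is an assumption about intent). The class and method names are hypothetical, and this is plain Java arithmetic, not the C2 `MulNode` code:

```java
// Hypothetical illustration: a multiple of 256 added to base cannot change
// the low 8 bits, so masking with 255 only depends on base.
class MaskCheckSketch {
    // Assumed reading of "(base + (index + 1) << 8) & 255" from the PR text.
    static long maskedWithOffset(long base, long index) {
        return (base + ((index + 1) << 8)) & 255;
    }

    // What the expression is expected to reduce to (loop invariant in index).
    static long maskedBaseOnly(long base) {
        return base & 255;
    }

    public static void main(String[] args) {
        long[] bases = {0L, 1L, 255L, 256L, 4099L, -7L, Long.MAX_VALUE};
        for (long base : bases) {
            for (long index = -3; index <= 3; index++) {
                // Wrap-around on overflow still preserves the low 8 bits.
                if (maskedWithOffset(base, index) != maskedBaseOnly(base)) {
                    throw new AssertionError("mismatch for base=" + base + ", index=" + index);
                }
            }
        }
        System.out.println("identity holds for all sampled inputs");
    }
}
```

The sketch only samples a few inputs; it does not show how C2 recognizes the pattern, only why dropping the shifted term and the constant `256` is arithmetically safe under the mask.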
PR Review: https://git.openjdk.org/jdk/pull/22856#pullrequestreview-2620865156 From epeter at openjdk.org Mon Feb 17 12:06:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 17 Feb 2025 12:06:17 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v3] In-Reply-To: References: Message-ID: On Sun, 9 Feb 2025 05:59:37 GMT, Jasmine Karthikeyan wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix some tests that now vectorize > > I also updated the benchmark, and got these results: > > Baseline Patch > Benchmark (SIZE) Mode Cnt Score Error Units Score Error Units Improvement > VectorSubword.byteToInt 1024 avgt 12 185.700 ± 0.798 ns/op 37.427 ± 0.276 ns/op (4.96x) > VectorSubword.byteToShort 1024 avgt 12 240.737 ± 1.087 ns/op 23.094 ± 0.502 ns/op (10.42x) > VectorSubword.intToByte 1024 avgt 12 181.680 ± 0.553 ns/op 49.873 ± 1.613 ns/op (3.64x) > VectorSubword.intToShort 1024 avgt 12 176.256 ± 1.414 ns/op 43.933 ± 4.310 ns/op (4.01x) > VectorSubword.shortToByte 1024 avgt 12 245.600 ± 6.217 ns/op 28.426 ± 0.649 ns/op (8.64x) > VectorSubword.shortToInt 1024 avgt 12 178.364 ± 2.921 ns/op 34.140 ± 0.229 ns/op (5.22x) @jaskarth just ping me whenever I should have a look again! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23413#issuecomment-2662917708 From duke at openjdk.org Mon Feb 17 12:22:28 2025 From: duke at openjdk.org (duke) Date: Mon, 17 Feb 2025 12:22:28 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 07:18:52 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization.
>> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: >> >> >> ((base + index << 8) + 256) & 255 >> => MulNode (this PR) >> (base + index << 8) & 255 >> => MulNode (PR #6697) >> base & 255 (loop invariant) >> >> >> Implementation notes: >> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. >> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ >> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > incorporate @eme64's comment suggestions @mernst-github Your change (at version b7a16a17049d7c6ef4f2305dd534fbe5cafc1703) is now ready to be sponsored by a Committer. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2662955125 From duke at openjdk.org Mon Feb 17 12:22:27 2025 From: duke at openjdk.org (Matthias Ernst) Date: Mon, 17 Feb 2025 12:22:27 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24] In-Reply-To: References: Message-ID: <83jq5rJ72L30aVfKQB9eWgyFlsTz6wGFTr8uW7hV8AE=.00b5c762-0bf7-4bb7-b0ae-4da63a0703f6@github.com> On Fri, 14 Feb 2025 07:18:52 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. >> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: >> >> >> ((base + index << 8) + 256) & 255 >> => MulNode (this PR) >> (base + index << 8) & 255 >> => MulNode (PR #6697) >> base & 255 (loop invariant) >> >> >> Implementation notes: >> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. >> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ >> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > incorporate @eme64's comment suggestions Super, glad this worked out! 
I want to return the compliment, thanks for sticking with me :-) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2662952533 From syan at openjdk.org Mon Feb 17 13:10:00 2025 From: syan at openjdk.org (SendaoYan) Date: Mon, 17 Feb 2025 13:10:00 GMT Subject: RFR: 8350197: [UBSAN] Node::dump_idx reported float-cast-overflow Message-ID: Hi all, clang17's UndefinedBehaviorSanitizer reports "runtime error: -inf is outside the range of representable values of type 'unsigned int'" in the function 'Node::dump_idx(bool, outputStream*, Node::DumpConfig*)' at src/hotspot/share/opto/node.cpp:2430. This PR adds an extra check on the argument before passing it to `log10`. Risk is low. Additional testing: - [ ] Jtreg tests (including tier1/2/3 etc.) on linux-x64 with release build - [ ] Jtreg tests (including tier1/2/3 etc.) on linux-aarch64 with release build The code snippet below demonstrates the undefined behaviour of float-cast-overflow: #include <stdio.h> #include <math.h> int input = 0; int main() { printf("result = %lf\n", log10((double)input)); printf("result = %u\n", (unsigned int)log10((double)input)); printf("result = %u\n", input==0 ?
0 : (unsigned int)log10((double)input)); return 0; } > clang -fsanitize=undefined log10.c -lm && ./a.out result = -inf log10.c:9:27: runtime error: -inf is outside the range of representable values of type 'unsigned int' SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior log10.c:9:27 in result = 0 result = 0 ------------- Commit messages: - add necessary braces - 8350197: [UBSAN] Node::dump_idx reported float-cast-overflow Changes: https://git.openjdk.org/jdk/pull/23662/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23662&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350197 Stats: 4 lines in 2 files changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23662.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23662/head:pull/23662 PR: https://git.openjdk.org/jdk/pull/23662 From duke at openjdk.org Mon Feb 17 13:10:22 2025 From: duke at openjdk.org (Matthias Ernst) Date: Mon, 17 Feb 2025 13:10:22 GMT Subject: Integrated: 8346664: C2: Optimize mask check with constant offset In-Reply-To: References: Message-ID: On Sat, 21 Dec 2024 14:08:02 GMT, Matthias Ernst wrote: > Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. > > Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: > > > (base + (index + 1) << 8) & 255 > => MulNode > (base + (index << 8 + 256)) & 255 > => AddNode > ((base + index << 8) + 256) & 255 > > > Currently, `256` is not being recognized as a shifted value. 
This PR enables further reduction: > > > ((base + index << 8) + 256) & 255 > => MulNode (this PR) > (base + index << 8) & 255 > => MulNode (PR #6697) > base & 255 (loop invariant) > > > Implementation notes: > * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. > * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ > * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ This pull request has now been integrated. Changeset: 7f3ecb4d Author: Matthias Ernst URL: https://git.openjdk.org/jdk/commit/7f3ecb4d92fdb084ce632cab484cf4578487b090 Stats: 476 lines in 5 files changed: 291 ins; 49 del; 136 mod 8346664: C2: Optimize mask check with constant offset Reviewed-by: epeter, qamai ------------- PR: https://git.openjdk.org/jdk/pull/22856 From rcastanedalo at openjdk.org Mon Feb 17 13:16:57 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 17 Feb 2025 13:16:57 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v3] In-Reply-To: References: Message-ID: > This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node. This information can be useful e.g. in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). Here is an example of a memory subgraph with the extended "Show types" filter enabled: > > ![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717) > > #### Testing > > - tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). > > - Tested IGV manually on a few selected graphs. 
Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`). Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: Dump alias type information for each node ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23621/files - new: https://git.openjdk.org/jdk/pull/23621/files/53258db4..af144195 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23621&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23621&range=01-02 Stats: 29 lines in 1 file changed: 29 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23621.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23621/head:pull/23621 PR: https://git.openjdk.org/jdk/pull/23621 From qamai at openjdk.org Mon Feb 17 13:38:21 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 17 Feb 2025 13:38:21 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24] In-Reply-To: <83jq5rJ72L30aVfKQB9eWgyFlsTz6wGFTr8uW7hV8AE=.00b5c762-0bf7-4bb7-b0ae-4da63a0703f6@github.com> References: <83jq5rJ72L30aVfKQB9eWgyFlsTz6wGFTr8uW7hV8AE=.00b5c762-0bf7-4bb7-b0ae-4da63a0703f6@github.com> Message-ID: On Mon, 17 Feb 2025 12:18:23 GMT, Matthias Ernst wrote: >> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: >> >> incorporate @eme64's comment suggestions > > Super, glad this worked out! I want to return the compliment, thanks for sticking with me :-) @mernst-github Thanks a lot for going with us through this process, your contribution will be engraved in the (git) history.
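The reduction discussed in the 8346664 thread relies on the low eight bits of any value shifted left by 8 being zero, so adding such a value cannot change `base & 255`. The following is a minimal sketch that checks this identity on random inputs. The class and method names are illustrative only, and the parenthesization `base + ((index + 1) << 8)` is an assumed reading of the flattened expression in the PR description. It is not code from the patch.

```java
import java.util.Random;

// Illustrative check of the masked-sum identity; not part of the JDK change.
final class MaskedSumCheck {
    // Assumed reading of the original expression: (base + ((index + 1) << 8)) & 255
    static long original(long base, long index) {
        return (base + ((index + 1) << 8)) & 255;
    }

    // Fully reduced form claimed by the PR description: base & 255
    static long reduced(long base) {
        return base & 255;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            long base = random.nextLong();
            long index = random.nextLong();
            // Two's-complement wrap-around does not affect the low 8 bits of the sum.
            if (original(base, index) != reduced(base)) {
                throw new AssertionError("mismatch for base=" + base + ", index=" + index);
            }
        }
        System.out.println("identity holds for all sampled inputs");
    }
}
```

Because the shifted term contributes nothing to bits 0..7, the masked result depends on `base` alone, which is what makes the check loop invariant.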
------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2663161677 From rcastanedalo at openjdk.org Mon Feb 17 13:46:10 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 17 Feb 2025 13:46:10 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v2] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 12:47:07 GMT, Quan Anh Mai wrote: > In `MemNode::dump_adr_type`, we have this: > > ``` > ciField* field = atp->field(); > if (field) { > st->print(", name="); > field->print_name_on(st); > } > st->print(", idx=%d;", atp->index()); > ``` > > By details, I mean the things like `field->print_name_on(st)` here. It would be good if we can know all the details about the memory slice from looking at the IGV, we can have it being an option that can be toggled if you are afraid that it would be too long. The latest commit (af144195) dumps more information about the alias type ("slice") related to each node. I did not add this additional information to the type info shown by the "Show types" filter for compactness, but the new properties can be easily inspected in the Properties window. If you really want to show them on every node, you can e.g. set the "Node Text" field in the IGV Options window to something like `[idx] [name] ([alias_index] : [alias_field][alias_element])`.
This would produce something like this (note the additional "(12 next)" text in the label of 444 LoadN): ![node-labels](https://github.com/user-attachments/assets/bc63338e-cbd1-49b3-a662-c1c7354878d8) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2663181170 From mli at openjdk.org Mon Feb 17 13:49:17 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 17 Feb 2025 13:49:17 GMT Subject: RFR: 8349908: RISC-V: C2 SelectFromTwoVector In-Reply-To: References: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> Message-ID: On Mon, 17 Feb 2025 11:13:09 GMT, Fei Yang wrote: >> Hi, >> Can you help to review the patch? >> This optimization is mainly for the vector API. >> >> Thanks >> >> ## Test >> >> ### jtreg >> test/jdk/jdk/incubator/vector/ >> >> ### Performance >> run on bananapi >> >> master vs patch >> >> Benchmark | (size) | Mode | Cnt | Score -master | Error - master | Score - patch | Error - patch | Units | Improvement (master / patch) >> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- >> SelectFromBenchmark.selectFromByteVector | 1024 | avgt | 10 | 26422.495 | 674.565 | 721.604 | 1.036 | ns/op | 36.616 >> SelectFromBenchmark.selectFromByteVector | 2048 | avgt | 10 | 53964.411 | 1751.618 | 1385.24 | 0.956 | ns/op | 38.957 >> SelectFromBenchmark.selectFromDoubleVector | 1024 | avgt | 10 | 218430.616 | 1369.409 | 7739.774 | 14.408 | ns/op | 28.222 >> SelectFromBenchmark.selectFromDoubleVector | 2048 | avgt | 10 | 387889.456 | 7889.791 | 16197.77 | 66.775 | ns/op | 23.947 >> SelectFromBenchmark.selectFromFloatVector | 1024 | avgt | 10 | 103483.717 | 492.525 | 3580.358 | 29.127 | ns/op | 28.903 >> SelectFromBenchmark.selectFromFloatVector | 2048 | avgt | 10 | 226125.02 | 3118.836 | 7797.025 | 4.346 | ns/op | 29.001 >> SelectFromBenchmark.selectFromIntVector | 1024 | avgt | 10 | 97007.999 | 2607.711 | 2898.38 | 0.833 | ns/op | 33.47 >> SelectFromBenchmark.selectFromIntVector | 2048 | avgt | 10 | 
222303.308 | 3096.615 | 6398.214 | 30.345 | ns/op | 34.745 >> SelectFromBenchmark.selectFromLongVector | 1024 | avgt | 10 | 245033.436 | 1652.527 | 6307.773 | 24.597 | ns/op | 38.846 >> SelectFromBenchmark.selectFromLongVector | 2048 | avgt | 10 | 438503.547 | 5972.265 | 17215.996 | 167.442 | ns/op | 25.471 >> SelectFromBenchmark.selectFromShortVector | 1024 | avgt | 10 | 53632.502 | 2159.433 | 1418.215 | 2.953 | ns/op | 37.817 >> SelectFromBenchmark.selectFromShortVector | 2048 | avgt | 10 | 111764.327 | 1220.509 | 3061.386 | 14.716 | ns/op | 36.508 >> >> > > src/hotspot/cpu/riscv/riscv_v.ad line 103: > >> 101: break; >> 102: case Op_SelectFromTwoVector: >> 103: return true; > > Seems not necessary to add one more case here as the default will return true for this case. Logically, yes. It's not necessary, my thought is to make the code explicit, so friendly to read code. But I can remove it if you think it's better to do so. > src/hotspot/cpu/riscv/riscv_v.ad line 4447: > >> 4445: __ vsetvli_helper(bt, Matcher::vector_length(this)); >> 4446: __ vrgather_vv(as_VectorRegister($dst$$reg), as_VectorRegister($src1$$reg), >> 4447: as_VectorRegister($index$$reg)); > > I suppose the indices here comes with the same element width as the two vector sources? The spec says `vrgather.vv` uses same SEW/LMUL for both the data and indices. Yes, I think so. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23614#discussion_r1958266514 PR Review Comment: https://git.openjdk.org/jdk/pull/23614#discussion_r1958269972 From rcastanedalo at openjdk.org Mon Feb 17 13:56:22 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 17 Feb 2025 13:56:22 GMT Subject: RFR: 8348645: IGV: visualize live ranges In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 09:59:48 GMT, Roberto Castañeda Lozano wrote: > This changeset extends IGV with live range visualization.
It introduces live ranges as first-class IGV entities and displays them along with the control-flow graph in the CFG view. Visualizing liveness information should hopefully make C2's register allocator easier to understand, diagnose, debug, and enhance. > > Live ranges are visible in C2 phases where liveness information is available, that is, phases `Initial liveness` to `Fix up spills` at IGV print level 4 or greater. For example, running a debug build of the JVM as follows: > > > java -Xbatch -XX:CompileCommand=IGVPrintLevel,java.util.HashMap::newNode,4 > > > produces the following visualization for the `Initial spilling` phase: > > ![initial-spilling](https://github.com/user-attachments/assets/1ecf74f5-92a8-4866-b1ec-2323bb0c428e) > > Live ranges are first-class IGV entities, meaning that the user can: > > - search, select, and extract them; > > ![search-extract](https://github.com/user-attachments/assets/8e0dfa59-457f-49cb-b2b5-1d202301c79d) > > - examine their properties in the `Properties` window or via tooltips; > > ![properties](https://github.com/user-attachments/assets/68d2d23b-b986-4d2e-835c-b661bce0de23) > > - navigate to related IGV entities via a pop-up menu; and > > ![popup](https://github.com/user-attachments/assets/21de2fef-d36a-42d5-b828-2696d87a18ea) > > - program filters that act on them according to their properties. > > ![filters](https://github.com/user-attachments/assets/e993b067-d0b8-452c-a885-c4e601e31e1c) > > Live ranges are connected to nodes by a use-def relation: a node can define zero or one live ranges, and use multiple live ranges; a live range can be defined and used by multiple nodes. Consequently, a live range in IGV is visible if and only if all its related nodes are visible (fully or semi-transparently). Generally, the start and end of a live range are vertically aligned with the nodes that first define and last use the live range. To reflect accurately the semantics of Phi nodes w.r.t.
liveness, the visualization treats live ranges related by Phi nodes specially: live ranges used by a Phi node end at the bottom of the corresponding predecessor basic blocks, whereas live ranges defined by a Phi node start at the top of the node's basic block. The following screenshot shows an example of a Phi node (`48 Phi`) joining live ranges `L8` and `L13` into `L15`: > > ![phi](https://github.com/user-attachments/assets/0ef8aa1d-523d-4391-982e-6b74c2016a3c) > > The changeset extends the IGV graph printing logic in HotSpot t... > @robcasloz thanks a lot for this amazing improvement! > > Just a quick one: I noticed that, with your `java -Xbatch -XX:CompileCommand=IGVPrintLevel,java.util.HashMap::newNode,4` example with the `Initial spilling` phase selected, if I click on the live range icon image to make it disappear, I get a `NullPointerException` image (no further details apart from the exception) Thanks for the report Damon, will investigate! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23558#issuecomment-2663204622 From mli at openjdk.org Mon Feb 17 14:04:55 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 17 Feb 2025 14:04:55 GMT Subject: RFR: 8349908: RISC-V: C2 SelectFromTwoVector [v2] In-Reply-To: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> References: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> Message-ID: > Hi, > Can you help to review the patch? > This optimization is mainly for the vector API. 
> > Thanks > > ## Test > > ### jtreg > test/jdk/jdk/incubator/vector/ > > ### Performance > run on bananapi > > master vs patch > > Benchmark | (size) | Mode | Cnt | Score -master | Error - master | Score - patch | Error - patch | Units | Improvement (master / patch) > -- | -- | -- | -- | -- | -- | -- | -- | -- | -- > SelectFromBenchmark.selectFromByteVector | 1024 | avgt | 10 | 26422.495 | 674.565 | 721.604 | 1.036 | ns/op | 36.616 > SelectFromBenchmark.selectFromByteVector | 2048 | avgt | 10 | 53964.411 | 1751.618 | 1385.24 | 0.956 | ns/op | 38.957 > SelectFromBenchmark.selectFromDoubleVector | 1024 | avgt | 10 | 218430.616 | 1369.409 | 7739.774 | 14.408 | ns/op | 28.222 > SelectFromBenchmark.selectFromDoubleVector | 2048 | avgt | 10 | 387889.456 | 7889.791 | 16197.77 | 66.775 | ns/op | 23.947 > SelectFromBenchmark.selectFromFloatVector | 1024 | avgt | 10 | 103483.717 | 492.525 | 3580.358 | 29.127 | ns/op | 28.903 > SelectFromBenchmark.selectFromFloatVector | 2048 | avgt | 10 | 226125.02 | 3118.836 | 7797.025 | 4.346 | ns/op | 29.001 > SelectFromBenchmark.selectFromIntVector | 1024 | avgt | 10 | 97007.999 | 2607.711 | 2898.38 | 0.833 | ns/op | 33.47 > SelectFromBenchmark.selectFromIntVector | 2048 | avgt | 10 | 222303.308 | 3096.615 | 6398.214 | 30.345 | ns/op | 34.745 > SelectFromBenchmark.selectFromLongVector | 1024 | avgt | 10 | 245033.436 | 1652.527 | 6307.773 | 24.597 | ns/op | 38.846 > SelectFromBenchmark.selectFromLongVector | 2048 | avgt | 10 | 438503.547 | 5972.265 | 17215.996 | 167.442 | ns/op | 25.471 > SelectFromBenchmark.selectFromShortVector | 1024 | avgt | 10 | 53632.502 | 2159.433 | 1418.215 | 2.953 | ns/op | 37.817 > SelectFromBenchmark.selectFromShortVector | 2048 | avgt | 10 | 111764.327 | 1220.509 | 3061.386 | 14.716 | ns/op | 36.508 > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: minor ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23614/files - new: 
https://git.openjdk.org/jdk/pull/23614/files/a28926b6..e8b29c62 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23614&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23614&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23614.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23614/head:pull/23614 PR: https://git.openjdk.org/jdk/pull/23614 From fyang at openjdk.org Mon Feb 17 14:04:55 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 17 Feb 2025 14:04:55 GMT Subject: RFR: 8349908: RISC-V: C2 SelectFromTwoVector [v2] In-Reply-To: References: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> Message-ID: <5RYEeoFim3MXrY8HgFG23gTa1aoDcQUUjYRZNvwC8yI=.dee4e6a4-9bd2-4207-ab0e-8e4141168493@github.com> On Mon, 17 Feb 2025 13:44:18 GMT, Hamlin Li wrote: >> src/hotspot/cpu/riscv/riscv_v.ad line 103: >> >>> 101: break; >>> 102: case Op_SelectFromTwoVector: >>> 103: return true; >> >> Seems not necessary to add one more case here as the default will return true for this case. > > Logically, yes. It's not necessary, my thought is to make the code explicit, so friendly to read code. But I can remove it if you think it's better to do so. Yes, please. The practice here is to only list the opcodes whose availability would also depend on some other facts like vector size, etc. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23614#discussion_r1958288153 From mli at openjdk.org Mon Feb 17 14:04:55 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 17 Feb 2025 14:04:55 GMT Subject: RFR: 8349908: RISC-V: C2 SelectFromTwoVector [v2] In-Reply-To: <5RYEeoFim3MXrY8HgFG23gTa1aoDcQUUjYRZNvwC8yI=.dee4e6a4-9bd2-4207-ab0e-8e4141168493@github.com> References: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> <5RYEeoFim3MXrY8HgFG23gTa1aoDcQUUjYRZNvwC8yI=.dee4e6a4-9bd2-4207-ab0e-8e4141168493@github.com> Message-ID: On Mon, 17 Feb 2025 13:59:29 GMT, Fei Yang wrote: >> Logically, yes. It's not necessary, my thought is to make the code explicit, so friendly to read code. But I can remove it if you think it's better to do so. > > Yes, please. The practice here is to only list the opcodes whose availability would also depend on some other facts like vector size, etc. Sure, fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23614#discussion_r1958291605 From roland at openjdk.org Mon Feb 17 14:19:12 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 17 Feb 2025 14:19:12 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Mon, 11 Nov 2024 14:40:09 GMT, Emanuel Peter wrote: > Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. > > **Background** > > With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. 
> > **Problem** > > So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. > > > MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); > MemorySegment nativeUnaligned = nativeAligned.asSlice(1); > test3(nativeUnaligned); > > > When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! > > static void test3(MemorySegment ms) { > for (int i = 0; i < RANGE; i++) { > long adr = i * 4L; > int v = ms.get(ELEMENT_LAYOUT, adr); > ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); > } > } > > > **Solution: Runtime Checks - Predicate and Multiversioning** > > Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. > > I came up with 2 options where to place the runtime checks: > - A new "auto vectorization" Parse Predicate: > - This only works when predicates are available. > - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. > - Multiversion the loop: > - Create 2 copies of the loop (fast and slow loops). > - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take > - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code. > - We "stall" the `... 
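The multiversioning idea in the quoted description can be modelled in plain Java: a runtime alignment check selects between a loop that is allowed to assume alignment and a conservative loop that assumes nothing. This is only a conceptual sketch. The class name, the 16-byte `VECTOR_BYTES` constant, and the manual 4-way unrolling that stands in for vectorization are all assumptions made for illustration, and none of it reflects how C2 actually represents `multiversion_if`, `fast_loop`, or `slow_loop` internally.

```java
// Conceptual model of loop multiversioning on an alignment check; not C2 code.
final class MultiversionSketch {
    // Hypothetical vector width in bytes, chosen only for the example.
    static final int VECTOR_BYTES = 16;

    // Runtime check that would guard the speculative alignment assumption.
    static boolean isAligned(long baseAddress) {
        return (baseAddress & (VECTOR_BYTES - 1)) == 0;
    }

    // Increments every element and reports which loop version ran.
    static String incrementAll(long baseAddress, int[] data) {
        if (isAligned(baseAddress)) {
            // "fast_loop": stands in for the vectorized version (4 ints per step).
            int i = 0;
            for (; i + 4 <= data.length; i += 4) {
                data[i]++;
                data[i + 1]++;
                data[i + 2]++;
                data[i + 3]++;
            }
            for (; i < data.length; i++) {
                data[i]++; // scalar tail
            }
            return "fast_loop";
        }
        // "slow_loop": no alignment assumption, plain scalar code.
        for (int i = 0; i < data.length; i++) {
            data[i]++;
        }
        return "slow_loop";
    }

    public static void main(String[] args) {
        System.out.println(incrementAll(4096L, new int[10])); // aligned base
        System.out.println(incrementAll(4097L, new int[10])); // base offset by one byte
    }
}
```

Both versions compute the same result, so an unaligned `base` costs performance but never correctness, which is the property the description relies on.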
What are the architectures affected by this? Isn't it the case that x86 and aarch64 are unaffected by this? Is the motivation to use this as a way to do prep work for alias analysis? Do you intend to use a single deoptimization reason for all vectorization related predicates? (that is when you take care of aliasing, are you going to to use the same reason for aliasing and alignment checks) I went over the code and it looks reasonable to me. I intend to do a more careful review later. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2663262133 From fyang at openjdk.org Mon Feb 17 14:39:10 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 17 Feb 2025 14:39:10 GMT Subject: RFR: 8349908: RISC-V: C2 SelectFromTwoVector [v2] In-Reply-To: References: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> Message-ID: On Mon, 17 Feb 2025 14:04:55 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review the patch? >> This optimization is mainly for the vector API. 
>> >> Thanks >> >> ## Test >> >> ### jtreg >> test/jdk/jdk/incubator/vector/ >> >> ### Performance >> run on bananapi >> >> master vs patch >> >> Benchmark | (size) | Mode | Cnt | Score -master | Error - master | Score - patch | Error - patch | Units | Improvement (master / patch) >> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- >> SelectFromBenchmark.selectFromByteVector | 1024 | avgt | 10 | 26422.495 | 674.565 | 721.604 | 1.036 | ns/op | 36.616 >> SelectFromBenchmark.selectFromByteVector | 2048 | avgt | 10 | 53964.411 | 1751.618 | 1385.24 | 0.956 | ns/op | 38.957 >> SelectFromBenchmark.selectFromDoubleVector | 1024 | avgt | 10 | 218430.616 | 1369.409 | 7739.774 | 14.408 | ns/op | 28.222 >> SelectFromBenchmark.selectFromDoubleVector | 2048 | avgt | 10 | 387889.456 | 7889.791 | 16197.77 | 66.775 | ns/op | 23.947 >> SelectFromBenchmark.selectFromFloatVector | 1024 | avgt | 10 | 103483.717 | 492.525 | 3580.358 | 29.127 | ns/op | 28.903 >> SelectFromBenchmark.selectFromFloatVector | 2048 | avgt | 10 | 226125.02 | 3118.836 | 7797.025 | 4.346 | ns/op | 29.001 >> SelectFromBenchmark.selectFromIntVector | 1024 | avgt | 10 | 97007.999 | 2607.711 | 2898.38 | 0.833 | ns/op | 33.47 >> SelectFromBenchmark.selectFromIntVector | 2048 | avgt | 10 | 222303.308 | 3096.615 | 6398.214 | 30.345 | ns/op | 34.745 >> SelectFromBenchmark.selectFromLongVector | 1024 | avgt | 10 | 245033.436 | 1652.527 | 6307.773 | 24.597 | ns/op | 38.846 >> SelectFromBenchmark.selectFromLongVector | 2048 | avgt | 10 | 438503.547 | 5972.265 | 17215.996 | 167.442 | ns/op | 25.471 >> SelectFromBenchmark.selectFromShortVector | 1024 | avgt | 10 | 53632.502 | 2159.433 | 1418.215 | 2.953 | ns/op | 37.817 >> SelectFromBenchmark.selectFromShortVector | 2048 | avgt | 10 | 111764.327 | 1220.509 | 3061.386 | 14.716 | ns/op | 36.508 >> >> > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > minor LGTM. Thanks for the update. 
src/hotspot/cpu/riscv/riscv_v.ad line 4458: > 4456: } > 4457: __ vrgather_vv(as_VectorRegister($dst$$reg), as_VectorRegister($src2$$reg), > 4458: as_VectorRegister($tmp$$reg), Assembler::v0_t); Nit: need one extra space to fix the indentation for this line ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23614#pullrequestreview-2621243021 PR Review Comment: https://git.openjdk.org/jdk/pull/23614#discussion_r1958340720 From jkarthikeyan at openjdk.org Mon Feb 17 15:03:14 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 17 Feb 2025 15:03:14 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v3] In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 12:03:30 GMT, Emanuel Peter wrote: >> I also updated the benchmark, and got these results:
>>
>>                                                   Baseline                   Patch
>> Benchmark                  (SIZE)  Mode  Cnt    Score   Error  Units    Score   Error  Units  Improvement
>> VectorSubword.byteToInt      1024  avgt   12  185.700 ± 0.798  ns/op   37.427 ± 0.276  ns/op      (4.96x)
>> VectorSubword.byteToShort    1024  avgt   12  240.737 ± 1.087  ns/op   23.094 ± 0.502  ns/op     (10.42x)
>> VectorSubword.intToByte      1024  avgt   12  181.680 ± 0.553  ns/op   49.873 ± 1.613  ns/op      (3.64x)
>> VectorSubword.intToShort     1024  avgt   12  176.256 ± 1.414  ns/op   43.933 ± 4.310  ns/op      (4.01x)
>> VectorSubword.shortToByte    1024  avgt   12  245.600 ± 6.217  ns/op   28.426 ± 0.649  ns/op      (8.64x)
>> VectorSubword.shortToInt     1024  avgt   12  178.364 ± 2.921  ns/op   34.140 ± 0.229  ns/op      (5.22x)
>
> @jaskarth just ping me whenever I should have a look again! @eme64 I think it should be good for another look over! I've addressed your review comments in the last commit. About the potential for performance degradation, I think it would be unlikely since the code generated by the cast is quite small (as it only needs to truncate or sign-extend) and the patch increases the amount of possible code that can auto-vectorize.
The one case that I can think of is that it might cause code that would be otherwise unprofitable to become vectorizable, but that would be because we don't have a cost model yet. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23413#issuecomment-2663375243 From rehn at openjdk.org Mon Feb 17 15:11:22 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Mon, 17 Feb 2025 15:11:22 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v3] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:04:24 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch to implement MulReductionVI/MulReductionVL/MulReductionVF/MulReductionVD? >> This optimization is mainly for the vector API. >> On riscv, there is no straightforward instructions to do it, but we can do it with a reduction tree, which could reduce the time complexity to lg(N). >> >> >> Thanks >> >> ## Test >> >> ### jtreg >> test/jdk/jdk/incubator/vector/ >> >> ### Performance >> >> run on bananapi >> >> master vs patch >> >> Benchmark | (size) | Mode | Cnt | Score - master | Error - patch | Score - patch | Error - patch | Units | Improvement >> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- >> ByteMaxVector.MULLanes | 1024 | avgt | 10 | 11170.052 | 499.676 | 1294.424 | 8.346 | ns/op | 88.40% >> ByteMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 12514.587 | 55.06 | 1413.028 | 0.259 | ns/op | 88.70% >> DoubleMaxVector.MULLanes | 1024 | avgt | 10 | 57672.51 | 1750.417 | 4775.633 | 1.454 | ns/op | 91.70% >> DoubleMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63656.523 | 1063.048 | 5899.259 | 1.692 | ns/op | 90.70% >> FloatMaxVector.MULLanes | 1024 | avgt | 10 | 30997.218 | 728.73 | 2473.069 | 5.84 | ns/op | 92.00% >> FloatMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 35515.329 | 227.873 | 3284.17 | 0.608 | ns/op | 90.80% >> IntMaxVector.MULLanes | 1024 | avgt | 10 | 31130.453 | 878.261 | 3304.118 | 5.96 | ns/op | 89.40% >> IntMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 36851.976 | 394.001 | 
3969.407 | 0.511 | ns/op | 89.20% >> LongMaxVector.MULLanes | 1024 | avgt | 10 | 58795.752 | 1030.985 | 6883.995 | 3.15 | ns/op | 88.30% >> LongMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63642.904 | 386.521 | 7892.735 | 9.359 | ns/op | 87.60% >> ShortMaxVector.MULLanes | 1024 | avgt | 10 | 16857.441 | 762.428 | 2287.141 | 0.186 | ns/op | 86.40% >> ShortMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 21171.375 | 74.684 | 2532.913 | 0.274 | ns/op | 88.00% >> >> > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > mimor Thank you! ------------- Marked as reviewed by rehn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23580#pullrequestreview-2621326645 From mli at openjdk.org Mon Feb 17 15:15:14 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 17 Feb 2025 15:15:14 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v3] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:04:24 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch to implement MulReductionVI/MulReductionVL/MulReductionVF/MulReductionVD? >> This optimization is mainly for the vector API. >> On riscv, there is no straightforward instructions to do it, but we can do it with a reduction tree, which could reduce the time complexity to lg(N). 
>> >> >> Thanks >> >> ## Test >> >> ### jtreg >> test/jdk/jdk/incubator/vector/ >> >> ### Performance >> >> run on bananapi >> >> master vs patch >> >> Benchmark | (size) | Mode | Cnt | Score - master | Error - patch | Score - patch | Error - patch | Units | Improvement >> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- >> ByteMaxVector.MULLanes | 1024 | avgt | 10 | 11170.052 | 499.676 | 1294.424 | 8.346 | ns/op | 88.40% >> ByteMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 12514.587 | 55.06 | 1413.028 | 0.259 | ns/op | 88.70% >> DoubleMaxVector.MULLanes | 1024 | avgt | 10 | 57672.51 | 1750.417 | 4775.633 | 1.454 | ns/op | 91.70% >> DoubleMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63656.523 | 1063.048 | 5899.259 | 1.692 | ns/op | 90.70% >> FloatMaxVector.MULLanes | 1024 | avgt | 10 | 30997.218 | 728.73 | 2473.069 | 5.84 | ns/op | 92.00% >> FloatMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 35515.329 | 227.873 | 3284.17 | 0.608 | ns/op | 90.80% >> IntMaxVector.MULLanes | 1024 | avgt | 10 | 31130.453 | 878.261 | 3304.118 | 5.96 | ns/op | 89.40% >> IntMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 36851.976 | 394.001 | 3969.407 | 0.511 | ns/op | 89.20% >> LongMaxVector.MULLanes | 1024 | avgt | 10 | 58795.752 | 1030.985 | 6883.995 | 3.15 | ns/op | 88.30% >> LongMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63642.904 | 386.521 | 7892.735 | 9.359 | ns/op | 87.60% >> ShortMaxVector.MULLanes | 1024 | avgt | 10 | 16857.441 | 762.428 | 2287.141 | 0.186 | ns/op | 86.40% >> ShortMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 21171.375 | 74.684 | 2532.913 | 0.274 | ns/op | 88.00% >> >> > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > mimor Thank you! > Thank you! 
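The reduction tree mentioned in the quoted 8321003 description can be illustrated with a scalar model: each step multiplies the upper half of the remaining lanes into the lower half, so N lanes are reduced in log2(N) steps instead of N - 1 sequential multiplications. This is a sketch under stated assumptions. The class and method names are illustrative, the lane count is assumed to be a power of two, and the actual RISC-V instruction sequence used by the patch is not shown in this thread.

```java
// Scalar model of a multiplicative reduction tree; names are illustrative.
final class MulReductionTree {
    // Multiplies all lanes together, halving the active width at each step.
    static int mulReduce(int[] lanes) {
        int n = lanes.length;
        if (n == 0 || (n & (n - 1)) != 0) {
            throw new IllegalArgumentException("lane count must be a power of two");
        }
        int[] work = lanes.clone();
        for (int width = n / 2; width >= 1; width /= 2) {
            // One tree level: on vector hardware this would be a single
            // element-wise multiply of two half-width vectors.
            for (int i = 0; i < width; i++) {
                work[i] *= work[i + width];
            }
        }
        return work[0];
    }

    public static void main(String[] args) {
        // 1*2*...*8 computed in three tree levels.
        System.out.println(mulReduce(new int[] {1, 2, 3, 4, 5, 6, 7, 8}));
    }
}
```

For integer lanes the regrouping is exact because multiplication is associative and commutative even with overflow. For floating-point lanes the tree order can round differently from a strict left-to-right product.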
------------- PR Comment: https://git.openjdk.org/jdk/pull/23580#issuecomment-2663405696 From mli at openjdk.org Mon Feb 17 15:15:27 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 17 Feb 2025 15:15:27 GMT Subject: RFR: 8349908: RISC-V: C2 SelectFromTwoVector [v3] In-Reply-To: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> References: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> Message-ID: <3booXWD4LZQaI9r1VftWWjuFUN71kNIrwsgNTXjSkQQ=.d6844332-2d54-4dca-a44e-4e8bf6339e5d@github.com> > Hi, > Can you help to review the patch? > This optimization is mainly for the vector API. > > Thanks > > ## Test > > ### jtreg > test/jdk/jdk/incubator/vector/ > > ### Performance > run on bananapi > > master vs patch > > Benchmark | (size) | Mode | Cnt | Score -master | Error - master | Score - patch | Error - patch | Units | Improvement (master / patch) > -- | -- | -- | -- | -- | -- | -- | -- | -- | -- > SelectFromBenchmark.selectFromByteVector | 1024 | avgt | 10 | 26422.495 | 674.565 | 721.604 | 1.036 | ns/op | 36.616 > SelectFromBenchmark.selectFromByteVector | 2048 | avgt | 10 | 53964.411 | 1751.618 | 1385.24 | 0.956 | ns/op | 38.957 > SelectFromBenchmark.selectFromDoubleVector | 1024 | avgt | 10 | 218430.616 | 1369.409 | 7739.774 | 14.408 | ns/op | 28.222 > SelectFromBenchmark.selectFromDoubleVector | 2048 | avgt | 10 | 387889.456 | 7889.791 | 16197.77 | 66.775 | ns/op | 23.947 > SelectFromBenchmark.selectFromFloatVector | 1024 | avgt | 10 | 103483.717 | 492.525 | 3580.358 | 29.127 | ns/op | 28.903 > SelectFromBenchmark.selectFromFloatVector | 2048 | avgt | 10 | 226125.02 | 3118.836 | 7797.025 | 4.346 | ns/op | 29.001 > SelectFromBenchmark.selectFromIntVector | 1024 | avgt | 10 | 97007.999 | 2607.711 | 2898.38 | 0.833 | ns/op | 33.47 > SelectFromBenchmark.selectFromIntVector | 2048 | avgt | 10 | 222303.308 | 3096.615 | 6398.214 | 30.345 | ns/op | 34.745 > 
SelectFromBenchmark.selectFromLongVector | 1024 | avgt | 10 | 245033.436 | 1652.527 | 6307.773 | 24.597 | ns/op | 38.846 > SelectFromBenchmark.selectFromLongVector | 2048 | avgt | 10 | 438503.547 | 5972.265 | 17215.996 | 167.442 | ns/op | 25.471 > SelectFromBenchmark.selectFromShortVector | 1024 | avgt | 10 | 53632.502 | 2159.433 | 1418.215 | 2.953 | ns/op | 37.817 > SelectFromBenchmark.selectFromShortVector | 2048 | avgt | 10 | 111764.327 | 1220.509 | 3061.386 | 14.716 | ns/op | 36.508 > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: space ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23614/files - new: https://git.openjdk.org/jdk/pull/23614/files/e8b29c62..086e3023 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23614&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23614&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23614.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23614/head:pull/23614 PR: https://git.openjdk.org/jdk/pull/23614 From mli at openjdk.org Mon Feb 17 15:15:28 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 17 Feb 2025 15:15:28 GMT Subject: RFR: 8349908: RISC-V: C2 SelectFromTwoVector [v2] In-Reply-To: References: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> Message-ID: On Mon, 17 Feb 2025 14:34:56 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> minor > > src/hotspot/cpu/riscv/riscv_v.ad line 4458: > >> 4456: } >> 4457: __ vrgather_vv(as_VectorRegister($dst$$reg), as_VectorRegister($src2$$reg), >> 4458: as_VectorRegister($tmp$$reg), Assembler::v0_t); > > Nit: need one extra space to fix the indentation for this line sure, fixed. 
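For readers following the `vrgather_vv` review comments on 8349908, the lane selection that a two-vector select performs can be modelled with arrays. The sketch assumes the semantics of the Vector API's two-vector `selectFrom`: each index is wrapped into the range [0, 2 * VLENGTH), values below VLENGTH select from the first source and the rest select from the second. The class and method names are hypothetical, and the model says nothing about how the patch computes its mask or temporary index register.

```java
// Scalar model of selecting lanes from two source vectors; names are hypothetical.
final class SelectFromTwoModel {
    static int[] selectFrom(int[] index, int[] src1, int[] src2) {
        int n = src1.length;
        if (src2.length != n || index.length != n) {
            throw new IllegalArgumentException("all vectors must have the same length");
        }
        int[] dst = new int[n];
        for (int i = 0; i < n; i++) {
            // Assumed wrapping rule: indices are taken modulo 2 * n.
            int wrapped = Math.floorMod(index[i], 2 * n);
            dst[i] = wrapped < n ? src1[wrapped] : src2[wrapped - n];
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] result = selectFrom(
                new int[] {0, 5, 2, 7},
                new int[] {10, 11, 12, 13},
                new int[] {20, 21, 22, 23});
        System.out.println(java.util.Arrays.toString(result));
    }
}
```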
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23614#discussion_r1958398277 From epeter at openjdk.org Mon Feb 17 15:28:13 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 17 Feb 2025 15:28:13 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 14:16:59 GMT, Roland Westrelin wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! >> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. 
But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > What are the architectures affected by this? Isn't it the case that x86 and aarch64 are unaffected by this? Is the motivation to use this as a way to do prep work for alias analysis? > > Do you intend to use a single deoptimization reason for all vectorization related predicates? (that is when you take care of aliasing, are you going to to use the same reason for aliasing and alignment checks) > > I went over the code and it looks reasonable to me. I intend to do a more careful review later. @rwestrel Thanks for having a first look! > What are the architectures affected by this? Isn't it the case that x86 and aarch64 are unaffected by this? Yes, x86 and aarch64 are unaffected, as far as I know. Well, we can simulate strict alignment with `-XX:+AlignVector`, and there it should behave correctly, and it currently fails with the `-XX:+VerifyAlignVector`. It would be nice if that was not the case, so that we can write tests with arbitrary alignment, and turn on those flags freely. 
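The runtime check discussed above can be illustrated in plain Java. This is only a minimal sketch under stated assumptions, not the C2 implementation: the class name `AlignmentMultiversionSketch` and its methods are hypothetical, the two loops are textually identical here (in C2 only the speculating `fast_loop` would be vectorized), and it assumes a JDK where `java.lang.foreign` is final (JDK 22 or later).

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration of "check the base alignment at runtime, then pick a loop version".
final class AlignmentMultiversionSketch {
    // Unaligned layout, so that access through a misaligned base does not throw.
    static final ValueLayout.OfInt ELEMENT_LAYOUT = ValueLayout.JAVA_INT_UNALIGNED;

    // True if the segment's base address is a multiple of alignmentBytes (a power of two).
    static boolean isAligned(MemorySegment ms, long alignmentBytes) {
        return (ms.address() & (alignmentBytes - 1)) == 0;
    }

    // Increments 'count' ints; returns true if the "fast" (alignment-assuming) version ran.
    static boolean incrementAll(MemorySegment ms, int count, long alignmentBytes) {
        if (isAligned(ms, alignmentBytes)) {
            // Stand-in for the fast loop: a compiler could assume an aligned base here.
            for (int i = 0; i < count; i++) {
                long adr = i * 4L;
                ms.set(ELEMENT_LAYOUT, adr, ms.get(ELEMENT_LAYOUT, adr) + 1);
            }
            return true;
        }
        // Stand-in for the slow loop: no alignment assumption is made.
        for (int i = 0; i < count; i++) {
            long adr = i * 4L;
            ms.set(ELEMENT_LAYOUT, adr, ms.get(ELEMENT_LAYOUT, adr) + 1);
        }
        return false;
    }

    public static void main(String[] args) {
        MemorySegment nativeAligned = Arena.ofAuto().allocate(16 * 4 + 1, 8);
        MemorySegment nativeUnaligned = nativeAligned.asSlice(1);
        System.out.println("aligned base took fast version: " + incrementAll(nativeAligned, 16, 8));
        System.out.println("sliced base took fast version: " + incrementAll(nativeUnaligned, 16, 8));
    }
}
```

The sliced segment mirrors the `nativeAligned.asSlice(1)` case from the PR description: its base is off by one byte, so the check fails and the non-speculating version runs.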
> Is the motivation to use this as a way to do prep work for alias analysis? I see this as a bug-fix AND preparation for future work. I suppose I might not have fixed this bug here since our platforms are not really affected, but I might as well fix it now since I can re-use most of the code later. > Do you intend to use a single deoptimization reason for all vectorization related predicates? (that is when you take care of aliasing, are you going to use the same reason for aliasing and alignment checks) I suppose that is currently what I'm planning. But we could in principle separate them. But I would leave that for later, if there is any desire to do that. For now, I think it's ok to just go with a single "auto-vectorization" reason. Does that sound reasonable? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2663434802 From dnsimon at openjdk.org Mon Feb 17 16:12:41 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 17 Feb 2025 16:12:41 GMT Subject: RFR: 8346781: [JVMCI] Limit ServiceLoader to class initializers [v2] In-Reply-To: References: Message-ID: > In the context of libgraal, the current use of ServiceLoader in JVMCI is problematic as libgraal does all class loading at image build time. There are static fields such as `JVMCIServiceLocator.cachedLocators` that need to be initialized [via reflection](https://github.com/oracle/graal/blob/30492c3f7847a13ae7f8dc50663a5a039e49a8e7/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/libgraal/BuildTime.java#L175-L180) when building libgraal. > > This PR removes the need for such reflection by moving all use of ServiceLoader in JVMCI into `<clinit>` methods. These methods are executed when building libgraal. It also removes a few other public methods and fields that are no longer used by Graal. Given that JVMCI is still experimental and only has qualified exports to Graal, I don't think this needs a CSR.
Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8346781 - remove non-native-image build time use of ServiceLoader - make Cleaner.clean public ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22869/files - new: https://git.openjdk.org/jdk/pull/22869/files/24bb39be..7c91d00c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22869&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22869&range=00-01 Stats: 212534 lines in 5089 files changed: 102007 ins; 88290 del; 22237 mod Patch: https://git.openjdk.org/jdk/pull/22869.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22869/head:pull/22869 PR: https://git.openjdk.org/jdk/pull/22869 From duke at openjdk.org Mon Feb 17 16:34:45 2025 From: duke at openjdk.org (Marc Chevalier) Date: Mon, 17 Feb 2025 16:34:45 GMT Subject: RFR: 8349180: Remove redundant initialization in ciField constructor Message-ID: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> In `ciField`'s ctor, `_name` is initialized twice. I think we can indeed apply the suggested fix and remove the second assignment. `_name` is set correctly the first time (and without the useless cast), and not modified in between. 
Thanks, Marc ------------- Commit messages: - Remove redundant initialization in ciField constructor Changes: https://git.openjdk.org/jdk/pull/23637/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23637&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349180 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23637.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23637/head:pull/23637 PR: https://git.openjdk.org/jdk/pull/23637 From duke at openjdk.org Mon Feb 17 16:34:45 2025 From: duke at openjdk.org (simon) Date: Mon, 17 Feb 2025 16:34:45 GMT Subject: RFR: 8349180: Remove redundant initialization in ciField constructor In-Reply-To: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> References: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> Message-ID: On Fri, 14 Feb 2025 15:00:06 GMT, Marc Chevalier wrote: > In `ciField`'s ctor, `_name` is initialized twice. I think we can indeed apply the suggested fix and remove the second assignment. `_name` is set correctly the first time (and without the useless cast), and not modified in between. > > Thanks, > Marc Hello @marc-chevalier! I have already open a PR for this matter. PR is https://github.com/openjdk/jdk/pull/23480. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23637#issuecomment-2660560678 From duke at openjdk.org Mon Feb 17 16:34:51 2025 From: duke at openjdk.org (Marc Chevalier) Date: Mon, 17 Feb 2025 16:34:51 GMT Subject: RFR: 8348172: C2: Remove unused local variables in filter_helper() methods Message-ID: Remove useless locals from `TypeOopPtr::filter_helper` and `TypeKlassPtr::filter_helper`. There were no side effects in their init, so it's fine to remove them. Thanks, Marc ------------- Commit messages: - ci? 
- Remove unused locals Changes: https://git.openjdk.org/jdk/pull/23629/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23629&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8348172 Stats: 4 lines in 1 file changed: 0 ins; 4 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23629.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23629/head:pull/23629 PR: https://git.openjdk.org/jdk/pull/23629 From jbhateja at openjdk.org Mon Feb 17 16:53:13 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 17 Feb 2025 16:53:13 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 07:52:30 GMT, Nicole Xu wrote: >> Oh, the OCA-verify is still stuck. I'm sorry about that. >> I pinged my manager @TobiHartmann, he will reach out to see what's the issue. > > Hi @eme64, do you see any risks here? Would you please help to review the patch? Thanks. > @xyyNicole @jatin-bhateja I think it is reasonable to just fix the benchmark so that it still has the same behaviour, just without the out-of-bounds exception. @jatin-bhateja you originally wrote the benchmark, and it could make sense if you fixed it up to what it should be more ideally. @xyyNicole I propose that we file a follow-up RFE to fix the benchmark, and just mention that issue in the benchmark. > > What do you think? Hi @xyyNicole, may I ask you to file a follow-up RFE mentioning the issues discussed? Lanewise operations perform best, without noise, when they are fed contiguous memory locations.
Best Regards, Jatin ------------- PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2663639602 From dnsimon at openjdk.org Mon Feb 17 17:11:22 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 17 Feb 2025 17:11:22 GMT Subject: RFR: 8346781: [JVMCI] Limit ServiceLoader to class initializers [v2] In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 16:12:41 GMT, Doug Simon wrote: >> In the context of libgraal, the current use of ServiceLoader in JVMCI is problematic as libgraal does all class loading at image build time. There are static fields such as `JVMCIServiceLocator.cachedLocators` that need to be initialized [via reflection](https://github.com/oracle/graal/blob/30492c3f7847a13ae7f8dc50663a5a039e49a8e7/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/libgraal/BuildTime.java#L175-L180) when building libgraal. >> >> This PR removes the need for such reflection by moving all use of ServiceLoader in JVMCI into `<clinit>` methods. These methods are executed when building libgraal. It also removes a few other public methods and fields that are no longer used by Graal. Given that JVMCI is still experimental and only has qualified exports to Graal, I don't think this needs a CSR. > > Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: > > - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8346781 > - remove non-native-image build time use of ServiceLoader > - make Cleaner.clean public Passes openjdk-pr-canary: https://github.com/dougxc/openjdk-pr-canary/actions/runs/13374826011/job/37351770830#step:4:47 ------------- PR Comment: https://git.openjdk.org/jdk/pull/22869#issuecomment-2663687923 From shade at openjdk.org Mon Feb 17 17:11:49 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 17 Feb 2025 17:11:49 GMT Subject: RFR: 8350159: compiler/tiered/Level2RecompilationTest.java fails after JDK-8349915 Message-ID: Recently added hunk in `CompilationPolicy::selected_task` was supposed to target CTW runs, that wanted to omit any level changes. But there are tests that _do test_ level changes, and they submit `Whitebox` requests. One of those tests is `compiler/tiered/Level2RecompilationTest.java`. So it looks like we need to disambiguate the "CTW" uses and "general Whitebox" uses. Looks like checking for `-Xbatch` does the trick for CTW. It is not super-clean, but it works, and it matches other exceptions in around compilation policy, e.g. when checking for `-Xcomp`, etc. 
Additional testing: - [x] Linux AArch64 server fastdebug, `compiler/tiered/Level2RecompilationTest.java` now passes - [x] Linux AArch64 server fastdebug, CTW tests still work fine - [ ] Linux AArch64 server fastdebug, `all` ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/23668/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23668&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350159 Stats: 5 lines in 2 files changed: 0 ins; 2 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23668.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23668/head:pull/23668 PR: https://git.openjdk.org/jdk/pull/23668 From kvn at openjdk.org Mon Feb 17 17:32:10 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 17 Feb 2025 17:32:10 GMT Subject: RFR: 8350178: Incorrect comment after JDK-8345580 [v2] In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 10:52:00 GMT, SendaoYan wrote: >> Hi all, >> In [JDK-8345580](https://bugs.openjdk.org/browse/JDK-8345580), the const modifier for variable Node::_idx has been removed, but the constant description for variable Node::_idx leave unchange. The related description should be updated. >> >> Only touch the comments, no risk. > > SendaoYan has updated the pull request incrementally with one additional commit since the last revision: > > Remove extra a whitespace Trivial. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23659#pullrequestreview-2621648718 From dnsimon at openjdk.org Mon Feb 17 17:43:14 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 17 Feb 2025 17:43:14 GMT Subject: RFR: 8346781: [JVMCI] Limit ServiceLoader to class initializers [v2] In-Reply-To: References: Message-ID: On Mon, 23 Dec 2024 18:06:21 GMT, Doug Simon wrote: >> Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: >> >> - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8346781 >> - remove non-native-image build time use of ServiceLoader >> - make Cleaner.clean public > > src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/services/Services.java line 52: > >> 50: * statement on this field - the guard cannot be behind a method call. >> 51: */ >> 52: public static final boolean IS_BUILDING_NATIVE_IMAGE = Boolean.parseBoolean(VM.getSavedProperty("jdk.vm.ci.services.aot")); > > This field is no longer used in JVMCI and I will remove its usages in Graal. Removed in https://github.com/oracle/graal/pull/10380 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22869#discussion_r1958608248 From yzheng at openjdk.org Mon Feb 17 17:56:15 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Mon, 17 Feb 2025 17:56:15 GMT Subject: RFR: 8346781: [JVMCI] Limit ServiceLoader to class initializers [v2] In-Reply-To: References: Message-ID: <98jmUmCaXEstTsMZUeuKA1QBro7kZvIZhrFsQWbQIj0=.f4e81caf-78b4-44b8-9d70-b1d68cfc6f7b@github.com> On Mon, 17 Feb 2025 16:12:41 GMT, Doug Simon wrote: >> In the context of libgraal, the current use of ServiceLoader in JVMCI is problematic as libgraal does all class loading at image build time. There are static fields such as `JVMCIServiceLocator.cachedLocators` that need to be initialized [via reflection](https://github.com/oracle/graal/blob/30492c3f7847a13ae7f8dc50663a5a039e49a8e7/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/libgraal/BuildTime.java#L175-L180) when building libgraal. >> >> This PR removes the need for such reflection by moving all use of ServiceLoader in JVMCI into `<clinit>` methods. These methods are executed when building libgraal. It also removes a few other public methods and fields that are no longer used by Graal. Given that JVMCI only has qualified exports to Graal, a CSR is not needed.
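The idea of confining `ServiceLoader` use to class initializers can be sketched as follows. This is a hypothetical illustration, not the JVMCI code: `LocatorRegistrySketch` and its nested `Locator` interface are made-up names, and the only point is that the lookup runs once inside the static initializer (the `<clinit>` method at the bytecode level), so a tool that runs class initializers at image build time captures the cached result without reflection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical registry: all ServiceLoader use happens in the static initializer.
final class LocatorRegistrySketch {
    // Made-up service interface, standing in for a real provider type.
    interface Locator {
        String name();
    }

    // Populated exactly once, when the class is initialized.
    private static final List<Locator> CACHED_LOCATORS;

    static {
        List<Locator> found = new ArrayList<>();
        for (Locator locator : ServiceLoader.load(Locator.class)) {
            found.add(locator);
        }
        CACHED_LOCATORS = List.copyOf(found);
    }

    // Callers only ever see the cached, immutable list.
    static List<Locator> locators() {
        return CACHED_LOCATORS;
    }

    public static void main(String[] args) {
        System.out.println("providers found: " + locators().size());
    }
}
```

Since no provider is registered for the made-up interface, the cached list is simply empty when this is run on its own.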
> > Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8346781 > - remove non-native-image build time use of ServiceLoader > - make Cleaner.clean public LGTM ------------- Marked as reviewed by yzheng (Committer). PR Review: https://git.openjdk.org/jdk/pull/22869#pullrequestreview-2621722417 From kvn at openjdk.org Mon Feb 17 18:05:09 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 17 Feb 2025 18:05:09 GMT Subject: RFR: 8350159: compiler/tiered/Level2RecompilationTest.java fails after JDK-8349915 In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 17:04:41 GMT, Aleksey Shipilev wrote: > Recently added hunk in `CompilationPolicy::selected_task` was supposed to target CTW runs, that wanted to omit any level changes. But there are tests that _do test_ level changes, and they submit `Whitebox` requests. One of those tests is `compiler/tiered/Level2RecompilationTest.java`. So it looks like we need to disambiguate the "CTW" uses and "general Whitebox" uses. > > Looks like checking for `-Xbatch` does the trick for CTW. It is not super-clean, but it works, and it matches other exceptions in around compilation policy, e.g. when checking for `-Xcomp`, etc. 
> > Additional testing: > - [x] Linux AArch64 server fastdebug, `compiler/tiered/Level2RecompilationTest.java` now passes > - [x] Linux AArch64 server fastdebug, CTW tests still work fine > - [ ] Linux AArch64 server fastdebug, `all` src/hotspot/share/compiler/compilationPolicy.cpp line 636: > 634: continue; > 635: } > 636: if (task->is_blocking() && task->compile_reason() == CompileTask::Reason_Whitebox) { This may be not enough - testing with `-Xcomp` sets `-Xbatch`: [arguments.cpp#L1358](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/arguments.cpp#L1358) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23668#discussion_r1958643768 From kvn at openjdk.org Mon Feb 17 18:15:17 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 17 Feb 2025 18:15:17 GMT Subject: RFR: 8350159: compiler/tiered/Level2RecompilationTest.java fails after JDK-8349915 In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 18:02:53 GMT, Vladimir Kozlov wrote: >> Recently added hunk in `CompilationPolicy::selected_task` was supposed to target CTW runs, that wanted to omit any level changes. But there are tests that _do test_ level changes, and they submit `Whitebox` requests. One of those tests is `compiler/tiered/Level2RecompilationTest.java`. So it looks like we need to disambiguate the "CTW" uses and "general Whitebox" uses. >> >> Looks like checking for `-Xbatch` does the trick for CTW. It is not super-clean, but it works, and it matches other exceptions in around compilation policy, e.g. when checking for `-Xcomp`, etc. 
>> >> Additional testing: >> - [x] Linux AArch64 server fastdebug, `compiler/tiered/Level2RecompilationTest.java` now passes >> - [x] Linux AArch64 server fastdebug, CTW tests still work fine >> - [ ] Linux AArch64 server fastdebug, `all` > > src/hotspot/share/compiler/compilationPolicy.cpp line 636: > >> 634: continue; >> 635: } >> 636: if (task->is_blocking() && task->compile_reason() == CompileTask::Reason_Whitebox) { > > This may be not enough - testing with `-Xcomp` sets `-Xbatch`: [arguments.cpp#L1358](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/arguments.cpp#L1358) On other hand, when running with `-Xcomp` we should not worry about priority of compilation tasks. And `Level2RecompilationTest.java` "requires" `vm.compMode != "Xcomp"` so it should pass. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23668#discussion_r1958649186 From shade at openjdk.org Mon Feb 17 18:15:17 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 17 Feb 2025 18:15:17 GMT Subject: RFR: 8350159: compiler/tiered/Level2RecompilationTest.java fails after JDK-8349915 In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 18:10:05 GMT, Vladimir Kozlov wrote: >> src/hotspot/share/compiler/compilationPolicy.cpp line 636: >> >>> 634: continue; >>> 635: } >>> 636: if (task->is_blocking() && task->compile_reason() == CompileTask::Reason_Whitebox) { >> >> This may be not enough - testing with `-Xcomp` sets `-Xbatch`: [arguments.cpp#L1358](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/arguments.cpp#L1358) > > On other hand, when running with `-Xcomp` we should not worry about priority of compilation tasks. > And `Level2RecompilationTest.java` "requires" `vm.compMode != "Xcomp"` so it should pass. Yes. Testing confirms both the affected test and CTW works fine. I think the cleanest way is to tell `Whitebox` whether we are entering for CTW, or for faking a normal compilation. But it would significantly more intrusive. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23668#discussion_r1958651531 From kvn at openjdk.org Mon Feb 17 18:26:09 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 17 Feb 2025 18:26:09 GMT Subject: RFR: 8350159: compiler/tiered/Level2RecompilationTest.java fails after JDK-8349915 In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 17:04:41 GMT, Aleksey Shipilev wrote: > Recently added hunk in `CompilationPolicy::selected_task` was supposed to target CTW runs, that wanted to omit any level changes. But there are tests that _do test_ level changes, and they submit `Whitebox` requests. One of those tests is `compiler/tiered/Level2RecompilationTest.java`. So it looks like we need to disambiguate the "CTW" uses and "general Whitebox" uses. > > Looks like checking for `-Xbatch` does the trick for CTW. It is not super-clean, but it works, and it matches other exceptions in around compilation policy, e.g. when checking for `-Xcomp`, etc. > > Additional testing: > - [x] Linux AArch64 server fastdebug, `compiler/tiered/Level2RecompilationTest.java` now passes > - [x] Linux AArch64 server fastdebug, CTW tests still work fine > - [ ] Linux AArch64 server fastdebug, `all` I submitted our testing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23668#issuecomment-2663844736 From kvn at openjdk.org Mon Feb 17 18:29:09 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 17 Feb 2025 18:29:09 GMT Subject: RFR: 8350159: compiler/tiered/Level2RecompilationTest.java fails after JDK-8349915 In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 18:12:46 GMT, Aleksey Shipilev wrote: >> On other hand, when running with `-Xcomp` we should not worry about priority of compilation tasks. >> And `Level2RecompilationTest.java` "requires" `vm.compMode != "Xcomp"` so it should pass. > > Yes. Testing confirms both the affected test and CTW works fine. 
> > I think the cleanest way is to tell `Whitebox` whether we are entering for CTW, or for faking a normal compilation. But it would be significantly more intrusive. Right. I will test this change. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23668#discussion_r1958662589 From kvn at openjdk.org Mon Feb 17 18:43:23 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 17 Feb 2025 18:43:23 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v10] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 17 Feb 2025 06:24:35 GMT, Axel Boldt-Christmas wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove commented lines left by mistake > > src/hotspot/share/code/codeBlob.hpp line 308: > >> 306: >> 307: class Vptr : public CodeBlob::Vptr { >> 308: }; > > Was this needed for some compiler? Or is it to be more explicit about the type hierarchy? Thank you, @xmas92, for the review and suggestions. It is the second (explicit type hierarchy). I think it should be explicitly declared (even if empty) because it is referenced in subclasses, to avoid confusion. And it could be useful in the future if we need other virtual methods. A local build with `gcc` on Linux passed without it, but I did not try to build on other platforms.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1958673128 From kvn at openjdk.org Mon Feb 17 18:50:18 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 17 Feb 2025 18:50:18 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v10] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> <9sHQ2GZxt0TERM5ghWCA2hArWxsdIErWZIAEJ9e1N3I=.4928b81a-be09-43a8-94c6-75e7bd645ed9@github.com> Message-ID: On Thu, 13 Feb 2025 11:55:17 GMT, Boris Ulasevich wrote: >> Looks good. I will submit testing. > >> Looks good. I will submit testing. > > Thank you! > > The change is not yet ready for final testing. I still need to remove my raw access workaround in nmethod::oop_at and rebase onto #23512 once it has been integrated. @bulasevich my other PR #23533 is ready. It will conflict with your changes. Are you okay if I push it first? ------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2663879698 From kvn at openjdk.org Mon Feb 17 19:06:09 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 17 Feb 2025 19:06:09 GMT Subject: RFR: 8348172: C2: Remove unused local variables in filter_helper() methods In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 10:55:28 GMT, Marc Chevalier wrote: > Remove useless locals from `TypeOopPtr::filter_helper` and `TypeKlassPtr::filter_helper`. There were no side effects in their init, so it's fine to remove them. > > Thanks, > Marc Good. ------------- Marked as reviewed by kvn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23629#pullrequestreview-2621810090 From dnsimon at openjdk.org Mon Feb 17 19:37:24 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 17 Feb 2025 19:37:24 GMT Subject: RFR: 8346781: [JVMCI] Limit ServiceLoader to class initializers [v2] In-Reply-To: References: Message-ID: <0elzblvKiIjGRnZiBSPjStJpDMTPJyXObkHwVuStSJg=.8ac2fd8e-d38c-42de-a1fa-c94eac144a73@github.com> On Mon, 17 Feb 2025 16:12:41 GMT, Doug Simon wrote: >> In the context of libgraal, the current use of ServiceLoader in JVMCI is problematic as libgraal does all class loading at image build time. There are static fields such as `JVMCIServiceLocator.cachedLocators` that need to be initialized [via reflection](https://github.com/oracle/graal/blob/30492c3f7847a13ae7f8dc50663a5a039e49a8e7/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/libgraal/BuildTime.java#L175-L180) when building libgraal. >> >> This PR removes the need for such reflection by moving all use of ServiceLoader in JVMCI into `<clinit>` methods. These methods are executed when building libgraal. It also removes a few other public methods and fields that are no longer used by Graal. Given that JVMCI only has qualified exports to Graal, a CSR is not needed. > > Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8346781 > - remove non-native-image build time use of ServiceLoader > - make Cleaner.clean public Thanks for the reviews.
------------- PR Comment: https://git.openjdk.org/jdk/pull/22869#issuecomment-2663944221 From dnsimon at openjdk.org Mon Feb 17 19:37:24 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 17 Feb 2025 19:37:24 GMT Subject: Integrated: 8346781: [JVMCI] Limit ServiceLoader to class initializers In-Reply-To: References: Message-ID: On Mon, 23 Dec 2024 17:58:23 GMT, Doug Simon wrote: > In the context of libgraal, the current use of ServiceLoader in JVMCI is problematic as libgraal does all class loading at image build time. There are static fields such as `JVMCIServiceLocator.cachedLocators` that need to be initialized [via reflection](https://github.com/oracle/graal/blob/30492c3f7847a13ae7f8dc50663a5a039e49a8e7/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/libgraal/BuildTime.java#L175-L180) when building libgraal. > > This PR removes the need for such reflection by moving all use of ServiceLoader in JVMCI into `<clinit>` methods. These methods are executed when building libgraal. It also removes a few other public methods and fields that are no longer used by Graal. Given that JVMCI only has qualified exports to Graal, a CSR is not needed. This pull request has now been integrated. Changeset: 8ec58939 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/8ec589390f7dc67dd883a1efddb8da32790f6591 Stats: 166 lines in 7 files changed: 10 ins; 126 del; 30 mod 8346781: [JVMCI] Limit ServiceLoader to class initializers Reviewed-by: never, yzheng ------------- PR: https://git.openjdk.org/jdk/pull/22869 From kvn at openjdk.org Mon Feb 17 22:49:16 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 17 Feb 2025 22:49:16 GMT Subject: RFR: 8350159: compiler/tiered/Level2RecompilationTest.java fails after JDK-8349915 In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 17:04:41 GMT, Aleksey Shipilev wrote: > Recently added hunk in `CompilationPolicy::selected_task` was supposed to target CTW runs, that wanted to omit any level changes.
But there are tests that _do test_ level changes, and they submit `Whitebox` requests. One of those tests is `compiler/tiered/Level2RecompilationTest.java`. So it looks like we need to disambiguate the "CTW" uses and "general Whitebox" uses. > > Looks like checking for `-Xbatch` does the trick for CTW. It is not super-clean, but it works, and it matches other exceptions in around compilation policy, e.g. when checking for `-Xcomp`, etc. > > Additional testing: > - [x] Linux AArch64 server fastdebug, `compiler/tiered/Level2RecompilationTest.java` now passes > - [x] Linux AArch64 server fastdebug, CTW tests still work fine > - [ ] Linux AArch64 server fastdebug, `all` My tier1-4, xcomp, stress testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23668#pullrequestreview-2622052595 From fyang at openjdk.org Tue Feb 18 00:12:10 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 18 Feb 2025 00:12:10 GMT Subject: RFR: 8349908: RISC-V: C2 SelectFromTwoVector [v3] In-Reply-To: <3booXWD4LZQaI9r1VftWWjuFUN71kNIrwsgNTXjSkQQ=.d6844332-2d54-4dca-a44e-4e8bf6339e5d@github.com> References: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> <3booXWD4LZQaI9r1VftWWjuFUN71kNIrwsgNTXjSkQQ=.d6844332-2d54-4dca-a44e-4e8bf6339e5d@github.com> Message-ID: On Mon, 17 Feb 2025 15:15:27 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review the patch? >> This optimization is mainly for the vector API. 
>>
>> Thanks
>>
>> ## Test
>>
>> ### jtreg
>> test/jdk/jdk/incubator/vector/
>>
>> ### Performance
>> run on bananapi
>>
>> master vs patch
>>
>> Benchmark | (size) | Mode | Cnt | Score - master | Error - master | Score - patch | Error - patch | Units | Improvement (master / patch)
>> -- | -- | -- | -- | -- | -- | -- | -- | -- | --
>> SelectFromBenchmark.selectFromByteVector | 1024 | avgt | 10 | 26422.495 | 674.565 | 721.604 | 1.036 | ns/op | 36.616
>> SelectFromBenchmark.selectFromByteVector | 2048 | avgt | 10 | 53964.411 | 1751.618 | 1385.24 | 0.956 | ns/op | 38.957
>> SelectFromBenchmark.selectFromDoubleVector | 1024 | avgt | 10 | 218430.616 | 1369.409 | 7739.774 | 14.408 | ns/op | 28.222
>> SelectFromBenchmark.selectFromDoubleVector | 2048 | avgt | 10 | 387889.456 | 7889.791 | 16197.77 | 66.775 | ns/op | 23.947
>> SelectFromBenchmark.selectFromFloatVector | 1024 | avgt | 10 | 103483.717 | 492.525 | 3580.358 | 29.127 | ns/op | 28.903
>> SelectFromBenchmark.selectFromFloatVector | 2048 | avgt | 10 | 226125.02 | 3118.836 | 7797.025 | 4.346 | ns/op | 29.001
>> SelectFromBenchmark.selectFromIntVector | 1024 | avgt | 10 | 97007.999 | 2607.711 | 2898.38 | 0.833 | ns/op | 33.47
>> SelectFromBenchmark.selectFromIntVector | 2048 | avgt | 10 | 222303.308 | 3096.615 | 6398.214 | 30.345 | ns/op | 34.745
>> SelectFromBenchmark.selectFromLongVector | 1024 | avgt | 10 | 245033.436 | 1652.527 | 6307.773 | 24.597 | ns/op | 38.846
>> SelectFromBenchmark.selectFromLongVector | 2048 | avgt | 10 | 438503.547 | 5972.265 | 17215.996 | 167.442 | ns/op | 25.471
>> SelectFromBenchmark.selectFromShortVector | 1024 | avgt | 10 | 53632.502 | 2159.433 | 1418.215 | 2.953 | ns/op | 37.817
>> SelectFromBenchmark.selectFromShortVector | 2048 | avgt | 10 | 111764.327 | 1220.509 | 3061.386 | 14.716 | ns/op | 36.508
>
> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision:
>
> space

Marked as reviewed by fyang (Reviewer).
------------- PR Review: https://git.openjdk.org/jdk/pull/23614#pullrequestreview-2622109537 From haosun at openjdk.org Tue Feb 18 02:10:16 2025 From: haosun at openjdk.org (Hao Sun) Date: Tue, 18 Feb 2025 02:10:16 GMT Subject: RFR: 8348172: C2: Remove unused local variables in filter_helper() methods In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 10:55:28 GMT, Marc Chevalier wrote: > Remove useless locals from `TypeOopPtr::filter_helper` and `TypeKlassPtr::filter_helper`. There were no side effects in their init, so it's fine to remove them. > > Thanks, > Marc LGTM ------------- Marked as reviewed by haosun (Committer). PR Review: https://git.openjdk.org/jdk/pull/23629#pullrequestreview-2622268140 From jkarthikeyan at openjdk.org Tue Feb 18 02:37:29 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 18 Feb 2025 02:37:29 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v31] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 04:39:56 GMT, Johannes Graham wrote: >> An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. >> >> This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it. >> >> In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items: >> - Bounds optimization of xor >> - A check for `x ^ x = 0` >> - Explicit testing of xor over booleans. 
>> >> Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization. It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types. >> >> --------- >> ### Progress >> - [x] Change must not contain extraneous whitespace >> - [x] Commit message must refer to an issue >> - [ ] Change must be properly reviewed (2 reviews required, with at least 2 [Reviewers](https://openjdk.org/bylaws#reviewer)) >> >> >> >> ### Reviewers >> * [Quan Anh Mai](https://openjdk.org/census#qamai) (@merykitty - Committer) ? Re-review required (review applies to [cf779497](https://git.openjdk.org/jdk/pull/23089/files/cf77949776f7a4601268c7291a5743c2eb164186)) >> >> ### Reviewing >>
Using git >> >> Checkout this PR locally: \ >> `$ git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089` \ >> `$ git checkout pull/23089` >> >> Update a local copy of the PR: \ >> `$ git checkout pull/23089` \ >> `$ git pull https://git.openjdk.org/jdk.git pull/23089/head` >> >>
>>
Using Skara CLI tools >> >> Checkout this PR locally: \ >> `$ git pr checkout 23089` >> >> View PR using the GUI difftool: \ >> `$ git pr show -t 23089` >> >>
>>
Using diff file >> >> Download this PR as a diff file: \ >> https://git.openjdk.org/jdk/pull/23089.diff >> >>
>>
Using Webrev >> >> [Link to Webrev Comment](https://git.openjdk.org/jdk/pull/23089#issuecomment-25939... > > Johannes Graham has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 45 commits: > > - Merge branch 'openjdk:master' into xor_const > - fix variable names in comments > - update test > - address review comments > - formatting, remove commented tests > - add IR tests for long, simplify tests for int > - formatting > - add sanity asserts to tests > - re-add tests > - try fewer tests > - ... and 35 more: https://git.openjdk.org/jdk/compare/ff52859d...16049cdc This looks really nice! I just have 2 small comments here. src/hotspot/share/opto/addnode.cpp line 1032: > 1030: // round_up is safe because high bit is unset (0 <= lo <= hi) > 1031: > 1032: return round_up_power_of_2(U(hi_0 | hi_1) + 1) - 1 ; Suggestion: return round_up_power_of_2(U(hi_0 | hi_1) + 1) - 1; test/hotspot/jtreg/compiler/c2/irTests/XorINodeIdealizationTests.java line 315: > 313: > 314: @Test > 315: public int testXorConstRange(int x, int y) { Should this have an `@IR` test attached, like `testFoldableXor`? ------------- PR Review: https://git.openjdk.org/jdk/pull/23089#pullrequestreview-2621997896 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1958977989 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1958978363 From jwaters at openjdk.org Tue Feb 18 02:39:25 2025 From: jwaters at openjdk.org (Julian Waters) Date: Tue, 18 Feb 2025 02:39:25 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 06:32:56 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. 
Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577) for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review comments resolutions Is anyone else getting compile failures after this was integrated? This weirdly seems to only happen on Linux * For target hotspot_variant-server_libjvm_objs_mulnode.o: /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp: In member function ‘virtual const Type* FmaHFNode::Value(PhaseGVN*) const’: /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:1944:37: error: call of overloaded ‘make(double)’
is ambiguous 1944 | return TypeH::make(fma(f1, f2, f3)); | ^ In file included from /home/runner/work/jdk/jdk/src/hotspot/share/opto/node.hpp:31, from /home/runner/work/jdk/jdk/src/hotspot/share/opto/addnode.hpp:28, from /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:26: /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:544:23: note: candidate: ‘static const TypeH* TypeH::make(float)’ 544 | static const TypeH* make(float f); | ^~~~ /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:545:23: note: candidate: ‘static const TypeH* TypeH::make(short int)’ 545 | static const TypeH* make(short f); | ^~~~ ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2664473623 From duke at openjdk.org Tue Feb 18 02:40:50 2025 From: duke at openjdk.org (Johannes Graham) Date: Tue, 18 Feb 2025 02:40:50 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v32] In-Reply-To: References: Message-ID: > An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it. > > In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items: > - Bounds optimization of xor > - A check for `x ^ x = 0` > - Explicit testing of xor over booleans. > > Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization.
It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types. > > --------- > ### Progress > - [x] Change must not contain extraneous whitespace > - [x] Commit message must refer to an issue > - [ ] Change must be properly reviewed (2 reviews required, with at least 2 [Reviewers](https://openjdk.org/bylaws#reviewer)) > > > > ### Reviewers > * [Quan Anh Mai](https://openjdk.org/census#qamai) (@merykitty - Committer) ? Re-review required (review applies to [cf779497](https://git.openjdk.org/jdk/pull/23089/files/cf77949776f7a4601268c7291a5743c2eb164186)) > > ### Reviewing >
Using Webrev > > [Link to Webrev Comment](https://git.openjdk.org/jdk/pull/23089#issuecomment-2593992282) >
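The upper-bound expression quoted in the review above, `round_up_power_of_2(U(hi_0 | hi_1) + 1) - 1`, can be checked by brute force for small non-negative ranges, in the spirit of the exhaustive 4-bit test the description mentions. What follows is a minimal Java sketch, not the code from the patch or from `test_xor_node.cpp`. The class and method names (`XorUpperBoundSketch`, `roundUpPowerOf2`, `xorUpperBound`, `holdsForAllFourBitRanges`) are hypothetical, and it assumes both operands lie in `[0, hi]`.

```java
class XorUpperBoundSketch {
    // Hypothetical stand-in for HotSpot's round_up_power_of_2:
    // smallest power of two that is >= x, for 1 <= x <= 2^30.
    static int roundUpPowerOf2(int x) {
        int highest = Integer.highestOneBit(x);
        return highest == x ? x : highest << 1;
    }

    // Upper bound for a ^ b, assuming 0 <= a <= hi0 and 0 <= b <= hi1.
    static int xorUpperBound(int hi0, int hi1) {
        return roundUpPowerOf2((hi0 | hi1) + 1) - 1;
    }

    // Brute-force check over every pair of 4-bit upper limits.
    static boolean holdsForAllFourBitRanges() {
        for (int hi0 = 0; hi0 < 16; hi0++) {
            for (int hi1 = 0; hi1 < 16; hi1++) {
                int bound = xorUpperBound(hi0, hi1);
                for (int a = 0; a <= hi0; a++) {
                    for (int b = 0; b <= hi1; b++) {
                        if ((a ^ b) > bound) {
                            return false;
                        }
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(xorUpperBound(5, 2));        // prints 7
        System.out.println(holdsForAllFourBitRanges()); // prints true
    }
}
```

The bound is the all-ones mask that covers the highest set bit of either limit. For example, limits 4 and 4 give 7, which is reached by `3 ^ 4`.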
Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: Fix formatting Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/16049cdc..e8fc6dab Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=31 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=30-31 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From cjplummer at openjdk.org Tue Feb 18 03:05:16 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Tue, 18 Feb 2025 03:05:16 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v10] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sat, 15 Feb 2025 06:34:56 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove commented lines left by mistake SA changes look good. Thanks for taking care of this. ------------- Marked as reviewed by cjplummer (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2622331256 From fyang at openjdk.org Tue Feb 18 03:18:11 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 18 Feb 2025 03:18:11 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v3] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:04:24 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch to implement MulReductionVI/MulReductionVL/MulReductionVF/MulReductionVD? >> This optimization is mainly for the vector API. >> On riscv, there are no straightforward instructions to do it, but we can do it with a reduction tree, which could reduce the time complexity to lg(N). >> >> >> Thanks >> >> ## Test >> >> ### jtreg >> test/jdk/jdk/incubator/vector/ >> >> ### Performance >> >> run on bananapi >> >> master vs patch >> >> Benchmark | (size) | Mode | Cnt | Score - master | Error - master | Score - patch | Error - patch | Units | Improvement >> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- >> ByteMaxVector.MULLanes | 1024 | avgt | 10 | 11170.052 | 499.676 | 1294.424 | 8.346 | ns/op | 88.40% >> ByteMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 12514.587 | 55.06 | 1413.028 | 0.259 | ns/op | 88.70% >> DoubleMaxVector.MULLanes | 1024 | avgt | 10 | 57672.51 | 1750.417 | 4775.633 | 1.454 | ns/op | 91.70% >> DoubleMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63656.523 | 1063.048 | 5899.259 | 1.692 | ns/op | 90.70% >> FloatMaxVector.MULLanes | 1024 | avgt | 10 | 30997.218 | 728.73 | 2473.069 | 5.84 | ns/op | 92.00% >> FloatMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 35515.329 | 227.873 | 3284.17 | 0.608 | ns/op | 90.80% >> IntMaxVector.MULLanes | 1024 | avgt | 10 | 31130.453 | 878.261 | 3304.118 | 5.96 | ns/op | 89.40% >> IntMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 36851.976 | 394.001 | 3969.407 | 0.511 | ns/op | 89.20% >> LongMaxVector.MULLanes | 1024 | avgt | 10 | 58795.752 | 1030.985 | 6883.995 | 3.15 | ns/op | 88.30% >> LongMaxVector.MULMaskedLanes | 1024 | avgt | 10 |
63642.904 | 386.521 | 7892.735 | 9.359 | ns/op | 87.60% >> ShortMaxVector.MULLanes | 1024 | avgt | 10 | 16857.441 | 762.428 | 2287.141 | 0.186 | ns/op | 86.40% >> ShortMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 21171.375 | 74.684 | 2532.913 | 0.274 | ns/op | 88.00% >> >> > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > mimor Thanks for the update. Several more comments after another look. src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 2961: > 2959: BasicType bt, uint vector_length, VectorMask vm) { > 2960: assert(bt == T_BYTE || bt == T_SHORT || bt == T_INT || bt == T_LONG, "unsupported element type"); > 2961: uint len = vector_length / type2aelembytes(bt); Why not pass the number of elements directly by param `vector_length`? On the call sites in the ad file, `Matcher::vector_length(this)` gives exactly what you want. src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 3001: > 2999: BasicType bt, uint vector_length, VectorMask vm) { > 3000: assert(bt == T_FLOAT || bt == T_DOUBLE, "unsupported element type"); > 3001: uint len = vector_length / type2aelembytes(bt); Similar here. src/hotspot/cpu/riscv/riscv_v.ad line 106: > 104: case Op_MulReductionVF: > 105: case Op_MulReductionVD: > 106: if (vlen < 4) { A code comment about why this required would help understand. 
src/hotspot/cpu/riscv/riscv_v.ad line 2464: > 2462: %} > 2463: > 2464: instruct reduce_mulL(iRegLNoSp dst, iRegLNoSp isrc, vReg vsrc, Suggestion: `iRegL isrc` src/hotspot/cpu/riscv/riscv_v.ad line 2479: > 2477: %} > 2478: > 2479: instruct reduce_mulL_masked(iRegLNoSp dst, iRegLNoSp isrc, vReg vsrc, Suggestion: `iRegL isrc` ------------- PR Review: https://git.openjdk.org/jdk/pull/23580#pullrequestreview-2622159419 PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1958962339 PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1959000596 PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1958905457 PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1959002320 PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1959002749 From duke at openjdk.org Tue Feb 18 03:53:12 2025 From: duke at openjdk.org (Nicole Xu) Date: Tue, 18 Feb 2025 03:53:12 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: On Wed, 8 Jan 2025 09:04:47 GMT, Nicole Xu wrote: > Suite MaskedLogicOpts.maskedLogicOperationsLong512() failed on both x86 and AArch64 with the following error: > > > java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249 > > > The variable `long256_arr_idx` is misused when indexing 'LongVector l2, l3, l4, l5' in function `maskedLogicOperationsLongKernel()`. 'long256_arr_idx' increases by 4 every time the benchmark runs and ensures the incremented value remains within the bounds of the array. However, for `LongVector.SPECIES_512`, it loads 8 numbers from the array each time the benchmark runs, resulting in an out-of-range indexing issue. > > Hence, we revised the index variables from `long256_arr_idx` to `long512_arr_idx`, which has a stride of 8, to ensure that the loaded vector is inside of the array boundary for all vector species. 
This is also consistent with other kernel functions. > > Additionally, some defined but unused variables have been removed. Sure. Since I am very new to openJDK, I asked my teammate for help to file the follow-up RFE. Here is the https://bugs.openjdk.org/browse/JDK-8350215 with description of the discussed issues. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2664543916 From aboldtch at openjdk.org Tue Feb 18 06:26:12 2025 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Tue, 18 Feb 2025 06:26:12 GMT Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v6] In-Reply-To: <2jrzusvVl-XI8K734YlChq4ObRX75yovTq7mWTf8ZlA=.0e75781a-5d52-4919-ad28-c5e91ec3a47f@github.com> References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> <2jrzusvVl-XI8K734YlChq4ObRX75yovTq7mWTf8ZlA=.0e75781a-5d52-4919-ad28-c5e91ec3a47f@github.com> Message-ID: On Fri, 7 Feb 2025 14:48:51 GMT, Roberto Castañeda Lozano wrote: >> G1 barriers can be safely elided from writes to newly allocated objects as long as no safepoint is taken between the allocation and the write. This changeset complements early G1 barrier elision (performed by the platform-independent phases of C2, and limited to writes immediately following allocations) with a more general elision pass done at a late stage. >> >> The late elision pass exploits that it runs at a stage where the relative order of memory accesses and safepoints cannot change anymore to elide barriers from initialization writes that do not immediately follow the corresponding allocation, e.g. in conditional initialization writes: >> >> >> o = new MyObject(); >> if (...) { >> o.myField = ...; // barrier elided only after this changeset >> // (assuming no safepoint in the if condition) >> } >> >> >> or in initialization writes placed after exception-throwing checks: >> >> >> o = new MyObject(); >> if (...)
{ >> throw new Exception(""); >> } >> o.myField = ...; // barrier elided only after this changeset >> // (assuming no safepoint in the above if condition) >> >> >> These patterns are commonly found in Java code, e.g. in the core libraries: >> >> - [conditional initialization](https://github.com/openjdk/jdk/blob/25fecaaf87400af535c242fe50296f1f89ceeb16/src/java.base/share/classes/java/lang/String.java#L4850), or >> >> - [initialization after exception-throwing checks (in the superclass constructor)](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/nio/X-Buffer.java.template#L324). >> >> The optimization also enhances barrier elision for array initialization writes, for example eliding barriers from small array initialization loops (for which safepoints are not inserted): >> >> >> Object[] a = new Object[...]; >> for (int i = 0; i < a.length; i++) { >> a[i] = ...; // barrier elided only after this changeset >> } >> >> >> or eliding barriers from array initialization writes with unknown array index: >> >> >> Object[] a = new Object[...]; >> a[index] = ...; // barrier elided only after this changeset >> >> >> The logic used to perform this additional barrier elision is a subset of a pre-existing ZGC-specific optimization. This changeset simply reuses the relevant subset (barrier elision for writes to newly-allocated objects) by extracting the core of the optimization logic from `zBarrierSetC2.cpp` into the GC-shared file `barrierSetC2.cpp`. The functions `block_has_safepoint`, `block_inde... > > Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Disable test IR checks for cases where barrier elision analysis fails to elide on s390 The ZGC refactoring lgtm. ------------- Marked as reviewed by aboldtch (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23235#pullrequestreview-2622554105 From fyang at openjdk.org Tue Feb 18 07:48:08 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 18 Feb 2025 07:48:08 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 14:32:07 GMT, Hamlin Li wrote: > Hi, > Can you help to review this patch? > > Currently, `string_compare` code is a bit complicated, main reasons include: > 1. it has 2 pieces of code respectively for LU and UL case, this is not necessary, basically LU and UL behave quite similarly. > 2. it mixed LL/UU and LU/UL case together, better to separate them, as they are quite different from each other. > > This is not good for code reading and maintaining. > > > So, this patch does the following refactoring: > 1. merge LU and UL code into one, i.e. remove UL code. > 2. separate the code into 2 methods: LL/UU and LU/UL. > 3. some other misc improvement. > > I could do the following refactoring in another following pr, as in this pr I'm just moving code and removing code, it's easier to do it and review it. In particular the first one, as it needs to rewrite the existing code for UL/LU case. > 1. move alignment code of `generate_compare_long_string_different_encoding` upwards to `string_compare_long_LU`. > 2. make `SHORT_STRING` case simpler. > > > > Thanks Hi, there is a difference between LU and UL when calculating the difference [1]. The `tmp1` and `tmp2` here represent characters from the first and second string respectively. So the order of the two strings matters here. But that doesn't seem to be reflected by this change? BTW: Seems more readable if we move the difference calculation code to `string_compare_long_LU` and `string_compare_long_LL_UU` at the same time.
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1571 ------------- PR Comment: https://git.openjdk.org/jdk/pull/23633#issuecomment-2664840634 From duke at openjdk.org Tue Feb 18 08:23:10 2025 From: duke at openjdk.org (duke) Date: Tue, 18 Feb 2025 08:23:10 GMT Subject: RFR: 8348172: C2: Remove unused local variables in filter_helper() methods In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 10:55:28 GMT, Marc Chevalier wrote: > Remove useless locals from `TypeOopPtr::filter_helper` and `TypeKlassPtr::filter_helper`. There were no side effects in their init, so it's fine to remove them. > > Thanks, > Marc @marc-chevalier Your change (at version bed4a1efe330f565e06a4954ce166ada38f18403) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23629#issuecomment-2664908718 From rcastanedalo at openjdk.org Tue Feb 18 08:29:12 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 18 Feb 2025 08:29:12 GMT Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v6] In-Reply-To: References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> <2jrzusvVl-XI8K734YlChq4ObRX75yovTq7mWTf8ZlA=.0e75781a-5d52-4919-ad28-c5e91ec3a47f@github.com> Message-ID: On Tue, 18 Feb 2025 06:23:20 GMT, Axel Boldt-Christmas wrote: > The ZGC refactoring lgtm. Thanks, Axel! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23235#issuecomment-2664921534 From chagedorn at openjdk.org Tue Feb 18 08:44:15 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 18 Feb 2025 08:44:15 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v3] In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 13:16:57 GMT, Roberto Castañeda Lozano wrote: >> This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node.
This information can be useful e.g. in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). Here is an example of a memory subgraph with the extended "Show types" filter enabled: >> >> ![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717) >> >> #### Testing >> >> - tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). >> >> - Tested IGV manually on a few selected graphs. Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`). > > Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Dump alias type information for each node src/hotspot/share/opto/idealGraphPrinter.cpp line 555: > 553: field->print_name_on(&field_stream); > 554: print_prop("alias_field", field_stream.freeze()); > 555: } I'm not sure if this is really required. There is already the "source" and "destination" dump for loads and stores, respectively: https://github.com/openjdk/jdk/blob/3353f8e0875165adbc8ee764a4c8d8817a87cd88/src/hotspot/share/opto/idealGraphPrinter.cpp#L695-L718 This also shows information for array accesses.
For example: ![image](https://github.com/user-attachments/assets/0d60fe56-aee8-4450-a680-78a2ac470d4b) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23621#discussion_r1959297113 From chagedorn at openjdk.org Tue Feb 18 08:46:21 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 18 Feb 2025 08:46:21 GMT Subject: RFR: 8348172: C2: Remove unused local variables in filter_helper() methods In-Reply-To: References: Message-ID: <8X0qkKtfPDatDyFiMur7S8MypBpJE185Rq6XKQ63lvY=.70ae2524-0ed6-40fb-9d01-2761ed124d1b@github.com> On Fri, 14 Feb 2025 10:55:28 GMT, Marc Chevalier wrote: > Remove useless locals from `TypeOopPtr::filter_helper` and `TypeKlassPtr::filter_helper`. There were no side effects in their init, so it's fine to remove them. > > Thanks, > Marc Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23629#pullrequestreview-2622813674 From duke at openjdk.org Tue Feb 18 08:46:21 2025 From: duke at openjdk.org (Marc Chevalier) Date: Tue, 18 Feb 2025 08:46:21 GMT Subject: Integrated: 8348172: C2: Remove unused local variables in filter_helper() methods In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 10:55:28 GMT, Marc Chevalier wrote: > Remove useless locals from `TypeOopPtr::filter_helper` and `TypeKlassPtr::filter_helper`. There were no side effects in their init, so it's fine to remove them. > > Thanks, > Marc This pull request has now been integrated. 
Changeset: 013fda1d Author: Marc Chevalier Committer: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/013fda1dad22d7aca3ee24c11dc42cb3885b5323 Stats: 4 lines in 1 file changed: 0 ins; 4 del; 0 mod 8348172: C2: Remove unused local variables in filter_helper() methods Reviewed-by: kvn, haosun, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/23629 From duke at openjdk.org Tue Feb 18 08:55:21 2025 From: duke at openjdk.org (Marc Chevalier) Date: Tue, 18 Feb 2025 08:55:21 GMT Subject: RFR: 8348172: C2: Remove unused local variables in filter_helper() methods In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 10:55:28 GMT, Marc Chevalier wrote: > Remove useless locals from `TypeOopPtr::filter_helper` and `TypeKlassPtr::filter_helper`. There were no side effects in their init, so it's fine to remove them. > > Thanks, > Marc Thanks all! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23629#issuecomment-2664975143 From rcastanedalo at openjdk.org Tue Feb 18 08:56:44 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 18 Feb 2025 08:56:44 GMT Subject: RFR: 8348645: IGV: visualize live ranges [v2] In-Reply-To: References: Message-ID: > This changeset extends IGV with live range visualization. It introduces live ranges as first-class IGV entities and displays them along with the control-flow graph in the CFG view. Visualizing liveness information should hopefully make C2's register allocator easier to understand, diagnose, debug, and enhance. > > Live ranges are visible in C2 phases where liveness information is available, that is, phases `Initial liveness` to `Fix up spills` at IGV print level 4 or greater.
For example, running a debug build of the JVM as follows: > > > java -Xbatch -XX:CompileCommand=IGVPrintLevel,java.util.HashMap::newNode,4 > > > produces the following visualization for the `Initial spilling` phase: > > ![initial-spilling](https://github.com/user-attachments/assets/1ecf74f5-92a8-4866-b1ec-2323bb0c428e) > > Live ranges are first-class IGV entities, meaning that the user can: > > - search, select, and extract them; > > ![search-extract](https://github.com/user-attachments/assets/8e0dfa59-457f-49cb-b2b5-1d202301c79d) > > - examine their properties in the `Properties` window or via tooltips; > > ![properties](https://github.com/user-attachments/assets/68d2d23b-b986-4d2e-835c-b661bce0de23) > > - navigate to related IGV entities via a pop-up menu; and > > ![popup](https://github.com/user-attachments/assets/21de2fef-d36a-42d5-b828-2696d87a18ea) > > - program filters that act on them according to their properties. > > ![filters](https://github.com/user-attachments/assets/e993b067-d0b8-452c-a885-c4e601e31e1c) > > Live ranges are connected to nodes by a use-def relation: a node can define zero or one live ranges, and use multiple live ranges; a live range can be defined and used by multiple nodes. Consequently, a live range in IGV is visible if and only if all its related nodes are visible (fully or semi-transparently). Generally, the start and end of a live range are vertically aligned with the nodes that first define and last use the live range. To reflect accurately the semantics of Phi nodes w.r.t. liveness, the visualization treats live ranges related by Phi nodes specially: live ranges used by a Phi node end at the bottom of the corresponding predecessor basic blocks, whereas live ranges defined by a Phi node start at the top of the node's basic block.
The following screenshot shows an example of a Phi node (`48 Phi`) joining live ranges `L8` and `L13` into `L15`: > > ![phi](https://github.com/user-attachments/assets/0ef8aa1d-523d-4391-982e-6b74c2016a3c) > > The changeset extends the IGV graph printing logic in HotSpot t... Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: Check if live range widgets actually exist for computing block visibility ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23558/files - new: https://git.openjdk.org/jdk/pull/23558/files/c5e48e46..00169223 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23558&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23558&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23558.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23558/head:pull/23558 PR: https://git.openjdk.org/jdk/pull/23558 From rcastanedalo at openjdk.org Tue Feb 18 08:56:44 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 18 Feb 2025 08:56:44 GMT Subject: RFR: 8348645: IGV: visualize live ranges In-Reply-To: References: Message-ID: <8ViA6x7l9mMjBEEfKR3LSICyAh4AANl_mq6wP5TEt9Y=.33b9b461-814f-48df-97da-214d0d44e4c3@github.com> On Mon, 17 Feb 2025 13:53:52 GMT, Roberto Castañeda Lozano wrote: > Thanks for the report Damon, will investigate! Commit 00169223 should fix the issue, thanks again.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23558#issuecomment-2664976626 From chagedorn at openjdk.org Tue Feb 18 09:01:28 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 18 Feb 2025 09:01:28 GMT Subject: RFR: 8349180: Remove redundant initialization in ciField constructor In-Reply-To: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> References: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> Message-ID: On Fri, 14 Feb 2025 15:00:06 GMT, Marc Chevalier wrote: > In `ciField`'s ctor, `_name` is initialized twice. I think we can indeed apply the suggested fix and remove the second assignment. `_name` is set correctly the first time (and without the useless cast), and not modified in between. > > Thanks, > Marc Looks good and trivial. > Hello @marc-chevalier! I have already opened a PR for this matter. PR is #23480. Hi @gustavosimon, the JBS issue was already assigned to @marc-chevalier. If you intend to work on an issue, please check the following: - The issue is already assigned in JBS? - Reach out to the assignee and ask if the person is currently working on the issue or has intentions to do so. If not, they can reassign it to you or someone else on your behalf (if you don't have a JBS account). - The issue is unassigned in JBS? - Assign the issue to yourself. - If you don't have a JBS account: Reach out to someone who can assign it to him/herself on your behalf. This avoids "stealing" work that was in progress or planned for later or, even worse, doing completely duplicated work, which is unfortunate. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23637#pullrequestreview-2622854831 PR Comment: https://git.openjdk.org/jdk/pull/23637#issuecomment-2664986150 From shade at openjdk.org Tue Feb 18 09:10:44 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 18 Feb 2025 09:10:44 GMT Subject: RFR: 8350210: CTW: Use stackless exceptions Message-ID: <9CmFDkEZMlMOZStFRYB-Czn6kpC4TWWgR-z_FGUIIKg=.b9d68ce9-bf31-4150-b86a-dbfd89139384@github.com> Looking at reducing CTW costs in our infra, there are a few simple improvements we can make. CTW runners compiling 3rd party JARs normally catch lots of stray exceptions when trying to load non-existing classes, e.g. for resolving the static final fields, or preloading the constant pool. Generating stack traces for these takes considerable time, and stack traces for those exceptions are not essential to debug CTW runs. So, we can summarily disable them. This has no effect on `applications/ctw/modules`. Compiling a large 3rd party JAR like `solr-core-7.4.0.jar`, for example, improves from ~15.3s to ~12.5s. 
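To illustrate why skipping stack traces saves work, here is a minimal Java sketch. It is not the CTW change itself (the message above does not show how the runner disables traces); it only demonstrates the standard `RuntimeException` constructor that turns off stack trace capture for one exception type. `StacklessDemo` and `StacklessException` are hypothetical names chosen for this example.

```java
// Illustration only: a hypothetical exception type that never captures a stack trace.
final class StacklessDemo {

    /** Exception that skips fillInStackTrace() by passing writableStackTrace = false. */
    static final class StacklessException extends RuntimeException {
        StacklessException(String message) {
            // Arguments: message, cause, enableSuppression, writableStackTrace.
            super(message, null, false, false);
        }
    }

    /** Number of stack frames recorded in the given throwable. */
    static int traceDepth(Throwable t) {
        return t.getStackTrace().length;
    }

    public static void main(String[] args) {
        // A regular exception walks the stack when it is constructed.
        System.out.println("regular frames:   " + traceDepth(new RuntimeException("x")));
        // The stackless variant records no frames, so construction is cheaper.
        System.out.println("stackless frames: " + traceDepth(new StacklessException("x")));
    }
}
```

The trade-off is the one described in the message: the exception is cheaper to create, but it carries no frames to help with debugging.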
------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/23671/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23671&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350210 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23671.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23671/head:pull/23671 PR: https://git.openjdk.org/jdk/pull/23671 From duke at openjdk.org Tue Feb 18 09:19:10 2025 From: duke at openjdk.org (Marc Chevalier) Date: Tue, 18 Feb 2025 09:19:10 GMT Subject: RFR: 8349180: Remove redundant initialization in ciField constructor In-Reply-To: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> References: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> Message-ID: On Fri, 14 Feb 2025 15:00:06 GMT, Marc Chevalier wrote: > In `ciField`'s ctor, `_name` is initialized twice. I think we can indeed apply the suggested fix and remove the second assignment. `_name` is set correctly the first time (and without the useless cast), and not modified in between. > > Thanks, > Marc Thanks Christian! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23637#issuecomment-2665031265 From duke at openjdk.org Tue Feb 18 09:19:10 2025 From: duke at openjdk.org (duke) Date: Tue, 18 Feb 2025 09:19:10 GMT Subject: RFR: 8349180: Remove redundant initialization in ciField constructor In-Reply-To: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> References: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> Message-ID: On Fri, 14 Feb 2025 15:00:06 GMT, Marc Chevalier wrote: > In `ciField`'s ctor, `_name` is initialized twice. I think we can indeed apply the suggested fix and remove the second assignment. `_name` is set correctly the first time (and without the useless cast), and not modified in between. 
> > Thanks, > Marc @marc-chevalier Your change (at version 1f37759cbf1712896b26cf054cc676184d67bfd5) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23637#issuecomment-2665034158 From duke at openjdk.org Tue Feb 18 09:29:16 2025 From: duke at openjdk.org (Marc Chevalier) Date: Tue, 18 Feb 2025 09:29:16 GMT Subject: Integrated: 8349180: Remove redundant initialization in ciField constructor In-Reply-To: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> References: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> Message-ID: On Fri, 14 Feb 2025 15:00:06 GMT, Marc Chevalier wrote: > In `ciField`'s ctor, `_name` is initialized twice. I think we can indeed apply the suggested fix and remove the second assignment. `_name` is set correctly the first time (and without the useless cast), and not modified in between. > > Thanks, > Marc This pull request has now been integrated. Changeset: ff05d979 Author: Marc Chevalier Committer: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/ff05d9795322fee6def559bd6776de42b96c27dc Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod 8349180: Remove redundant initialization in ciField constructor Reviewed-by: chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/23637 From roland at openjdk.org Tue Feb 18 09:35:16 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 09:35:16 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Mon, 11 Nov 2024 14:40:09 GMT, Emanuel Peter wrote: > Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. > > **Background** > > With `-XX:+AlignVector`, all vector loads/stores must be aligned. 
We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. > > **Problem** > > So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. > > > MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); > MemorySegment nativeUnaligned = nativeAligned.asSlice(1); > test3(nativeUnaligned); > > > When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! > > static void test3(MemorySegment ms) { > for (int i = 0; i < RANGE; i++) { > long adr = i * 4L; > int v = ms.get(ELEMENT_LAYOUT, adr); > ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); > } > } > > > **Solution: Runtime Checks - Predicate and Multiversioning** > > Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. > > I came up with 2 options where to place the runtime checks: > - A new "auto vectorization" Parse Predicate: > - This only works when predicates are available. > - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. > - Multiversion the loop: > - Create 2 copies of the loop (fast and slow loops). 
> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take > - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code. > - We "stall" the `... Would it make sense to add verification code that makes sure that whenever a loop is flagged as multi version, c2 can find the multi version guard (and maybe whenever there's a multi version guard, loops that are guarded are indeed flagged correctly)? src/hotspot/share/opto/loopTransform.cpp line 751: > 749: // Peeling also destroys the connection of the main loop > 750: // to the multiversion_if. > 751: cl->set_no_multiversion(); Would we want to change the multiversion guard at this point so it constant folds and the slow version is removed? src/hotspot/share/opto/loopUnswitch.cpp line 513: > 511: > 512: // Create new Region. > 513: RegionNode* region = new RegionNode(1); So we create a new `Region` every time a new condition is added? src/hotspot/share/opto/loopnode.cpp line 1097: > 1095: // PhaseIdealLoop::add_parse_predicate only checks trap limits per method, so > 1096: // we do a custom check here. > 1097: if (!C->too_many_traps(cloned_sfpt->jvms()->method(), cloned_sfpt->jvms()->bci(), Deoptimization::Reason_auto_vectorization_check)) { Isn't that done by `add_parse_predicate`? src/hotspot/share/opto/traceAutoVectorizationTag.hpp line 32: > 30: > 31: #define COMPILER_TRACE_AUTO_VECTORIZATION_TAG(flags) \ > 32: flags(POINTER_PARSING, "Trace VPointer/MemPointer parsing") \ Has anything changed here? I stared at it a few times and couldn't figure out what has. 
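The runtime check behind the multiversioning idea in the quoted PR description can be sketched in plain Java. This is only a model under stated assumptions: C2 performs the check on its IR, not in Java source; the addresses are made up; `ALIGNMENT` stands in for `ObjectAlignmentInBytes` with an assumed value of 8; and `AlignmentMultiversionSketch`, `isAligned`, and `selectLoop` are hypothetical names.

```java
// Model of a multiversion guard: pick the fast (vectorizable) or slow loop at runtime.
final class AlignmentMultiversionSketch {

    // Assumed alignment requirement in bytes (hypothetical stand-in for ObjectAlignmentInBytes).
    static final int ALIGNMENT = 8;

    /** True if address is a multiple of alignment; alignment must be a power of two. */
    static boolean isAligned(long address, int alignment) {
        return (address & (alignment - 1)) == 0;
    }

    /** Mirrors the role of the multiversion_if: the speculative check selects the loop version. */
    static String selectLoop(long base) {
        return isAligned(base, ALIGNMENT) ? "fast_loop" : "slow_loop";
    }

    public static void main(String[] args) {
        long nativeAligned = 0x1000L;             // made-up aligned base address
        long nativeUnaligned = nativeAligned + 1; // mirrors nativeAligned.asSlice(1)
        System.out.println(selectLoop(nativeAligned));   // takes the fast version
        System.out.println(selectLoop(nativeUnaligned)); // falls back to the slow version
    }
}
```

The point of the design is that the unaligned case still runs compiled code in the slow version, while the common aligned case keeps its vectorized loop.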
------------- PR Review: https://git.openjdk.org/jdk/pull/22016#pullrequestreview-2622881581 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959338954 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959344256 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959347164 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959349092 From roland at openjdk.org Tue Feb 18 09:48:14 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 09:48:14 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 15:24:44 GMT, Emanuel Peter wrote: > > Do you intend to use a single deoptimization reason for all vectorization related predicates? (that is when you take care of aliasing, are you going to to use the same reason for aliasing and alignment checks) > > I suppose that is currently what I'm planning. But we could in principle separate them. But I would leave that for later, if there is any desire to do that. For now, I think it's ok to just go with a single "auto-vectorization" reason. > > Does that sound reasonable? Yes, it sounds reasonable. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2665104472 From epeter at openjdk.org Tue Feb 18 09:48:16 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 09:48:16 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: <47tXBG3sQGZVEE5Ya2wr46CopmDjy8OClbpqagIsjgA=.6d07b495-4777-4c7e-a3b7-820f100ec2c0@github.com> On Tue, 18 Feb 2025 09:09:15 GMT, Roland Westrelin wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. 
>> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! >> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. 
>> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > src/hotspot/share/opto/loopTransform.cpp line 751: > >> 749: // Peeling also destroys the connection of the main loop >> 750: // to the multiversion_if. >> 751: cl->set_no_multiversion(); > > Would we want to change the multiversion guard at this point so it constant folds and the slow version is removed? I suppose we can probably do that. Otherwise, we just have to wait until the `OpaqueMultiversioningNode` constant folds after loop-opts. > src/hotspot/share/opto/loopUnswitch.cpp line 513: > >> 511: >> 512: // Create new Region. >> 513: RegionNode* region = new RegionNode(1); > > So we create a new `Region` every time a new condition is added? Yes. Are you ok with that? Or would you prefer if we extended an existing region (is that possible?) and then we'd have 2 cases, one where there is none yet, and one where we'd extend. I think adding one each time is easier, and it would get commoned anyway, right? > src/hotspot/share/opto/traceAutoVectorizationTag.hpp line 32: > >> 30: >> 31: #define COMPILER_TRACE_AUTO_VECTORIZATION_TAG(flags) \ >> 32: flags(POINTER_PARSING, "Trace VPointer/MemPointer parsing") \ > > Has anything changed here? I stared at it a few times and couldn't figure out what has. I added the tag `SPECULATIVE_RUNTIME_CHECKS`. 
And then had to change alignment for all others ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959397988 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959392450 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959394676 From epeter at openjdk.org Tue Feb 18 09:51:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 09:51:15 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 09:14:28 GMT, Roland Westrelin wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! 
>> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > src/hotspot/share/opto/loopnode.cpp line 1097: > >> 1095: // PhaseIdealLoop::add_parse_predicate only checks trap limits per method, so >> 1096: // we do a custom check here. >> 1097: if (!C->too_many_traps(cloned_sfpt->jvms()->method(), cloned_sfpt->jvms()->bci(), Deoptimization::Reason_auto_vectorization_check)) { > > Isn't that done by `add_parse_predicate`? @rwestrel I only see `if (!C->too_many_traps(reason)) {` in `PhaseIdealLoop::add_parse_predicate`. And as the comment I put here that only checks the `reason` per `method`, and not per `bci`. Do you see anything else? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959403871 From aturbanov at openjdk.org Tue Feb 18 09:51:15 2025 From: aturbanov at openjdk.org (Andrey Turbanov) Date: Tue, 18 Feb 2025 09:51:15 GMT Subject: RFR: 8348645: IGV: visualize live ranges [v2] In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 08:56:44 GMT, Roberto Castañeda Lozano wrote: >> This changeset extends IGV with live range visualization. It introduces live ranges as first-class IGV entities and displays them along with the control-flow graph in the CFG view. Visualizing liveness information should hopefully make C2's register allocator easier to understand, diagnose, debug, and enhance. >> >> Live ranges are visible in C2 phases where liveness information is available, that is, phases `Initial liveness` to `Fix up spills` at IGV print level 4 or greater. For example, running a debug build of the JVM as follows: >> >> >> java -Xbatch -XX:CompileCommand=IGVPrintLevel,java.util.HashMap::newNode,4 >> >> >> produces the following visualization for the `Initial spilling` phase: >> >> ![initial-spilling](https://github.com/user-attachments/assets/1ecf74f5-92a8-4866-b1ec-2323bb0c428e) >> >> Live ranges are first-class IGV entities, meaning that the user can: >> >> - search, select, and extract them; >> >> ![search-extract](https://github.com/user-attachments/assets/8e0dfa59-457f-49cb-b2b5-1d202301c79d) >> >> - examine their properties in the `Properties` window or via tooltips; >> >> ![properties](https://github.com/user-attachments/assets/68d2d23b-b986-4d2e-835c-b661bce0de23) >> >> - navigate to related IGV entities via a pop-up menu; and >> >> ![popup](https://github.com/user-attachments/assets/21de2fef-d36a-42d5-b828-2696d87a18ea) >> >> - program filters that act on them according to their properties. 
>> >> ![filters](https://github.com/user-attachments/assets/e993b067-d0b8-452c-a885-c4e601e31e1c) >> >> Live ranges are connected to nodes by a use-def relation: a node can define zero or one live ranges, and use multiple live ranges; a live range can be defined and used by multiple nodes. Consequently, a live range in IGV is visible if and only if all its related nodes are visible (fully or semi-transparently). Generally, the start and end of a live range are vertically aligned with the nodes that first define and last use the live range. To reflect accurately the semantics of Phi nodes w.r.t. liveness, the visualization treats live ranges related by Phi nodes specially: live ranges used by a Phi node end at the bottom of the corresponding predecessor basic blocks, whereas live ranges defined by a Phi node start at the top of the node's basic block. The following screenshot shows an example of a Phi node (`48 Phi`) joining live ranges `L8` and `L13` into `L15`: >> >> ![phi](https://github.com/user-attachments/assets/0ef8aa1d-523d-4391-982e-6b74c2016a3c... 
> > Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Check if live range widgets actually exist for computing block visibility src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/ShowLiveRangesAction.java line 34: > 32: import org.openide.util.ImageUtilities; > 33: > 34: public class ShowLiveRangesAction extends AbstractAction implements PropertyChangeListener { Suggestion: public class ShowLiveRangesAction extends AbstractAction implements PropertyChangeListener { ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23558#discussion_r1959403130 From epeter at openjdk.org Tue Feb 18 09:56:13 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 09:56:13 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 09:32:19 GMT, Roland Westrelin wrote: > Would it make sense to add verification code that makes sure that whenever a loop is flagged as multi version, c2 can find the multi version guard (and maybe whenever there's a multi version guard, loops that are guarded are indeed flagged correctly)? I'd have to see if that is possible. Well: > verification code that makes sure that whenever a loop is flagged as multi version, c2 can find the multi version guard That is maybe possible. At least I cannot think of a reason why it should not work right now. Well, maybe what if the predicates get messed up somehow, is that possible? Then you would lose connection. Ah: what if the pre-loop somehow gets "messed up", i.e. that it loses its loop structure. Then we could not really go from the main-loop to the pre-loop to the selector-if any more. > whenever there's a multi version guard, loops that are guarded are indeed flagged correctly That one is more tricky. Because what if the loop somehow gets folded away? 
How would we catch that? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2665123097 From roland at openjdk.org Tue Feb 18 09:56:14 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 09:56:14 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 09:48:58 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopnode.cpp line 1097: >> >>> 1095: // PhaseIdealLoop::add_parse_predicate only checks trap limits per method, so >>> 1096: // we do a custom check here. >>> 1097: if (!C->too_many_traps(cloned_sfpt->jvms()->method(), cloned_sfpt->jvms()->bci(), Deoptimization::Reason_auto_vectorization_check)) { >> >> Isn't that done by `add_parse_predicate`? > > @rwestrel I only see `if (!C->too_many_traps(reason)) {` in `PhaseIdealLoop::add_parse_predicate`. And as the comment I put here that only checks the `reason` per `method`, and not per `bci`. Do you see anything else? Seems like it's a bug that `PhaseIdealLoop::add_parse_predicate` doesn't check the `bci` too. Could you fix it? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959411405 From syan at openjdk.org Tue Feb 18 09:59:22 2025 From: syan at openjdk.org (SendaoYan) Date: Tue, 18 Feb 2025 09:59:22 GMT Subject: RFR: 8350178: Incorrect comment after JDK-8345580 [v2] In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 10:52:00 GMT, SendaoYan wrote: >> Hi all, >> In [JDK-8345580](https://bugs.openjdk.org/browse/JDK-8345580), the const modifier for variable Node::_idx has been removed, but the constant description for variable Node::_idx was left unchanged. The related description should be updated. >> >> Only touch the comments, no risk. > > SendaoYan has updated the pull request incrementally with one additional commit since the last revision: > > Remove extra a whitespace Thanks all for the reviews. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23659#issuecomment-2665132373 From syan at openjdk.org Tue Feb 18 09:59:23 2025 From: syan at openjdk.org (SendaoYan) Date: Tue, 18 Feb 2025 09:59:23 GMT Subject: Integrated: 8350178: Incorrect comment after JDK-8345580 In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 09:28:38 GMT, SendaoYan wrote: > Hi all, > In [JDK-8345580](https://bugs.openjdk.org/browse/JDK-8345580), the const modifier for variable Node::_idx has been removed, but the constant description for variable Node::_idx leave unchange. The related description should be updated. > > Only touch the comments, no risk. This pull request has now been integrated. Changeset: d7baae3e Author: SendaoYan URL: https://git.openjdk.org/jdk/commit/d7baae3ee92bbc94e380703f173a4d4a9de75e29 Stats: 5 lines in 1 file changed: 0 ins; 2 del; 3 mod 8350178: Incorrect comment after JDK-8345580 Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/23659 From roland at openjdk.org Tue Feb 18 10:02:09 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 10:02:09 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v10] In-Reply-To: References: Message-ID: > To optimize a long counted loop and long range checks in a long or int > counted loop, the loop is turned into a loop nest. When the loop has > few iterations, the overhead of having an outer loop whose backedge is > never taken, has a measurable cost. Furthermore, creating the loop > nest usually causes one iteration of the loop to be peeled so > predicates can be set up. If the loop is short running, then it's an > extra iteration that's run with range checks (compared to an int > counted loop with int range checks). 
> > This change doesn't create a loop nest when: > > 1- it can be determined statically at loop nest creation time that the > loop runs for a short enough number of iterations > > 2- profiling reports that the loop runs for no more than ShortLoopIter > iterations (1000 by default). > > For 2-, a guard is added which is implemented as yet another predicate. > > While this change is in principle simple, I ran into a few > implementation issues: > > - while c2 has a way to compute the number of iterations of an int > counted loop, it doesn't have that for long counted loops. The > existing logic for int counted loops promotes values to long to > avoid overflows. I reworked it so it now works for both long and int > counted loops. > > - I added a new deoptimization reason (Reason_short_running_loop) for > the new predicate. Given the number of iterations is narrowed down > by the predicate, the limit of the loop after transformation is a > cast node that's control dependent on the short running loop > predicate. Because once the counted loop is transformed, it is > likely that range check predicates will be inserted and they will > depend on the limit, the short running loop predicate has to be the > one that's further away from the loop entry. Now it is also possible > that the limit before transformation depends on a predicate > (TestShortRunningLongCountedLoopPredicatesClone is an example), we > can have: new predicates inserted after the transformation that > depend on the casted limit that itself depends on old predicates > added before the transformation. To solve this circular dependency, > parse and assert predicates are cloned between the old predicates > and the loop head. The cloned short running loop parse predicate is > the one that's used to insert the short running loop predicate. 
> > - In the case of a long counted loop, the loop is transformed into a > regular loop with a new limit and transformed range checks that's > later turned into an int counted loop. The int ... Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 33 commits: - Merge branch 'master' into JDK-8342692 - TestMemorySegment test fix - test wip - Merge branch 'master' into JDK-8342692 - refactor - Merge branch 'master' into JDK-8342692 - Merge branch 'master' into JDK-8342692 - Merge branch 'master' into JDK-8342692 - Merge branch 'master' into JDK-8342692 - review - ... and 23 more: https://git.openjdk.org/jdk/compare/28e744dc...3df20871 ------------- Changes: https://git.openjdk.org/jdk/pull/21630/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21630&range=09 Stats: 1316 lines in 25 files changed: 1254 ins; 16 del; 46 mod Patch: https://git.openjdk.org/jdk/pull/21630.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21630/head:pull/21630 PR: https://git.openjdk.org/jdk/pull/21630 From bulasevich at openjdk.org Tue Feb 18 10:04:19 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 18 Feb 2025 10:04:19 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v10] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> <9sHQ2GZxt0TERM5ghWCA2hArWxsdIErWZIAEJ9e1N3I=.4928b81a-be09-43a8-94c6-75e7bd645ed9@github.com> Message-ID: On Thu, 13 Feb 2025 11:55:17 GMT, Boris Ulasevich wrote: >> Looks good. I will submit testing. > >> Looks good. I will submit testing. > > Thank you! > > The change is not yet ready for final testing. I still need to remove my raw access workaround in nmethod::oop_at and rebase onto #23512 once it has been integrated. > @bulasevich my other PR #23533 is ready. It will conflict with your changes. Are you okay if I push it first? Absolutely! Thanks! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2665147449 From epeter at openjdk.org Tue Feb 18 10:07:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 10:07:18 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 09:53:14 GMT, Roland Westrelin wrote: >> @rwestrel I only see `if (!C->too_many_traps(reason)) {` in `PhaseIdealLoop::add_parse_predicate`. And as the comment I put here that only checks the `reason` per `method`, and not per `bci`. Do you see anything else? > > Seems like it's a bug that `PhaseIdealLoop::add_parse_predicate` doesn't check the `bci` too. Could you fix it? @rwestrel So we would check both, right? But is that what we want for all predicates? `C->too_many_traps(reason)` checks against `PerMethodTrapLimit`: if (trap_count(reason) >= Deoptimization::per_method_trap_limit(reason)) { But the `bci` check works with `PerBytecodeTrapLimit`, and it actually has a comment like this: if (md->has_trap_at(bci, m, reason) != 0) { // Assume PerBytecodeTrapLimit==0, for a more conservative heuristic. // Also, if there are multiple reasons, or if there is no per-BCI record, // assume the worst. So the `bci` check fails if there has been even a single trapping recorded. So it seems that such a change would affect the behavior in ways I cannot yet predict. What do you think? 
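To make the difference between the two checks concrete, here is a toy Java model of the behavior described above. It is only a sketch under stated assumptions, not HotSpot code: the class `TrapLimitModel`, its methods, and the limit value are all hypothetical, and the model ignores the deoptimization `reason` entirely.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (hypothetical, not HotSpot code) of a method-wide trap check
// versus a per-bci trap check.
public class TrapLimitModel {
    // Illustrative stand-in for PerMethodTrapLimit; the value is an assumption.
    public static final int PER_METHOD_TRAP_LIMIT = 100;

    private int methodTrapCount = 0;
    private final Map<Integer, Integer> trapsAtBci = new HashMap<>();

    // Record one trap at the given bytecode index.
    public void recordTrap(int bci) {
        methodTrapCount++;
        trapsAtBci.merge(bci, 1, Integer::sum);
    }

    // Method-wide check: only trips once the whole method has trapped often.
    public boolean tooManyTrapsInMethod() {
        return methodTrapCount >= PER_METHOD_TRAP_LIMIT;
    }

    // Per-bci check: trips after a single recorded trap at that bci.
    public boolean tooManyTrapsAt(int bci) {
        return trapsAtBci.getOrDefault(bci, 0) != 0;
    }

    public static void main(String[] args) {
        TrapLimitModel m = new TrapLimitModel();
        m.recordTrap(42);
        System.out.println(m.tooManyTrapsInMethod()); // false: 1 < 100
        System.out.println(m.tooManyTrapsAt(42));     // true: one trap is enough
        System.out.println(m.tooManyTrapsAt(7));      // false: nothing recorded
    }
}
```

In this model a single trap flips the per-bci answer while the method-wide answer stays unchanged, which is why adding the per-bci check could change behavior for existing predicates.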
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959431345 From epeter at openjdk.org Tue Feb 18 10:11:16 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 10:11:16 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> Message-ID: On Tue, 18 Feb 2025 09:57:29 GMT, Roland Westrelin wrote: > > That one is more tricky. Because what if the loop somehow gets folded away? How would we catch that? > There is code that removes the OuterStripMinedLoop if the CountedLoop goes away and also, if I recall correctly, logic that verifies no `OuterStripMinedLoop` is left behind without a `CountedLoop` so it's probably possible. Question is whether we want that or not. Seems like quite a bit of extra complexity. Hmm ok, I see. I wonder how bad it is to leave the slow-loop there until after loop-opts. I mean it was already created, and it now has no loop-opts performed on it (it is stalled), so it just sits there like dead code. So I'm not sure there is really a performance benefit to kill it already a little earlier. Maybe a very small one? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2665161507 From roland at openjdk.org Tue Feb 18 10:11:17 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 10:11:17 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> On Tue, 18 Feb 2025 10:04:59 GMT, Emanuel Peter wrote: >> Seems like it's a bug that `PhaseIdealLoop::add_parse_predicate` doesn't check the `bci` too. 
Could you fix it? > > @rwestrel So we would check both, right? But is that what we want for all predicates? > > `C->too_many_traps(reason)` checks against `PerMethodTrapLimit`: > > if (trap_count(reason) >= Deoptimization::per_method_trap_limit(reason)) { > > > But the `bci` check works with `PerBytecodeTrapLimit`, and it actually has a comment like this: > > if (md->has_trap_at(bci, m, reason) != 0) { > // Assume PerBytecodeTrapLimit==0, for a more conservative heuristic. > // Also, if there are multiple reasons, or if there is no per-BCI record, > // assume the worst. > > So the `bci` check fails if there has been even a single trapping recorded. > > So it seems that such a change would affect the behavior in ways I cannot yet predict. > > What do you think? That code is supposed to mirror the `GraphKit::add_parse_predicate()`. It doesn't. Would you like me to fix this separately? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959437628 From luhenry at openjdk.org Tue Feb 18 10:14:14 2025 From: luhenry at openjdk.org (Ludovic Henry) Date: Tue, 18 Feb 2025 10:14:14 GMT Subject: RFR: 8349908: RISC-V: C2 SelectFromTwoVector [v3] In-Reply-To: <3booXWD4LZQaI9r1VftWWjuFUN71kNIrwsgNTXjSkQQ=.d6844332-2d54-4dca-a44e-4e8bf6339e5d@github.com> References: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> <3booXWD4LZQaI9r1VftWWjuFUN71kNIrwsgNTXjSkQQ=.d6844332-2d54-4dca-a44e-4e8bf6339e5d@github.com> Message-ID: On Mon, 17 Feb 2025 15:15:27 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review the patch? >> This optimization is mainly for the vector API. 
>> >> Thanks >> >> ## Test >> >> ### jtreg >> test/jdk/jdk/incubator/vector/ >> >> ### Performance >> run on bananapi >> >> master vs patch >> >> Benchmark | (size) | Mode | Cnt | Score -master | Error - master | Score - patch | Error - patch | Units | Improvement (master / patch) >> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- >> SelectFromBenchmark.selectFromByteVector | 1024 | avgt | 10 | 26422.495 | 674.565 | 721.604 | 1.036 | ns/op | 36.616 >> SelectFromBenchmark.selectFromByteVector | 2048 | avgt | 10 | 53964.411 | 1751.618 | 1385.24 | 0.956 | ns/op | 38.957 >> SelectFromBenchmark.selectFromDoubleVector | 1024 | avgt | 10 | 218430.616 | 1369.409 | 7739.774 | 14.408 | ns/op | 28.222 >> SelectFromBenchmark.selectFromDoubleVector | 2048 | avgt | 10 | 387889.456 | 7889.791 | 16197.77 | 66.775 | ns/op | 23.947 >> SelectFromBenchmark.selectFromFloatVector | 1024 | avgt | 10 | 103483.717 | 492.525 | 3580.358 | 29.127 | ns/op | 28.903 >> SelectFromBenchmark.selectFromFloatVector | 2048 | avgt | 10 | 226125.02 | 3118.836 | 7797.025 | 4.346 | ns/op | 29.001 >> SelectFromBenchmark.selectFromIntVector | 1024 | avgt | 10 | 97007.999 | 2607.711 | 2898.38 | 0.833 | ns/op | 33.47 >> SelectFromBenchmark.selectFromIntVector | 2048 | avgt | 10 | 222303.308 | 3096.615 | 6398.214 | 30.345 | ns/op | 34.745 >> SelectFromBenchmark.selectFromLongVector | 1024 | avgt | 10 | 245033.436 | 1652.527 | 6307.773 | 24.597 | ns/op | 38.846 >> SelectFromBenchmark.selectFromLongVector | 2048 | avgt | 10 | 438503.547 | 5972.265 | 17215.996 | 167.442 | ns/op | 25.471 >> SelectFromBenchmark.selectFromShortVector | 1024 | avgt | 10 | 53632.502 | 2159.433 | 1418.215 | 2.953 | ns/op | 37.817 >> SelectFromBenchmark.selectFromShortVector | 2048 | avgt | 10 | 111764.327 | 1220.509 | 3061.386 | 14.716 | ns/op | 36.508 >> >> > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > space Marked as reviewed by luhenry (Committer). 
src/hotspot/cpu/riscv/riscv_v.ad line 4: > 2: // Copyright (c) 2020, 2025, Oracle and/or its affiliates. All rights reserved. > 3: // Copyright (c) 2020, 2023, Arm Limited. All rights reserved. > 4: // Copyright (c) 2020, 2022, Huawei Technologies Co., Ltd. All rights reserved. Instead of modifying the Oracle Copyright, we should add one for Rivos. That can be done in a future PR as well. ------------- PR Review: https://git.openjdk.org/jdk/pull/23614#pullrequestreview-2623049316 PR Review Comment: https://git.openjdk.org/jdk/pull/23614#discussion_r1959441083 From tschatzl at openjdk.org Tue Feb 18 10:18:20 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 18 Feb 2025 10:18:20 GMT Subject: RFR: 8346280: C2: implement late barrier elision for G1 [v6] In-Reply-To: <2jrzusvVl-XI8K734YlChq4ObRX75yovTq7mWTf8ZlA=.0e75781a-5d52-4919-ad28-c5e91ec3a47f@github.com> References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> <2jrzusvVl-XI8K734YlChq4ObRX75yovTq7mWTf8ZlA=.0e75781a-5d52-4919-ad28-c5e91ec3a47f@github.com> Message-ID: On Fri, 7 Feb 2025 14:48:51 GMT, Roberto Casta?eda Lozano wrote: >> G1 barriers can be safely elided from writes to newly allocated objects as long as no safepoint is taken between the allocation and the write. This changeset complements early G1 barrier elision (performed by the platform-independent phases of C2, and limited to writes immediately following allocations) with a more general elision pass done at a late stage. >> >> The late elision pass exploits that it runs at a stage where the relative order of memory accesses and safepoints cannot change anymore to elide barriers from initialization writes that do not immediately follow the corresponding allocation, e.g. in conditional initialization writes: >> >> >> o = new MyObject(); >> if (...) 
{ >> o.myField = ...; // barrier elided only after this changeset >> // (assuming no safepoint in the if condition) >> } >> >> >> or in initialization writes placed after exception-throwing checks: >> >> >> o = new MyObject(); >> if (...) { >> throw new Exception(""); >> } >> o.myField = ...; // barrier elided only after this changeset >> // (assuming no safepoint in the above if condition) >> >> >> These patterns are commonly found in Java code, e.g. in the core libraries: >> >> - [conditional initialization](https://github.com/openjdk/jdk/blob/25fecaaf87400af535c242fe50296f1f89ceeb16/src/java.base/share/classes/java/lang/String.java#L4850), or >> >> - [initialization after exception-throwing checks (in the superclass constructor)](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/nio/X-Buffer.java.template#L324). >> >> The optimization also enhances barrier elision for array initialization writes, for example eliding barriers from small array initialization loops (for which safepoints are not inserted): >> >> >> Object[] a = new Object[...]; >> for (int i = 0; i < a.length; i++) { >> a[i] = ...; // barrier elided only after this changeset >> } >> >> >> or eliding barriers from array initialization writes with unknown array index: >> >> >> Object[] a = new Object[...]; >> a[index] = ...; // barrier elided only after this changeset >> >> >> The logic used to perform this additional barrier elision is a subset of a pre-existing ZGC-specific optimization. This changeset simply reuses the relevant subset (barrier elision for writes to newly-allocated objects) by extracting the core of the optimization logic from `zBarrierSetC2.cpp` into the GC-shared file `barrierSetC2.cpp`. The functions `block_has_safepoint`, `block_inde... 
> > Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Disable test IR checks for cases where barrier elision analysis fails to elide on s390 Marked as reviewed by tschatzl (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23235#pullrequestreview-2623060636 From epeter at openjdk.org Tue Feb 18 10:20:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 10:20:15 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> References: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> Message-ID: On Tue, 18 Feb 2025 10:09:00 GMT, Roland Westrelin wrote: > That code Which code are you referring to? Ah, probably you are talking about `PhaseIdealLoop::add_parse_predicate`, which is using the method wide check. And `GraphKit::add_parse_predicate` actually queries `GraphKit::too_many_traps`, which knows the current `bci()`, and can query the per-bci count. > Would you like me to fix this separately? Yes, please. I definitely don't want to do it in this PR ;) And I don't have as much experience with traps as you do. We'd have to think a little about what cases this affects, and if performance would go up or down in all those cases. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959451204 From mli at openjdk.org Tue Feb 18 10:21:32 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 18 Feb 2025 10:21:32 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v4] In-Reply-To: References: Message-ID: <1y5D_AbJm8ORlp2WP_w7NqL66eycsikh4ow_8jSeDFg=.da29e23f-4c8c-4d53-a9e5-c2824073a46e@github.com> > Hi, > Can you help to review this patch to implement MulReductionVI/MulReductionVL/MulReductionVF/MulReductionVD? 
> This optimization is mainly for the vector API. > On riscv, there is no straightforward instructions to do it, but we can do it with a reduction tree, which could reduce the time complexity to lg(N). > > > Thanks > > ## Test > > ### jtreg > test/jdk/jdk/incubator/vector/ > > ### Performance > > run on bananapi > > master vs patch > > Benchmark | (size) | Mode | Cnt | Score - master | Error - patch | Score - patch | Error - patch | Units | Improvement > -- | -- | -- | -- | -- | -- | -- | -- | -- | -- > ByteMaxVector.MULLanes | 1024 | avgt | 10 | 11170.052 | 499.676 | 1294.424 | 8.346 | ns/op | 88.40% > ByteMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 12514.587 | 55.06 | 1413.028 | 0.259 | ns/op | 88.70% > DoubleMaxVector.MULLanes | 1024 | avgt | 10 | 57672.51 | 1750.417 | 4775.633 | 1.454 | ns/op | 91.70% > DoubleMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63656.523 | 1063.048 | 5899.259 | 1.692 | ns/op | 90.70% > FloatMaxVector.MULLanes | 1024 | avgt | 10 | 30997.218 | 728.73 | 2473.069 | 5.84 | ns/op | 92.00% > FloatMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 35515.329 | 227.873 | 3284.17 | 0.608 | ns/op | 90.80% > IntMaxVector.MULLanes | 1024 | avgt | 10 | 31130.453 | 878.261 | 3304.118 | 5.96 | ns/op | 89.40% > IntMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 36851.976 | 394.001 | 3969.407 | 0.511 | ns/op | 89.20% > LongMaxVector.MULLanes | 1024 | avgt | 10 | 58795.752 | 1030.985 | 6883.995 | 3.15 | ns/op | 88.30% > LongMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63642.904 | 386.521 | 7892.735 | 9.359 | ns/op | 87.60% > ShortMaxVector.MULLanes | 1024 | avgt | 10 | 16857.441 | 762.428 | 2287.141 | 0.186 | ns/op | 86.40% > ShortMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 21171.375 | 74.684 | 2532.913 | 0.274 | ns/op | 88.00% > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: comments, minor ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23580/files - new: 
https://git.openjdk.org/jdk/pull/23580/files/0dedc1bc..d97dc14a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23580&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23580&range=02-03 Stats: 41 lines in 3 files changed: 9 ins; 18 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/23580.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23580/head:pull/23580 PR: https://git.openjdk.org/jdk/pull/23580 From mli at openjdk.org Tue Feb 18 10:21:32 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 18 Feb 2025 10:21:32 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v3] In-Reply-To: References: Message-ID: <4fhBovVbAhjQhVhYDSf4XCr-JWoeQTcT6TVk-zM3gdY=.490462ed-5d03-4e17-8bee-e1354aeea250@github.com> On Tue, 18 Feb 2025 03:15:56 GMT, Fei Yang wrote: > Thanks for the update. Several more comments after another look. Thanks, all fixed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23580#issuecomment-2665191049 From mli at openjdk.org Tue Feb 18 10:24:32 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 18 Feb 2025 10:24:32 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v5] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch to implement MulReductionVI/MulReductionVL/MulReductionVF/MulReductionVD? > This optimization is mainly for the vector API. > On riscv, there is no straightforward instructions to do it, but we can do it with a reduction tree, which could reduce the time complexity to lg(N). 
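The reduction-tree idea mentioned above can be illustrated with a small scalar Java sketch. This is not the RISC-V code from the patch; the class `MulReductionTree` and the method `mulReduce` are hypothetical names, and the sketch assumes the lane count is a power of two. Each round folds the upper half of the lanes onto the lower half, so N lanes need log2(N) rounds, and on vector hardware each round could be a single vector multiply.

```java
// Scalar model (hypothetical, not the patch's code) of a multiply
// reduction tree over the lanes of a vector.
public class MulReductionTree {
    // Multiplies all lanes together in log2(lanes.length) folding rounds.
    // Assumes lanes.length is a power of two.
    public static int mulReduce(int[] lanes) {
        int[] v = lanes.clone();
        for (int width = v.length / 2; width >= 1; width /= 2) {
            // One round: lane i absorbs lane i + width.
            for (int i = 0; i < width; i++) {
                v[i] *= v[i + width];
            }
        }
        return v[0];
    }

    public static void main(String[] args) {
        // 8 lanes are reduced in 3 rounds: 8 -> 4 -> 2 -> 1.
        System.out.println(mulReduce(new int[] {1, 2, 3, 4, 5, 6, 7, 8})); // 40320
    }
}
```

Integer multiplication is associative even with overflow, so the folded result matches a sequential product for `int` lanes; for floating-point lanes the different evaluation order can change rounding.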
> > > Thanks > > ## Test > > ### jtreg > test/jdk/jdk/incubator/vector/ > > ### Performance > > run on bananapi > > master vs patch > > Benchmark | (size) | Mode | Cnt | Score - master | Error - patch | Score - patch | Error - patch | Units | Improvement > -- | -- | -- | -- | -- | -- | -- | -- | -- | -- > ByteMaxVector.MULLanes | 1024 | avgt | 10 | 11170.052 | 499.676 | 1294.424 | 8.346 | ns/op | 88.40% > ByteMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 12514.587 | 55.06 | 1413.028 | 0.259 | ns/op | 88.70% > DoubleMaxVector.MULLanes | 1024 | avgt | 10 | 57672.51 | 1750.417 | 4775.633 | 1.454 | ns/op | 91.70% > DoubleMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63656.523 | 1063.048 | 5899.259 | 1.692 | ns/op | 90.70% > FloatMaxVector.MULLanes | 1024 | avgt | 10 | 30997.218 | 728.73 | 2473.069 | 5.84 | ns/op | 92.00% > FloatMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 35515.329 | 227.873 | 3284.17 | 0.608 | ns/op | 90.80% > IntMaxVector.MULLanes | 1024 | avgt | 10 | 31130.453 | 878.261 | 3304.118 | 5.96 | ns/op | 89.40% > IntMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 36851.976 | 394.001 | 3969.407 | 0.511 | ns/op | 89.20% > LongMaxVector.MULLanes | 1024 | avgt | 10 | 58795.752 | 1030.985 | 6883.995 | 3.15 | ns/op | 88.30% > LongMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63642.904 | 386.521 | 7892.735 | 9.359 | ns/op | 87.60% > ShortMaxVector.MULLanes | 1024 | avgt | 10 | 16857.441 | 762.428 | 2287.141 | 0.186 | ns/op | 86.40% > ShortMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 21171.375 | 74.684 | 2532.913 | 0.274 | ns/op | 88.00% > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: copyright ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23580/files - new: https://git.openjdk.org/jdk/pull/23580/files/d97dc14a..f4c24e69 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23580&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23580&range=03-04 Stats: 1 line in 1 file 
changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23580.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23580/head:pull/23580 PR: https://git.openjdk.org/jdk/pull/23580 From rcastanedalo at openjdk.org Tue Feb 18 10:26:20 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 18 Feb 2025 10:26:20 GMT Subject: Integrated: 8346280: C2: implement late barrier elision for G1 In-Reply-To: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> References: <3eOK-nFYQbKn1w81CWHUY14wk0gyWMT5ULHgZ-ih5-w=.8be51ad0-f412-4aad-b73a-436ccdb8181a@github.com> Message-ID: On Wed, 22 Jan 2025 15:20:19 GMT, Roberto Casta?eda Lozano wrote: > G1 barriers can be safely elided from writes to newly allocated objects as long as no safepoint is taken between the allocation and the write. This changeset complements early G1 barrier elision (performed by the platform-independent phases of C2, and limited to writes immediately following allocations) with a more general elision pass done at a late stage. > > The late elision pass exploits that it runs at a stage where the relative order of memory accesses and safepoints cannot change anymore to elide barriers from initialization writes that do not immediately follow the corresponding allocation, e.g. in conditional initialization writes: > > > o = new MyObject(); > if (...) { > o.myField = ...; // barrier elided only after this changeset > // (assuming no safepoint in the if condition) > } > > > or in initialization writes placed after exception-throwing checks: > > > o = new MyObject(); > if (...) { > throw new Exception(""); > } > o.myField = ...; // barrier elided only after this changeset > // (assuming no safepoint in the above if condition) > > > These patterns are commonly found in Java code, e.g. 
in the core libraries: > > - [conditional initialization](https://github.com/openjdk/jdk/blob/25fecaaf87400af535c242fe50296f1f89ceeb16/src/java.base/share/classes/java/lang/String.java#L4850), or > > - [initialization after exception-throwing checks (in the superclass constructor)](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/nio/X-Buffer.java.template#L324). > > The optimization also enhances barrier elision for array initialization writes, for example eliding barriers from small array initialization loops (for which safepoints are not inserted): > > > Object[] a = new Object[...]; > for (int i = 0; i < a.length; i++) { > a[i] = ...; // barrier elided only after this changeset > } > > > or eliding barriers from array initialization writes with unknown array index: > > > Object[] a = new Object[...]; > a[index] = ...; // barrier elided only after this changeset > > > The logic used to perform this additional barrier elision is a subset of a pre-existing ZGC-specific optimization. This changeset simply reuses the relevant subset (barrier elision for writes to newly-allocated objects) by extracting the core of the optimization logic from `zBarrierSetC2.cpp` into the GC-shared file `barrierSetC2.cpp`. The functions `block_has_safepoint`, `block_index`, `look_through_node`, `is_{undefined|unknown|concrete}`, `get_base_and_offset`, `is_array... This pull request has now been integrated. 
Changeset: 8193e0d5 Author: Roberto Castañeda Lozano URL: https://git.openjdk.org/jdk/commit/8193e0d53ac806d6974e2aacc7b7476aeb52a5fd Stats: 957 lines in 9 files changed: 669 ins; 264 del; 24 mod 8346280: C2: implement late barrier elision for G1 Reviewed-by: tschatzl, aboldtch, mdoerr ------------- PR: https://git.openjdk.org/jdk/pull/23235 From mli at openjdk.org Tue Feb 18 10:27:16 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 18 Feb 2025 10:27:16 GMT Subject: RFR: 8349908: RISC-V: C2 SelectFromTwoVector [v2] In-Reply-To: References: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> Message-ID: On Mon, 17 Feb 2025 14:36:27 GMT, Fei Yang wrote: > LGTM. Thanks for the update. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23614#issuecomment-2665202826 From mli at openjdk.org Tue Feb 18 10:27:18 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 18 Feb 2025 10:27:18 GMT Subject: RFR: 8349908: RISC-V: C2 SelectFromTwoVector [v3] In-Reply-To: References: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> <3booXWD4LZQaI9r1VftWWjuFUN71kNIrwsgNTXjSkQQ=.d6844332-2d54-4dca-a44e-4e8bf6339e5d@github.com> Message-ID: On Tue, 18 Feb 2025 10:11:12 GMT, Ludovic Henry wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> space > > src/hotspot/cpu/riscv/riscv_v.ad line 4: > >> 2: // Copyright (c) 2020, 2025, Oracle and/or its affiliates. All rights reserved. >> 3: // Copyright (c) 2020, 2023, Arm Limited. All rights reserved. >> 4: // Copyright (c) 2020, 2022, Huawei Technologies Co., Ltd. All rights reserved. > > Instead of modifying the Oracle Copyright, we should add one for Rivos. That can be done in a future PR as well. Thanks for reviewing! I've added the copyright in https://github.com/openjdk/jdk/pull/23580. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23614#discussion_r1959459349 From mli at openjdk.org Tue Feb 18 10:27:18 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 18 Feb 2025 10:27:18 GMT Subject: Integrated: 8349908: RISC-V: C2 SelectFromTwoVector In-Reply-To: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> References: <6cfYyCx9DCPGhjoVKX9oTWRs3kuDnkovGRA6FJh9lq8=.097b6dcf-1e67-4b4e-8695-001e70e8dcbb@github.com> Message-ID: On Thu, 13 Feb 2025 14:20:40 GMT, Hamlin Li wrote: > Hi, > Can you help to review the patch? > This optimization is mainly for the vector API. > > Thanks > > ## Test > > ### jtreg > test/jdk/jdk/incubator/vector/ > > ### Performance > run on bananapi > > master vs patch > > Benchmark | (size) | Mode | Cnt | Score -master | Error - master | Score - patch | Error - patch | Units | Improvement (master / patch) > -- | -- | -- | -- | -- | -- | -- | -- | -- | -- > SelectFromBenchmark.selectFromByteVector | 1024 | avgt | 10 | 26422.495 | 674.565 | 721.604 | 1.036 | ns/op | 36.616 > SelectFromBenchmark.selectFromByteVector | 2048 | avgt | 10 | 53964.411 | 1751.618 | 1385.24 | 0.956 | ns/op | 38.957 > SelectFromBenchmark.selectFromDoubleVector | 1024 | avgt | 10 | 218430.616 | 1369.409 | 7739.774 | 14.408 | ns/op | 28.222 > SelectFromBenchmark.selectFromDoubleVector | 2048 | avgt | 10 | 387889.456 | 7889.791 | 16197.77 | 66.775 | ns/op | 23.947 > SelectFromBenchmark.selectFromFloatVector | 1024 | avgt | 10 | 103483.717 | 492.525 | 3580.358 | 29.127 | ns/op | 28.903 > SelectFromBenchmark.selectFromFloatVector | 2048 | avgt | 10 | 226125.02 | 3118.836 | 7797.025 | 4.346 | ns/op | 29.001 > SelectFromBenchmark.selectFromIntVector | 1024 | avgt | 10 | 97007.999 | 2607.711 | 2898.38 | 0.833 | ns/op | 33.47 > SelectFromBenchmark.selectFromIntVector | 2048 | avgt | 10 | 222303.308 | 3096.615 | 6398.214 | 30.345 | ns/op | 34.745 > SelectFromBenchmark.selectFromLongVector | 1024 | avgt | 
10 | 245033.436 | 1652.527 | 6307.773 | 24.597 | ns/op | 38.846 > SelectFromBenchmark.selectFromLongVector | 2048 | avgt | 10 | 438503.547 | 5972.265 | 17215.996 | 167.442 | ns/op | 25.471 > SelectFromBenchmark.selectFromShortVector | 1024 | avgt | 10 | 53632.502 | 2159.433 | 1418.215 | 2.953 | ns/op | 37.817 > SelectFromBenchmark.selectFromShortVector | 2048 | avgt | 10 | 111764.327 | 1220.509 | 3061.386 | 14.716 | ns/op | 36.508 > > This pull request has now been integrated. Changeset: 885be2ef Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/885be2efa6b1359a7c7ab36882e19a7eaba77fb3 Stats: 36 lines in 1 file changed: 35 ins; 0 del; 1 mod 8349908: RISC-V: C2 SelectFromTwoVector Reviewed-by: fyang, luhenry ------------- PR: https://git.openjdk.org/jdk/pull/23614 From epeter at openjdk.org Tue Feb 18 10:29:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 10:29:12 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> References: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> Message-ID: On Tue, 18 Feb 2025 10:09:00 GMT, Roland Westrelin wrote: >> @rwestrel So we would check both, right? But is that what we want for all predicates? >> >> `C->too_many_traps(reason)` checks against `PerMethodTrapLimit`: >> >> if (trap_count(reason) >= Deoptimization::per_method_trap_limit(reason)) { >> >> >> But the `bci` check works with `PerBytecodeTrapLimit`, and it actually has a comment like this: >> >> if (md->has_trap_at(bci, m, reason) != 0) { >> // Assume PerBytecodeTrapLimit==0, for a more conservative heuristic. >> // Also, if there are multiple reasons, or if there is no per-BCI record, >> // assume the worst. >> >> So the `bci` check fails if there has been even a single trapping recorded. 
>> >> So it seems that such a change would affect the behavior in ways I cannot yet predict. >> >> What do you think? > > That code is supposed to mirror the `GraphKit::add_parse_predicate()`. It doesn't. Would you like me to fix this separately? @rwestrel do you consider that a blocking issue for this PR here? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959463556 From roland at openjdk.org Tue Feb 18 10:29:13 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 10:29:13 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> Message-ID: On Tue, 18 Feb 2025 10:25:08 GMT, Emanuel Peter wrote: >> That code is supposed to mirror the `GraphKit::add_parse_predicate()`. It doesn't. Would you like me to fix this separately? > > @rwestrel do you consider that a blocking issue for this PR here? No ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959465988 From rcastanedalo at openjdk.org Tue Feb 18 10:31:26 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 18 Feb 2025 10:31:26 GMT Subject: RFR: 8348645: IGV: visualize live ranges [v3] In-Reply-To: References: Message-ID: > This changeset extends IGV with live range visualization. It introduces live ranges as first-class IGV entities and displays them along with the control-flow graph in the CFG view. Visualizing liveness information should hopefully make C2's register allocator easier to understand, diagnose, debug, and enhance. > > Live ranges are visible in C2 phases where liveness information is available, that is, phases `Initial liveness` to `Fix up spills` at IGV print level 4 or greater. 
For example, running a debug build of the JVM as follows: > > > java -Xbatch -XX:CompileCommand=IGVPrintLevel,java.util.HashMap::newNode,4 > > > produces the following visualization for the `Initial spilling` phase: > > ![initial-spilling](https://github.com/user-attachments/assets/1ecf74f5-92a8-4866-b1ec-2323bb0c428e) > > Live ranges are first-class IGV entities, meaning that the user can: > > - search, select, and extract them; > > ![search-extract](https://github.com/user-attachments/assets/8e0dfa59-457f-49cb-b2b5-1d202301c79d) > > - examine their properties in the `Properties` window or via tooltips; > > ![properties](https://github.com/user-attachments/assets/68d2d23b-b986-4d2e-835c-b661bce0de23) > > - navigate to related IGV entities via a pop-up menu; and > > ![popup](https://github.com/user-attachments/assets/21de2fef-d36a-42d5-b828-2696d87a18ea) > > - program filters that act on them according to their properties. > > ![filters](https://github.com/user-attachments/assets/e993b067-d0b8-452c-a885-c4e601e31e1c) > > Live ranges are connected to nodes by a use-def relation: a node can define zero or one live ranges, and use multiple live ranges; a live range can be defined and used by multiple nodes. Consequently, a live range in IGV is visible if and only if all its related nodes are visible (fully or semi-transparently). Generally, the start and end of a live range are vertically aligned with the nodes that first define and last use the live range. To reflect accurately the semantics of Phi nodes w.r.t. liveness, the visualization treats live ranges related by Phi nodes specially: live ranges used by a Phi node end at the bottom of the corresponding predecessor basic blocks, whereas live ranges defined by a Phi node start at the top of the node's basic block. 
The following screenshot shows an example of a Phi node (`48 Phi`) joining live ranges `L8` and `L13` into `L15`: > > ![phi](https://github.com/user-attachments/assets/0ef8aa1d-523d-4391-982e-6b74c2016a3c) > > The changeset extends the IGV graph printing logic in HotSpot t... Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: Remove unnecessary whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23558/files - new: https://git.openjdk.org/jdk/pull/23558/files/00169223..08ee449e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23558&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23558&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23558.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23558/head:pull/23558 PR: https://git.openjdk.org/jdk/pull/23558 From rcastanedalo at openjdk.org Tue Feb 18 10:31:27 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 18 Feb 2025 10:31:27 GMT Subject: RFR: 8348645: IGV: visualize live ranges [v2] In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 09:48:41 GMT, Andrey Turbanov wrote: >> Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision: >> >> Check if live range widgets actually exist for computing block visibility > > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/ShowLiveRangesAction.java line 34: > >> 32: import org.openide.util.ImageUtilities; >> 33: >> 34: public class ShowLiveRangesAction extends AbstractAction implements PropertyChangeListener { > > Suggestion: > > public class ShowLiveRangesAction extends AbstractAction implements PropertyChangeListener { Done (commit 08ee449e), thanks. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23558#discussion_r1959466940 From shade at openjdk.org Tue Feb 18 11:15:00 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 18 Feb 2025 11:15:00 GMT Subject: RFR: 8350211: CTW: Attempt to preload all classes in constant pool Message-ID: CTW runners do preloading for constant pools ahead of time. I believe this is done to expose more loaded classes to the compilations, so to extend the compilation scope. Unfortunately, current code catches the first exception when loading the constant pool and stops preloading. This routinely happens when CTW runner processes a 3rd party JAR, where dependencies might normally be in other JARs. I believe we should attempt to resolve all constant pool entries when preloading is requested. This would likely expand the scope of CTW testing. Additional testing: - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` still passes ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/23673/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23673&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350211 Stats: 12 lines in 1 file changed: 4 ins; 5 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23673.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23673/head:pull/23673 PR: https://git.openjdk.org/jdk/pull/23673 From mli at openjdk.org Tue Feb 18 11:15:48 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 18 Feb 2025 11:15:48 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v2] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch? > > Currently, `string_compare` code is a bit complicated, main reasons include: > 1. it has 2 piece of code respectively for LU and UL case, this is not necessary, basically LU and UL behaviour quite similar. > 2. it mixed LL/UU and LU/UL case together, better to separate them, as they are quite different from each other. 
> > This is not good for code reading and maintaining. > > > So, this patch does following refactoring: > 1. merge LU and UL code into one, i.e. remove UL code. > 2. separate the code into 2 methods: LL/UU and LU/UL. > 3. some other misc improvement. > > I could do the following refactoring in another following pr, as in this pr I'm just moving code and removing code, it's easier to do it and review it. In particular the first one, as it needs to rewrite the existing code for UL/LU case. > 1. move alignment code of `generate_compare_long_string_different_encoding` upwards to `string_compare_long_LU`. > 2. make `SHORT_STRING` case simpler. > > > > Thanks Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: fix temp registers; move code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23633/files - new: https://git.openjdk.org/jdk/pull/23633/files/eaa34661..543c8635 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23633&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23633&range=00-01 Stats: 77 lines in 2 files changed: 36 ins; 18 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/23633.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23633/head:pull/23633 PR: https://git.openjdk.org/jdk/pull/23633 From mli at openjdk.org Tue Feb 18 11:15:48 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 18 Feb 2025 11:15:48 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 14:32:07 GMT, Hamlin Li wrote: > Hi, > Can you help to review this patch? > > Currently, `string_compare` code is a bit complicated, main reasons include: > 1. it has 2 pieces of code respectively for LU and UL case, this is not necessary, basically LU and UL behaviour quite similar. > 2. it mixed LL/UU and LU/UL case together, better to separate them, as they are quite different from each other. > > This is not good for code reading and maintaining. 
> > > So, this patch does following refactoring: > 1. merge LU and UL code into one, i.e. remove UL code. > 2. separate the code into 2 methods: LL/UU and LU/UL. > 3. some other misc improvement. > > I could do the following refactoring in another following pr, as in this pr I'm just moving code and removing code, it's easier to do it and review it. In particular the first one, as it needs to rewrite the existing code for UL/LU case. > 1. move alignment code of `generate_compare_long_string_different_encoding` upwards to `string_compare_long_LU`. > 2. make `SHORT_STRING` case simpler. > > > > Thanks Thanks for catching this and for the suggestion; I'll modify it accordingly. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23633#issuecomment-2665333128 From aph at openjdk.org Tue Feb 18 11:22:27 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 18 Feb 2025 11:22:27 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v9] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: <4Bv_2pFcTm5UCdSvlOq2igoZa8Ccjl7z44iYFjic3J0=.38e8c15f-27c3-4f56-bda2-72236809c5b6@github.com> On Thu, 13 Feb 2025 01:32:04 GMT, Boris Ulasevich wrote: > > Do you want compressed OOPs to be moved out of CodeCache as well as uncompressed OOPs? If so, you should change `loadConNNode` in C2. > > I have moved the OOPs table out of the CodeCache, but its contents remain unchanged - it still holds compressed or uncompressed pointers in the CodeHeap. As I understand it, I only need to adjust how OOPs are accessed without modifying anything else. With ShenandoahGC, I pass jtreg tests with UseCompressedOops both enabled and disabled, so there doesn't seem to be any issue. Please let me know if I'm mistaken. No, you're right. I misunderstood the scope of this PR. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2665351249 From duke at openjdk.org Tue Feb 18 11:38:18 2025 From: duke at openjdk.org (simon) Date: Tue, 18 Feb 2025 11:38:18 GMT Subject: RFR: 8349180: Remove redundant initialization in ciField constructor In-Reply-To: References: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> Message-ID: On Tue, 18 Feb 2025 08:56:42 GMT, Christian Hagedorn wrote: >> In `ciField`'s ctor, `_name` is initialized twice. I think we can indeed apply the suggested fix and remove the second assignment. `_name` is set correctly the first time (and without the useless cast), and not modified in between. >> >> Thanks, >> Marc > >> Hello @marc-chevalier! I have already open a PR for this matter. PR is #23480. > > Hi @gustavosimon, the JBS issue was already assigned to @marc-chevalier. If you intend to work on an issue, please check the following: > - The issue is already assigned in JBS? > - Reach out to the assignee and ask if the person is currently working on the issue or has intentions to do so. If not, they can reassign it to you or someone else on your behalf (if you don't have a JBS account). > - The issue is unassigned in JBS? > - Assign the issue to yourself. > - If you don't have a JBS account: Reach out to someone who can assign it to him/herself on your behalf. > > > This avoids "stealing" work that was in progress or planned to do later or even worse doing completely duplicated work which is unfortunate. @chhagedorn Got it. Actually, when I started working on this, the issue was unassigned. I will ask to @RealCLanger to assign it to me next times. Can you review my OCA verification? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23637#issuecomment-2665410365 From fyang at openjdk.org Tue Feb 18 11:39:11 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 18 Feb 2025 11:39:11 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v5] In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 10:24:32 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch to implement MulReductionVI/MulReductionVL/MulReductionVF/MulReductionVD? >> This optimization is mainly for the vector API. >> On riscv, there is no straightforward instructions to do it, but we can do it with a reduction tree, which could reduce the time complexity to lg(N). >> >> >> Thanks >> >> ## Test >> >> ### jtreg >> test/jdk/jdk/incubator/vector/ >> >> ### Performance >> >> run on bananapi >> >> master vs patch >> >> Benchmark | (size) | Mode | Cnt | Score - master | Error - patch | Score - patch | Error - patch | Units | Improvement >> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- >> ByteMaxVector.MULLanes | 1024 | avgt | 10 | 11170.052 | 499.676 | 1294.424 | 8.346 | ns/op | 88.40% >> ByteMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 12514.587 | 55.06 | 1413.028 | 0.259 | ns/op | 88.70% >> DoubleMaxVector.MULLanes | 1024 | avgt | 10 | 57672.51 | 1750.417 | 4775.633 | 1.454 | ns/op | 91.70% >> DoubleMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63656.523 | 1063.048 | 5899.259 | 1.692 | ns/op | 90.70% >> FloatMaxVector.MULLanes | 1024 | avgt | 10 | 30997.218 | 728.73 | 2473.069 | 5.84 | ns/op | 92.00% >> FloatMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 35515.329 | 227.873 | 3284.17 | 0.608 | ns/op | 90.80% >> IntMaxVector.MULLanes | 1024 | avgt | 10 | 31130.453 | 878.261 | 3304.118 | 5.96 | ns/op | 89.40% >> IntMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 36851.976 | 394.001 | 3969.407 | 0.511 | ns/op | 89.20% >> LongMaxVector.MULLanes | 1024 | avgt | 10 | 58795.752 | 1030.985 | 6883.995 | 3.15 | ns/op | 88.30% >> LongMaxVector.MULMaskedLanes | 1024 | avgt | 
10 | 63642.904 | 386.521 | 7892.735 | 9.359 | ns/op | 87.60% >> ShortMaxVector.MULLanes | 1024 | avgt | 10 | 16857.441 | 762.428 | 2287.141 | 0.186 | ns/op | 86.40% >> ShortMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 21171.375 | 74.684 | 2532.913 | 0.274 | ns/op | 88.00% >> >> > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > copyright src/hotspot/cpu/riscv/riscv_v.ad line 2494: > 2492: > 2493: instruct reduce_mulF(fRegF dst, fRegF fsrc, vReg vsrc, > 2494: vReg tmp1, vReg tmp2) %{ Seems to me that we need a `predicate(!n->as_Reduction()->requires_strict_order())` for all these newly-added floating-point reduce multiply matched rules. The reason is that your `reduce_mul_fp_v` doesn't respect the order of the operands. And we should better rename the name of the match rules to something like `unordered_reduce_mulF`. Reference: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L5281 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1959566012 From rcastanedalo at openjdk.org Tue Feb 18 12:13:45 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 18 Feb 2025 12:13:45 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v4] In-Reply-To: References: Message-ID: <2HA3PtprmSP9ymdI0ZmaZvbATXe26DpdICQw0LnZvUY=.67910aa3-44fc-420b-a464-8b00f48ed536@github.com> > This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node. This information can be useful e.g. in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). Here is an example of a memory subgraph with the extended "Show types" filter enabled: > > ![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717) > > #### Testing > > - tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). 
> > - Tested IGV manually on a few selected graphs. Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`). Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: Remove redundant 'alias_field' property ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23621/files - new: https://git.openjdk.org/jdk/pull/23621/files/af144195..cd2645cf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23621&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23621&range=02-03 Stats: 6 lines in 1 file changed: 0 ins; 4 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23621.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23621/head:pull/23621 PR: https://git.openjdk.org/jdk/pull/23621 From rcastanedalo at openjdk.org Tue Feb 18 12:17:11 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 18 Feb 2025 12:17:11 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v3] In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 08:41:55 GMT, Christian Hagedorn wrote: >> Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: >> >> Dump alias type information for each node > > src/hotspot/share/opto/idealGraphPrinter.cpp line 555: > >> 553: field->print_name_on(&field_stream); >> 554: print_prop("alias_field", field_stream.freeze()); >> 555: } > > I'm not sure if this is really required. 
There is already the "source" and "destination" dump for loads and stores, respectively: > > https://github.com/openjdk/jdk/blob/3353f8e0875165adbc8ee764a4c8d8817a87cd88/src/hotspot/share/opto/idealGraphPrinter.cpp#L695-L718 > > This also shows information for array accesses. For example: > ![image](https://github.com/user-attachments/assets/0d60fe56-aee8-4450-a680-78a2ac470d4b) You are right, thanks, I overlooked the `source` and `destination` properties. Commit cd2645cf removes `alias_field`. The example visualization shown above can be obtained using something like `[idx] [name] ([alias_index] : [source][destination])` as a "Node Text" value. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23621#discussion_r1959628944 From roland at openjdk.org Tue Feb 18 12:33:12 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 12:33:12 GMT Subject: RFR: 8349139: C2: Div looses dependency on condition that guarantees divisor not null in counted loop [v2] In-Reply-To: <52OYoC5__FdcN8OLwVgdNlb6Fz_IFo8UyKy3GUp5DiM=.708f1ee8-dbbb-4abf-8de0-d94b3b1e2ef6@github.com> References: <52OYoC5__FdcN8OLwVgdNlb6Fz_IFo8UyKy3GUp5DiM=.708f1ee8-dbbb-4abf-8de0-d94b3b1e2ef6@github.com> Message-ID: On Fri, 14 Feb 2025 18:24:25 GMT, Quan Anh Mai wrote: > For this bug, I think a more general fix is to try to compare the type of the `Phi` with that of the input it is going to be replaced with. If the former is not wider than the latter then we add a `CastNode`, since the cast is only about value range, not strict dependency, we can use `CarryDependency` instead of `UnconditionalDependency`. Am I right? I don't think so. 
Consider this test case:

public class TestMainLoopNoBackedgeFloatingDiv {
    public static void main(String[] args) {
        for (int i = 0; i < 20_000; i++) {
            test1(1000, 0, false);
            test1Helper(1000, 0, false, false);
        }
        test1(1, 0, false);
    }

    private static int test1(int stop, int res, boolean alwaysTrueInMain) {
        stop = Integer.max(stop, 1);
        res = test1Helper(stop, res, alwaysTrueInMain, true);
        return res;
    }

    private static int test1Helper(int stop, int res, boolean alwaysTrueInMain, boolean flag) {
        for (int i = stop; i >= 1; i--) {
            res = res / i;
            if (alwaysTrueInMain) {
                break;
            }
            alwaysTrueInMain = flag;
        }
        return res;
    }
}

It reproduces this issue and is actually a better test case because it doesn't even need `StressGCM`:

All we know about the loop `Phi` is that it's expected to be in `[1, max]`. That allows the removal of the control dependency on the `Div` node. Pre/main/post loops are then created and, because `alwaysTrueInMain` is true in the main and post loops, the main and post loops lose their backedge. All that's left from their loop body is:

    res = res / (i-1);

(one copy for each one). The 2 `Div` nodes are commoned and placed above the checks that were added to guard the main and post loops, so the resulting `Div` is always executed on exit from the pre loop. Depending on the value of `stop`, the test may execute the pre loop and what's left of the main or post loops, or only the pre loop. With `stop = 1`, only the pre loop needs to be executed, but the `Div` is executed anyway and the crash occurs.

The type of the input to the main loop `Phi` is the same as the type of the loop `Phi` in this case.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23617#issuecomment-2665574210 From dlunden at openjdk.org Tue Feb 18 12:44:31 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 18 Feb 2025 12:44:31 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v13] In-Reply-To: References: <-KC83AQUZqpnDH0lG1yvKaVUV3H5kSN8cQLU1x4e06o=.dbe58228-89fe-4128-9cdb-3953d9215d59@github.com> Message-ID: On Tue, 11 Feb 2025 18:51:10 GMT, Daniel Lundén wrote: >>> It does look like the problematic memory subgraph results due to loop peeling >> >> OK, that sounds promising! Maybe it is indeed possible to make peeling/cloning maintain our invariant right from the start, and hope (and verify) it is not broken by other transformations. Up to you whether to integrate this point fix and continue your investigation separately or wait until you have explored along this line before integration. >> >>> Sounds like a great idea, but I think we need to discuss the details further first. It is not quite clear to me yet what it is we want to assert. >> >> Right, the details are not obvious to me either, it would probably require some exploration before we can formalize what it is exactly that we want to verify, since there is no specification (as far as I know) of what is expected for the memory subgraph in terms of liveness and interference. > >> Up to you whether to integrate this point fix and continue your investigation separately or wait until you have explored along this line before integration. > > I have considered and implemented a couple of alternative fixes today, but they are not really more elegant than the fix in this PR. If I want to fix the memory graph at loop cloning, what I'm really doing is duplicating the Phi idealization that we already have. 
So, then I think it would make most sense to work out the combinatorial issues with option 2 that I posted above (for making the Phi idealization less restrictive). I'm leaning towards integrating this for now, but will explore a bit further first. > >> Right, the details are not obvious to me either, it would probably require some exploration before we can formalize what it is exactly that we want to verify, since there is no specification (as far as I know) of what is expected for the memory subgraph in terms of liveness and interference. > > Let's discuss this offline! I have managed to make the split-through-MergeMem Phi idealization discussed above less conservative. From the perspective of the current issue, it is still a point fix, but a more elegant one. Because of this, I'm closing this PR and will open a new PR with the new fix. Thanks for the reviews again. I'll ping everyone involved in the new PR as well. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22852#discussion_r1959674237 From dlunden at openjdk.org Tue Feb 18 12:44:31 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 18 Feb 2025 12:44:31 GMT Subject: Withdrawn: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges In-Reply-To: References: Message-ID: On Fri, 20 Dec 2024 18:26:55 GMT, Daniel Lundén wrote: > When searching for load anti dependences in GCM, it is not always sufficient to just search starting at the direct initial memory input to the load. Specifically, there are cases when we must also search for anti dependences starting at relevant Phi memory nodes in between the load's early block and the initial memory input's block. Here, "in between" refers to blocks in the dominator tree in between the early and initial memory blocks. > > #### Example 1 > > Consider the ideal graph below. 
The initial memory for 183 loadI is 107 Phi and there is an important anti dependency for node 64 membar_release. To discover this anti dependency, we must rather search from 119 Phi which contains overlapping memory slices with 107 Phi. Looking at the ideal graph block view, we see that both 107 Phi and 119 Phi are in the initial memory block (B7) and thus dominate the early block (B20). If we only search from 107 Phi, we fail to add the anti dependency to 64 membar_release and do not force the load to schedule before 64 membar_release as we should. In the block view, we see that the load is actually scheduled in B24 _after_ a number of anti-dependent stores, the first of which is in block B20 (corresponding to the anti dependency on 64 membar_release). The result is the failure we see in this issue (we load the wrong value). > > ![failure-graph-1](https://github.com/user-attachments/assets/e5458646-7a5c-40e1-b1d8-e3f101e29b73) > ![failure-blocks-1](https://github.com/user-attachments/assets/a0b1f724-0809-4b2f-9feb-93e9c59a5d6a) > > #### Example 2 > > There are also situations when we need to start searching from Phis that are strictly in between the initial memory block and early block. Consider the ideal graph below. The initial memory for 100 loadI is 18 MachProj, but we also need to search from 76 Phi to find that we must raise the LCA to the last block on the path between 76 Phi and 75 Phi: B9 (= the load's early block). If we do not search from 76 Phi, the load is again likely scheduled too late (in B11 in the example) after anti-dependent stores (the first of which corresponds to 58 membar_release in B10). Note that the block B6 for 76 Phi is strictly dominated by the initial memory block B2 and also strictly dominates the early block B9. 
> > ![failure-graph-2](https://github.com/user-attachments/assets/ede0c299-6251-4ff8-8b84-af40a1ee9e8c) > ![failure-blocks-2](https://github.com/user-attachments/assets/e5a87e43-b6fe-4fa3-8961-54752f63633e) > > ### Changeset > > - Update `PhaseCFG::insert... This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/22852 From roland at openjdk.org Tue Feb 18 13:05:23 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 13:05:23 GMT Subject: RFR: 8347040: C2: assert(!loop->_body.contains(in)) failed Message-ID: <6N-KwwoN-BWfSym13VC96nbLEr13eplzOO0S-s78Ihs=.c8e80a48-3c5f-45ca-ad22-f7b2bda3b48c@github.com> `OuterStripMinedLoopNode::transform_to_counted_loop()` merges the outer strip mined loop and the inner loop into a single loop. To achieve that, it needs to append the nodes of the outer strip mined loop to the body of the inner loop. To make sure each of these nodes is appended only once, a `Unique_Node_List` is used: nodes found by following the safepoint's inputs are first enqueued into the list and then, each unique node should be added to the loop body. That's not what the current code does, though, because it enqueues the nodes it finds to the list and add them to the loop body right away. 
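The difference between "append while discovering" and "collect unique nodes first, then append" can be illustrated with a minimal Java sketch. This is not the HotSpot code: `Node`, `diamond`, `appendEagerly`, and `collectThenAppend` are hypothetical names, a `LinkedHashSet` merely stands in for `Unique_Node_List`, and the diamond-shaped graph is an assumed example of a node reachable through two inputs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in HotSpot.
final class UniqueWorklistSketch {
    static final class Node {
        final String name;
        final List<Node> inputs = new ArrayList<>();

        Node(String name, Node... inputs) {
            this.name = name;
            this.inputs.addAll(List.of(inputs));
        }
    }

    // Assumed example graph: "c" is reachable from the root via both "a" and "b".
    static Node diamond() {
        Node c = new Node("c");
        Node a = new Node("a", c);
        Node b = new Node("b", c);
        return new Node("safepoint", a, b);
    }

    // Appends every input as soon as it is found, so a shared input
    // ends up in the body once per path that reaches it.
    static List<String> appendEagerly(Node root) {
        List<String> body = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            for (Node in : n.inputs) {
                body.add(in.name);
                stack.push(in);
            }
        }
        return body;
    }

    // First collects the reachable inputs into an insertion-ordered set,
    // then appends each unique node to the body exactly once.
    static List<String> collectThenAppend(Node root) {
        Set<Node> unique = new LinkedHashSet<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            for (Node in : n.inputs) {
                if (unique.add(in)) {
                    stack.push(in);
                }
            }
        }
        List<String> body = new ArrayList<>();
        for (Node n : unique) {
            body.add(n.name);
        }
        return body;
    }

    public static void main(String[] args) {
        Node root = diamond();
        System.out.println(appendEagerly(root));     // [a, b, c, c]
        System.out.println(collectThenAppend(root)); // [a, b, c]
    }
}
```

In the sketch, the shared node `c` appears twice in the eagerly built body and once in the body built from the unique set, which is the property a loop body list needs.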
------------- Commit messages: - fix & test Changes: https://git.openjdk.org/jdk/pull/23676/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23676&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8347040 Stats: 59 lines in 2 files changed: 57 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23676.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23676/head:pull/23676 PR: https://git.openjdk.org/jdk/pull/23676 From dfenacci at openjdk.org Tue Feb 18 13:47:12 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 18 Feb 2025 13:47:12 GMT Subject: RFR: 8341976: C2: use_mem_state != load->find_exact_control(load->in(0)) assert failure [v2] In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 08:45:34 GMT, Roland Westrelin wrote: >> The `arraycopy` writes to a non escaping array so its `ArrayCopy` node >> is marked as having a narrow memory effect. One of the loads from the >> destination after the copy is transformed into a load from the source >> array (the rationale being that if there's no load from the >> destination of the copy, the `arraycopy` is not needed). The load from >> the source has the input memory state of the `ArrayCopy` as memory >> input. That load is then sunk out of the loop and its control is >> updated to be after the `ArrayCopy`. That's legal because the >> `ArrayCopy` only has a narrow memory effect and can't modify the >> source. The `ArrayCopy` can't be eliminated and is expanded. In the >> process, a `MemBar` that has a wide memory effect is added. The load >> from the source has control after the membar but memory state before >> and because the membar has a wide memory effect, the load is anti >> dependent on the membar: the graph is broken (the load can't be pinned >> after the membar and anti dependent on it). 
>> >> In short, the problem is that the graph is transformed under the >> assumption that the `ArrayCopy` has a narrow effect but the >> `ArrayCopy` is expanded to a subgraph that has a wide memory >> effect. The fix I propose is to not insert a membar with a wide memory >> effect. We still need a membar when the destination is non escaping >> because the expanded `ArrayCopy`, if it writes to a tighly allocated >> array, writes to raw memory and not to the destination memory slice. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Thanks for the fix @rwestrel. I noticed that the test `compiler/arraycopy/TestArrayCopyOverflowInBoundChecks.java` seems to be failing on linux x64 with the same `assert(use_mem_state != load->find_exact_control(load->in(0))) failed: dependence cycle found` error **after** your fix: # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (src/hotspot/share/opto/gcm.cpp:916), pid=1506939, tid=1506963 # assert(use_mem_state != load->find_exact_control(load->in(0))) failed: dependence cycle found # Current CompileTask: C2:293 90 b 4 compiler.arraycopy.TestArrayCopyOverflowInBoundChecks::test (19 bytes) Stack: [0x000076bd9f800000,0x000076bd9f900000], sp=0x000076bd9f8fb560, free space=1005k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0xde84e9] PhaseCFG::insert_anti_dependences(Block*, Node*, bool)+0x1f69 (gcm.cpp:916) V [libjvm.so+0xde8fa9] PhaseCFG::schedule_late(VectorSet&, Node_Stack&)+0x8c9 (gcm.cpp:1535) V [libjvm.so+0xde9903] PhaseCFG::global_code_motion()+0x403 (gcm.cpp:1649) V [libjvm.so+0xdea676] PhaseCFG::do_global_code_motion()+0x66 (gcm.cpp:1778) V [libjvm.so+0xa55a00] Compile::Code_Gen()+0x3c0 (compile.cpp:2952) V [libjvm.so+0xa5899f] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1c9f (compile.cpp:881) V [libjvm.so+0x8a40b5] 
C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1d5 (c2compiler.cpp:141) V [libjvm.so+0xa65068] CompileBroker::invoke_compiler_on_method(CompileTask*)+0x928 (compileBroker.cpp:2317) V [libjvm.so+0xa65da8] CompileBroker::compiler_thread_loop()+0x528 (compileBroker.cpp:1975) V [libjvm.so+0xf2a1fe] JavaThread::thread_main_inner()+0xee (javaThread.cpp:776) V [libjvm.so+0x187dbb6] Thread::call_run()+0xb6 (thread.cpp:231) V [libjvm.so+0x1556338] thread_native_entry(Thread*)+0x128 (os_linux.cpp:877) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23465#issuecomment-2665761027 From dnsimon at openjdk.org Tue Feb 18 14:15:55 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 18 Feb 2025 14:15:55 GMT Subject: RFR: 8350263: JvmciNotifyBootstrapFinishedEventTest intermittently times out Message-ID: [JDK-8346781](https://bugs.openjdk.org/browse/JDK-8346781) changed `JVMCIServiceLocator` such that set of providers is computed eagerly in `JVMCIServiceLocator.`. There are some JVMCI test classes that directly subclassed `JVMCIServiceLocator` which meant deadlock could occur between the main thread running the test and a JVMCI compiler thread. This PR fixes this by ensuring that `JVMCIServiceLocator` providers are separate classes from the main test class. 
------------- Commit messages: - avoid JVMCIServiceLocator related deadlock Changes: https://git.openjdk.org/jdk/pull/23679/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23679&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350263 Stats: 60 lines in 4 files changed: 21 ins; 12 del; 27 mod Patch: https://git.openjdk.org/jdk/pull/23679.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23679/head:pull/23679 PR: https://git.openjdk.org/jdk/pull/23679 From yzheng at openjdk.org Tue Feb 18 14:27:16 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 18 Feb 2025 14:27:16 GMT Subject: RFR: 8350263: JvmciNotifyBootstrapFinishedEventTest intermittently times out In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 14:11:04 GMT, Doug Simon wrote: > [JDK-8346781](https://bugs.openjdk.org/browse/JDK-8346781) changed `JVMCIServiceLocator` such that set of providers is computed eagerly in `JVMCIServiceLocator.`. There are some JVMCI test classes that directly subclassed `JVMCIServiceLocator` which meant deadlock could occur between the main thread running the test and a JVMCI compiler thread. This PR fixes this by ensuring that `JVMCIServiceLocator` providers are separate classes from the main test class. test/hotspot/jtreg/compiler/jvmci/TestUncaughtErrorInCompileMethod.java line 89: > 87: "-XX:+UnlockExperimentalVMOptions", > 88: "-XX:+UseJVMCICompiler", "-Djvmci.Compiler=ErrorCompiler", > 89: "-XX:-UseJVMCINativeLibrary", do we support this in a jdk build? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23679#discussion_r1959845541 From dnsimon at openjdk.org Tue Feb 18 14:31:10 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 18 Feb 2025 14:31:10 GMT Subject: RFR: 8350263: JvmciNotifyBootstrapFinishedEventTest intermittently times out In-Reply-To: References: Message-ID: <_W-SlkWqnlGecmjOwEljPAfoSHp4CpwgnkjgJotYfGE=.b361c87e-3609-4011-9f21-e0c3b9b70b4c@github.com> On Tue, 18 Feb 2025 14:20:02 GMT, Yudi Zheng wrote: >> [JDK-8346781](https://bugs.openjdk.org/browse/JDK-8346781) changed `JVMCIServiceLocator` such that set of providers is computed eagerly in `JVMCIServiceLocator.`. There are some JVMCI test classes that directly subclassed `JVMCIServiceLocator` which meant deadlock could occur between the main thread running the test and a JVMCI compiler thread. This PR fixes this by ensuring that `JVMCIServiceLocator` providers are separate classes from the main test class. > > test/hotspot/jtreg/compiler/jvmci/TestUncaughtErrorInCompileMethod.java line 89: > >> 87: "-XX:+UnlockExperimentalVMOptions", >> 88: "-XX:+UseJVMCICompiler", "-Djvmci.Compiler=ErrorCompiler", >> 89: "-XX:-UseJVMCINativeLibrary", > > do we support this in a jdk build? Not sure I understand the question: `UseJVMCINativeLibrary` is a JVMCI flag just like all other JVMCI flags. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23679#discussion_r1959863432 From mli at openjdk.org Tue Feb 18 14:33:52 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 18 Feb 2025 14:33:52 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v6] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch to implement MulReductionVI/MulReductionVL/MulReductionVF/MulReductionVD? > This optimization is mainly for the vector API. > On riscv, there is no straightforward instructions to do it, but we can do it with a reduction tree, which could reduce the time complexity to lg(N). 
> > > Thanks > > ## Test > > ### jtreg > test/jdk/jdk/incubator/vector/ > > ### Performance > > run on bananapi > > master vs patch > > Benchmark | (size) | Mode | Cnt | Score - master | Error - patch | Score - patch | Error - patch | Units | Improvement > -- | -- | -- | -- | -- | -- | -- | -- | -- | -- > ByteMaxVector.MULLanes | 1024 | avgt | 10 | 11170.052 | 499.676 | 1294.424 | 8.346 | ns/op | 88.40% > ByteMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 12514.587 | 55.06 | 1413.028 | 0.259 | ns/op | 88.70% > DoubleMaxVector.MULLanes | 1024 | avgt | 10 | 57672.51 | 1750.417 | 4775.633 | 1.454 | ns/op | 91.70% > DoubleMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63656.523 | 1063.048 | 5899.259 | 1.692 | ns/op | 90.70% > FloatMaxVector.MULLanes | 1024 | avgt | 10 | 30997.218 | 728.73 | 2473.069 | 5.84 | ns/op | 92.00% > FloatMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 35515.329 | 227.873 | 3284.17 | 0.608 | ns/op | 90.80% > IntMaxVector.MULLanes | 1024 | avgt | 10 | 31130.453 | 878.261 | 3304.118 | 5.96 | ns/op | 89.40% > IntMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 36851.976 | 394.001 | 3969.407 | 0.511 | ns/op | 89.20% > LongMaxVector.MULLanes | 1024 | avgt | 10 | 58795.752 | 1030.985 | 6883.995 | 3.15 | ns/op | 88.30% > LongMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63642.904 | 386.521 | 7892.735 | 9.359 | ns/op | 87.60% > ShortMaxVector.MULLanes | 1024 | avgt | 10 | 16857.441 | 762.428 | 2287.141 | 0.186 | ns/op | 86.40% > ShortMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 21171.375 | 74.684 | 2532.913 | 0.274 | ns/op | 88.00% > > Hamlin Li has updated the pull request incrementally with two additional commits since the last revision: - fix unordered - fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23580/files - new: https://git.openjdk.org/jdk/pull/23580/files/f4c24e69..b6882221 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23580&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23580&range=04-05 Stats: 25 lines 
in 1 file changed: 9 ins; 0 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/23580.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23580/head:pull/23580 PR: https://git.openjdk.org/jdk/pull/23580 From mli at openjdk.org Tue Feb 18 14:33:52 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 18 Feb 2025 14:33:52 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v5] In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 11:35:43 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> copyright > > src/hotspot/cpu/riscv/riscv_v.ad line 2494: > >> 2492: >> 2493: instruct reduce_mulF(fRegF dst, fRegF fsrc, vReg vsrc, >> 2494: vReg tmp1, vReg tmp2) %{ > > Seems to me that we need a `predicate(!n->as_Reduction()->requires_strict_order())` for all these newly-added floating-point reduce multiply matched rules. The reason is that your `reduce_mul_fp_v` doesn't respect the order of the operands. And we should better rename the name of the match rules to something like `unordered_reduce_mulF`. > > Reference: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L5281 Yes, you're right, fixed. Thanks! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1959866663 From yzheng at openjdk.org Tue Feb 18 14:37:13 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 18 Feb 2025 14:37:13 GMT Subject: RFR: 8350263: JvmciNotifyBootstrapFinishedEventTest intermittently times out In-Reply-To: <_W-SlkWqnlGecmjOwEljPAfoSHp4CpwgnkjgJotYfGE=.b361c87e-3609-4011-9f21-e0c3b9b70b4c@github.com> References: <_W-SlkWqnlGecmjOwEljPAfoSHp4CpwgnkjgJotYfGE=.b361c87e-3609-4011-9f21-e0c3b9b70b4c@github.com> Message-ID: On Tue, 18 Feb 2025 14:28:58 GMT, Doug Simon wrote: >> test/hotspot/jtreg/compiler/jvmci/TestUncaughtErrorInCompileMethod.java line 89: >> >>> 87: "-XX:+UnlockExperimentalVMOptions", >>> 88: "-XX:+UseJVMCICompiler", "-Djvmci.Compiler=ErrorCompiler", >>> 89: "-XX:-UseJVMCINativeLibrary", >> >> do we support this in a jdk build? > > Not sure I understand the question: `UseJVMCINativeLibrary` is a JVMCI flag just like all other JVMCI flags. IIUC `-UseJVMCINativeLibrary` will not use `libjvmcicompiler.so` but we don't include any Graal java class in `jmods/jdk.graal.compiler.jmod`, then what does it run? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23679#discussion_r1959876342 From dnsimon at openjdk.org Tue Feb 18 14:40:09 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 18 Feb 2025 14:40:09 GMT Subject: RFR: 8350263: JvmciNotifyBootstrapFinishedEventTest intermittently times out In-Reply-To: References: <_W-SlkWqnlGecmjOwEljPAfoSHp4CpwgnkjgJotYfGE=.b361c87e-3609-4011-9f21-e0c3b9b70b4c@github.com> Message-ID: <_CLZ0-JQgZWSio67jOWlbcwdhY-7hWpkodvWf9c0BKQ=.aaa06d53-cd06-4a96-851d-e8cd66eec6c8@github.com> On Tue, 18 Feb 2025 14:34:11 GMT, Yudi Zheng wrote: >> Not sure I understand the question: `UseJVMCINativeLibrary` is a JVMCI flag just like all other JVMCI flags. 
> > IIUC `-UseJVMCINativeLibrary` will not use `libjvmcicompiler.so` but we don't include any Graal java class in `jmods/jdk.graal.compiler.jmod`, then what does it run? The `-Djvmci.Compiler=ErrorCompiler` option specifies a different compiler implementation. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23679#discussion_r1959883507 From shade at openjdk.org Tue Feb 18 14:42:10 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 18 Feb 2025 14:42:10 GMT Subject: RFR: 8350159: compiler/tiered/Level2RecompilationTest.java fails after JDK-8349915 In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 17:04:41 GMT, Aleksey Shipilev wrote: > Recently added hunk in `CompilationPolicy::selected_task` was supposed to target CTW runs, that wanted to omit any level changes. But there are tests that _do test_ level changes, and they submit `Whitebox` requests. One of those tests is `compiler/tiered/Level2RecompilationTest.java`. So it looks like we need to disambiguate the "CTW" uses and "general Whitebox" uses. > > Looks like checking for `-Xbatch` does the trick for CTW. It is not super-clean, but it works, and it matches other exceptions in around compilation policy, e.g. when checking for `-Xcomp`, etc. > > Additional testing: > - [x] Linux AArch64 server fastdebug, `compiler/tiered/Level2RecompilationTest.java` now passes > - [x] Linux AArch64 server fastdebug, CTW tests still work fine > - [x] Linux AArch64 server fastdebug, `all` Passes `all` tests for me here. 
So I think we are ready to integrate, as soon as another Reviewer acks :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23668#issuecomment-2665908997 From yzheng at openjdk.org Tue Feb 18 14:43:24 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 18 Feb 2025 14:43:24 GMT Subject: RFR: 8350263: JvmciNotifyBootstrapFinishedEventTest intermittently times out In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 14:11:04 GMT, Doug Simon wrote: > [JDK-8346781](https://bugs.openjdk.org/browse/JDK-8346781) changed `JVMCIServiceLocator` such that set of providers is computed eagerly in `JVMCIServiceLocator.`. There are some JVMCI test classes that directly subclassed `JVMCIServiceLocator` which meant deadlock could occur between the main thread running the test and a JVMCI compiler thread. This PR fixes this by ensuring that `JVMCIServiceLocator` providers are separate classes from the main test class. Marked as reviewed by yzheng (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23679#pullrequestreview-2623815038 From yzheng at openjdk.org Tue Feb 18 14:43:24 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 18 Feb 2025 14:43:24 GMT Subject: RFR: 8350263: JvmciNotifyBootstrapFinishedEventTest intermittently times out In-Reply-To: <_CLZ0-JQgZWSio67jOWlbcwdhY-7hWpkodvWf9c0BKQ=.aaa06d53-cd06-4a96-851d-e8cd66eec6c8@github.com> References: <_W-SlkWqnlGecmjOwEljPAfoSHp4CpwgnkjgJotYfGE=.b361c87e-3609-4011-9f21-e0c3b9b70b4c@github.com> <_CLZ0-JQgZWSio67jOWlbcwdhY-7hWpkodvWf9c0BKQ=.aaa06d53-cd06-4a96-851d-e8cd66eec6c8@github.com> Message-ID: On Tue, 18 Feb 2025 14:37:40 GMT, Doug Simon wrote: >> IIUC `-UseJVMCINativeLibrary` will not use `libjvmcicompiler.so` but we don't include any Graal java class in `jmods/jdk.graal.compiler.jmod`, then what does it run? > > The `-Djvmci.Compiler=ErrorCompiler` option specifies a different compiler implementation. I see now. 
So `+UseJVMCINativeLibrary` overrides `-Djvmci.Compiler=ErrorCompiler` and that is why we have to explicitly specify `-UseJVMCINativeLibrary`?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23679#discussion_r1959889782

From chagedorn at openjdk.org Tue Feb 18 14:46:46 2025
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Tue, 18 Feb 2025 14:46:46 GMT
Subject: RFR: 8350197: [UBSAN] Node::dump_idx reported float-cast-overflow
In-Reply-To: 
References: 
Message-ID: 

On Mon, 17 Feb 2025 13:04:34 GMT, SendaoYan wrote:

> Hi all,
>
> The function 'Node::dump_idx(bool, outputStream*, Node::DumpConfig*)' in file src/hotspot/share/opto/node.cpp:2430 triggers "runtime error: -inf is outside the range of representable values of type 'unsigned int'" under clang17's UndefinedBehaviorSanitizer.
>
> This PR adds an extra check on the argument before passing it to `log10`. Risk is low.
>
> Additional testing:
>
> - [x] Jtreg tests (including tier1/2/3 etc.) on linux-x64 with release build
> - [x] Jtreg tests (including tier1/2/3 etc.) on linux-aarch64 with release build
> - [x] Jtreg tests (including tier1/2/3 etc.) on linux-x64 with fastdebug build
> - [x] Jtreg tests (including tier1/2/3 etc.) on linux-aarch64 with fastdebug build
>
> The code snippet below demonstrates the undefined behaviour of float-cast-overflow:
>
>
> #include <stdio.h>
> #include <math.h>
> int input = 0;
> int main()
> {
>     printf("result = %lf\n", log10((double)input));
>     printf("result = %u\n", (unsigned int)log10((double)input));
>     printf("result = %u\n", input==0 ? 0 : (unsigned int)log10((double)input));
>     return 0;
> }
>
>
>
>> clang -fsanitize=undefined log10.c -lm && ./a.out
> result = -inf
> log10.c:9:27: runtime error: -inf is outside the range of representable values of type 'unsigned int'
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior log10.c:9:27 in
> result = 0
> result = 0

Looks good!

-------------

Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23662#pullrequestreview-2623826163 From dnsimon at openjdk.org Tue Feb 18 14:51:38 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 18 Feb 2025 14:51:38 GMT Subject: RFR: 8350263: JvmciNotifyBootstrapFinishedEventTest intermittently times out In-Reply-To: References: <_W-SlkWqnlGecmjOwEljPAfoSHp4CpwgnkjgJotYfGE=.b361c87e-3609-4011-9f21-e0c3b9b70b4c@github.com> <_CLZ0-JQgZWSio67jOWlbcwdhY-7hWpkodvWf9c0BKQ=.aaa06d53-cd06-4a96-851d-e8cd66eec6c8@github.com> Message-ID: <02kc9nQVOHlcRSEmAiXqEcxo1L2wA6T0PzWX7YAtqp8=.58a3451d-7df3-442a-a32e-90d5a4c1cab1@github.com> On Tue, 18 Feb 2025 14:40:33 GMT, Yudi Zheng wrote: >> The `-Djvmci.Compiler=ErrorCompiler` option specifies a different compiler implementation. > > I see now. So `+UseJVMCINativeLibrary` overrides `-Djvmci.Compiler=ErrorCompiler` and that is why we have to explicitly specify `-UseJVMCINativeLibrary`? Yes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23679#discussion_r1959905318 From dfenacci at openjdk.org Tue Feb 18 18:56:57 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 18 Feb 2025 18:56:57 GMT Subject: RFR: 8341976: C2: use_mem_state != load->find_exact_control(load->in(0)) assert failure [v2] In-Reply-To: References: Message-ID: <_WNbZuoO2hjfrclTBVHy2hkv17-SwHdGELID-dtg58I=.70e29f75-6cb5-4e5b-8b94-493fd1200b85@github.com> On Tue, 18 Feb 2025 15:47:39 GMT, Roland Westrelin wrote: > Thanks for the report. I can't reproduce it, though. Do you pass any command line options? Nothing specific, just a simple `jtreg` command with a debug build and no extra options, i.e. `jtreg -va -jdk:../build/linux-x64-debug/jdk test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyOverflowInBoundChecks.java` on a Intel Xeon machine (with avx512) with Ubuntu. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23465#issuecomment-2666176215 From kvn at openjdk.org Tue Feb 18 19:10:12 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:10:12 GMT Subject: RFR: 8350210: CTW: Use stackless exceptions In-Reply-To: <9CmFDkEZMlMOZStFRYB-Czn6kpC4TWWgR-z_FGUIIKg=.b9d68ce9-bf31-4150-b86a-dbfd89139384@github.com> References: <9CmFDkEZMlMOZStFRYB-Czn6kpC4TWWgR-z_FGUIIKg=.b9d68ce9-bf31-4150-b86a-dbfd89139384@github.com> Message-ID: <0kfGlMp8yEpVbZoNzefx81FOOE57aiPr0bSTLUeBCbc=.33754339-b0a5-4285-81c2-21eb7e4df48b@github.com> On Tue, 18 Feb 2025 09:05:36 GMT, Aleksey Shipilev wrote: > Looking at reducing CTW costs in our infra, there are a few simple improvements we can take. > > CTW runners compiling 3rd party JARs normally catch lots of stray exceptions when trying to load non-existing classes, e.g. for resolving the static final fields, or preloading the constant pool. Generating stack traces for these take considerable time, and stack traces for those exceptions are not essential to debug CTW runs. So, we can summarily disable them. > > This has no effect on `applications/ctw/modules`. Compiling a large 3rd party JAR like `solr-core-7.4.0.jar`, for example, improves from ~15.3s to ~12.5s. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23671#pullrequestreview-2624188804 From mdoerr at openjdk.org Tue Feb 18 19:11:45 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 18 Feb 2025 19:11:45 GMT Subject: RFR: 8349727: [PPC] C1: Improve Class.isInstance intrinsic [v2] In-Reply-To: References: Message-ID: <3ajDco2b3hi4kJez0bf4igsZCrnoqbjeXaWmnmxU_P0=.7836a06c-fd7a-4a50-9492-9af1a18d7584@github.com> > PPC64 implementation of [JDK-8337251](https://bugs.openjdk.org/browse/JDK-8337251). > The new runtime stub is called like a C function. The initial version therefore used a `FunctionDescriptor` with relocation on PPC64 with ABIv1. 
I've changed that with the 3rd Commit. `rt_call` jumps directly to the entry point, now.
>
> Performance measured on Power10: `make run-test TEST="micro:SecondarySupersLookup" MICRO="VM_OPTIONS=-XX:TieredStopAtLevel=1"`
>
> Before this patch (C code)
>
> Benchmark                             Mode  Cnt   Score   Error  Units
> SecondarySupersLookup.testNegative00  avgt   15  18.570 ± 0.009  ns/op
> ...
> SecondarySupersLookup.testNegative30  avgt   15  18.566 ± 0.002  ns/op
> SecondarySupersLookup.testNegative32  avgt   15  19.177 ± 1.347  ns/op
> SecondarySupersLookup.testNegative40  avgt   15  18.569 ± 0.006  ns/op
> SecondarySupersLookup.testNegative50  avgt   15  19.207 ± 1.334  ns/op
> SecondarySupersLookup.testNegative55  avgt   15  19.708 ± 1.338  ns/op
> SecondarySupersLookup.testNegative56  avgt   15  19.132 ± 0.137  ns/op
> SecondarySupersLookup.testNegative57  avgt   15  19.133 ± 0.134  ns/op
> SecondarySupersLookup.testNegative58  avgt   15  19.772 ± 1.316  ns/op
> SecondarySupersLookup.testNegative59  avgt   15  19.109 ± 0.014  ns/op
> SecondarySupersLookup.testNegative60  avgt   15  22.381 ± 0.016  ns/op
> SecondarySupersLookup.testNegative61  avgt   15  22.331 ± 0.011  ns/op
> SecondarySupersLookup.testNegative62  avgt   15  22.352 ± 0.029  ns/op
> SecondarySupersLookup.testNegative63  avgt   15  30.371 ± 0.031  ns/op
> SecondarySupersLookup.testNegative64  avgt   15  29.927 ± 0.221  ns/op
> SecondarySupersLookup.testPositive01  avgt   15  18.571 ± 0.006  ns/op
> ...
> SecondarySupersLookup.testPositive09  avgt   15  18.599 ± 0.140  ns/op
> SecondarySupersLookup.testPositive10  avgt   15  19.210 ± 1.332  ns/op
> SecondarySupersLookup.testPositive16  avgt   15  18.603 ± 0.142  ns/op
> SecondarySupersLookup.testPositive20  avgt   15  19.210 ± 1.333  ns/op
> SecondarySupersLookup.testPositive30  avgt   15  18.600 ± 0.140  ns/op
> SecondarySupersLookup.testPositive32  avgt   15  18.637 ± 0.189  ns/op
> SecondarySupersLookup.testPositive40  avgt   15  19.137 ± 0.190  ns/op
> SecondarySupersLookup.testPositive50  avgt   15  18.567 ± 0.002  ns/op
> SecondarySupersLookup.testPositive60  avgt   15  19.069 ± 
0.004  ns/op
> SecondarySupersLookup.testPositive63  avgt   15  26.024 ± 0.017  ns/op
> SecondarySupersLookup.testPositive64  avgt   15  29.932 ± 1.002  ns/op
>
>
> After this patch (assemble...

Martin Doerr has updated the pull request incrementally with two additional commits since the last revision:

 - Usr rt_call without FunctionDescriptor.
 - Revert relocInfo changes.

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/23602/files
  - new: https://git.openjdk.org/jdk/pull/23602/files/f5f6f771..84c7aacc

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=23602&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23602&range=00-01

  Stats: 45 lines in 3 files changed: 4 ins; 27 del; 14 mod
  Patch: https://git.openjdk.org/jdk/pull/23602.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23602/head:pull/23602

PR: https://git.openjdk.org/jdk/pull/23602

From dnsimon at openjdk.org Tue Feb 18 19:12:21 2025
From: dnsimon at openjdk.org (Doug Simon)
Date: Tue, 18 Feb 2025 19:12:21 GMT
Subject: RFR: 8350263: JvmciNotifyBootstrapFinishedEventTest intermittently times out
In-Reply-To: 
References: 
Message-ID: 

On Tue, 18 Feb 2025 14:11:04 GMT, Doug Simon wrote:

> [JDK-8346781](https://bugs.openjdk.org/browse/JDK-8346781) changed `JVMCIServiceLocator` such that set of providers is computed eagerly in `JVMCIServiceLocator.`. There are some JVMCI test classes that directly subclassed `JVMCIServiceLocator` which meant deadlock could occur between the main thread running the test and a JVMCI compiler thread. This PR fixes this by ensuring that `JVMCIServiceLocator` providers are separate classes from the main test class.

Thanks for the reviews. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23679#issuecomment-2666659103 From never at openjdk.org Tue Feb 18 19:12:21 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 18 Feb 2025 19:12:21 GMT Subject: RFR: 8350263: JvmciNotifyBootstrapFinishedEventTest intermittently times out In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 14:11:04 GMT, Doug Simon wrote: > [JDK-8346781](https://bugs.openjdk.org/browse/JDK-8346781) changed `JVMCIServiceLocator` such that set of providers is computed eagerly in `JVMCIServiceLocator.`. There are some JVMCI test classes that directly subclassed `JVMCIServiceLocator` which meant deadlock could occur between the main thread running the test and a JVMCI compiler thread. This PR fixes this by ensuring that `JVMCIServiceLocator` providers are separate classes from the main test class. Marked as reviewed by never (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23679#pullrequestreview-2624667698 From roland at openjdk.org Tue Feb 18 18:56:55 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 18:56:55 GMT Subject: RFR: 8341976: C2: use_mem_state != load->find_exact_control(load->in(0)) assert failure [v2] In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 13:44:14 GMT, Damon Fenacci wrote: > I noticed that the test `compiler/arraycopy/TestArrayCopyOverflowInBoundChecks.java` seems to be failing on linux x64 with the same `assert(use_mem_state != load->find_exact_control(load->in(0))) failed: dependence cycle found` error **after** your fix: > > ``` > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (src/hotspot/share/opto/gcm.cpp:916), pid=1506939, tid=1506963 > # assert(use_mem_state != load->find_exact_control(load->in(0))) failed: dependence cycle found > # > ``` Thanks for the report. I can't reproduce it, though. Do you pass any command line options? 

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23465#issuecomment-2666109398

From bulasevich at openjdk.org Tue Feb 18 19:23:59 2025
From: bulasevich at openjdk.org (Boris Ulasevich)
Date: Tue, 18 Feb 2025 19:23:59 GMT
Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v11]
In-Reply-To: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com>
References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com>
Message-ID: <4qam3fEKtXq-7w2fYkhuojgDE73_60todL54yQPhkbQ=.fb1b5c06-73f4-44de-8d78-c26281f2761b@github.com>

> This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache.
>
> The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density.
>
> Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1-2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark.
>
> The numbers. Immutable data constitutes **~30%** of the nmethod. Mutable data constitutes **~8%** of the nmethod. Example (statistics collected on the CodeCacheStress benchmark):
> - nmethod_count:134000, total_compilation_time: 510460ms
> - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms,
> - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB
>
> Functional testing: jtreg on arm/aarch/x86.
> Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks.
>
> Alternative solution (see comments): In the future, relocations can be moved to _immutable_data.

Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 12 commits:

 - Address review comments: cleanup, move fields to avoid padding, fix CodeBlob purge to call os::free, fix nmethod::print, update Layout description
 - add a separate adrp_movk function to support targets located more than 4GB away
 - Force the use of movk in combination with adrp and ldr instructions to address scenarios where os::malloc allocates buffers beyond the typical ±4GB range accessible with adrp
 - Fixing TestFindInstMemRecursion test fail with XX:+StressReflectiveCode option: _relocation_size can exceed 64Kb, in this case _metadata_offset do not fit into int16. Fix: use _oops_size int16 field to calculate metadata offset
 - removing dead code
 - a bit of cleanup and addressing review suggestions
 - rework movoop for not_supports_instruction_patching case: correcting in ldr_constant and relocations fixup
 - remove _code_end_offset
 - update jvm.hotspot.code.CodeBlob class
 - update: mutable data for all CodeBlobs with relocations
 - ... and 2 more: https://git.openjdk.org/jdk/compare/e1d0a9c8...6c3370be

-------------

Changes: https://git.openjdk.org/jdk/pull/21276/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21276&range=10
  Stats: 268 lines in 10 files changed: 119 ins; 53 del; 96 mod
  Patch: https://git.openjdk.org/jdk/pull/21276.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/21276/head:pull/21276

PR: https://git.openjdk.org/jdk/pull/21276

From kvn at openjdk.org Tue Feb 18 19:24:22 2025
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 18 Feb 2025 19:24:22 GMT
Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2]
In-Reply-To: 
References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com>
Message-ID: 

On Wed, 12 Feb 2025 03:03:34 GMT, Chris Plummer wrote:

>> Before I forgot to answer you, @plummercj
>> I completely agree with your comment about cleaning up wrapper subclasses which do nothing. 
>> >> I think some wrapper subclasses for CodeBlob were kept because of `is*()` which were used only in `PStack` to print name. Why not use `getName()` for this purpose without big `if/else` there? >> >> An other purpose could be a place holder for additional information in a future which never come. >> >> Other wrapper provides information available in `CodeBlob`. Like `RuntimeStub. callerMustGCArguments()`. `_caller_must_gc_arguments` field is part of VM's `CodeBlob` class for some time now. Looks like I missed change in SA when did change in VM. >> >> So yes, feel free to clean this up. I will help with review. > >> I think some wrapper subclasses for CodeBlob were kept because of `is*()` which were used only in `PStack` to print name. Why not use `getName()` for this purpose without big `if/else` there? > > Possibly getName() didn't exist when PStack was first written. It would be good if PStack not only included the type name as it does now, but also the actual name of the blob, which getName() would return. > >> An other purpose could be a place holder for additional information in a future which never come. > > Yes, and you also see that with the Observer registration and the `Type type = db.lookupType()` code, which are only needed if you are going to lookup fields of the subtypes, which most don't ever do, yet they all have this code. > >> Other wrapper provides information available in `CodeBlob`. Like `RuntimeStub. callerMustGCArguments()`. `_caller_must_gc_arguments` field is part of VM's `CodeBlob` class for some time now. Looks like I missed change in SA when did change in VM. > > Yeah, that's not working right for CodeBlob subtypes that are not RuntimeStubs. Easy to fix though. > >> So yes, feel free to clean this up. I will help with review. > > Ok. Let me see where things are at after you are done with the PR. Thank you, @plummercj , for review. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2666228333 From kvn at openjdk.org Tue Feb 18 19:24:34 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:24:34 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v10] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sat, 15 Feb 2025 06:34:56 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove commented lines left by mistake Thank you all for reviews and suggestions. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2666253220 From chagedorn at openjdk.org Tue Feb 18 19:25:14 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 18 Feb 2025 19:25:14 GMT Subject: RFR: 8349180: Remove redundant initialization in ciField constructor In-Reply-To: References: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> Message-ID: On Tue, 18 Feb 2025 08:56:42 GMT, Christian Hagedorn wrote: >> In `ciField`'s ctor, `_name` is initialized twice. I think we can indeed apply the suggested fix and remove the second assignment. `_name` is set correctly the first time (and without the useless cast), and not modified in between. >> >> Thanks, >> Marc > >> Hello @marc-chevalier! I have already open a PR for this matter. PR is #23480. 
> > Hi @gustavosimon, the JBS issue was already assigned to @marc-chevalier. If you intend to work on an issue, please check the following: > - The issue is already assigned in JBS? > - Reach out to the assignee and ask if the person is currently working on the issue or has intentions to do so. If not, they can reassign it to you or someone else on your behalf (if you don't have a JBS account). > - The issue is unassigned in JBS? > - Assign the issue to yourself. > - If you don't have a JBS account: Reach out to someone who can assign it to him/herself on your behalf. > > > This avoids "stealing" work that was in progress or planned to do later or even worse doing completely duplicated work which is unfortunate. > @chhagedorn Got it. Actually, when I started working on this, the issue was unassigned. Oh, I see - looks like an unfortunate timing! > I will ask to @RealCLanger to assign it to me next times. Sounds good :-) > Can you review my OCA verification? We pinged @robilad to review it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23637#issuecomment-2666025700 From kvn at openjdk.org Tue Feb 18 19:26:04 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:26:04 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> Message-ID: On Tue, 18 Feb 2025 10:07:07 GMT, Emanuel Peter wrote: >>> That one is more tricky. Because what if the loop somehow gets folded away? How would we catch that? >> >> There is code that removes the `OuterStripMinedLoop` if the `CountedLoop` goes away and also, if I recall correctly, logic that verifies no ``OuterStripMinedLoop` is left behind without a `CountedLoop` so it's probably possible. Question is whether we want that or not. Seems like quite a bit of extra complexity. > >> > That one is more tricky. 
Because what if the loop somehow gets folded away? How would we catch that?
>
>> There is code that removes the `OuterStripMinedLoop` if the `CountedLoop` goes away and also, if I recall correctly, logic that verifies no `OuterStripMinedLoop` is left behind without a `CountedLoop` so it's probably possible. Question is whether we want that or not. Seems like quite a bit of extra complexity.
>
> Hmm ok, I see. I wonder how bad it is to leave the slow-loop there until after loop-opts. I mean it was already created, and it now has no loop-opts performed on it (it is stalled), so it just sits there like dead code. So I'm not sure there is really a performance benefit to kill it already a little earlier. Maybe a very small one?

@eme64, my main concern is that the loop multiversioning code will blow up inlining decisions. Our benchmarks may not be affected because we may never trigger the multiversioning code on our hardware (as Roland pointed out). Maybe you can force its generation and then compare performance. Do we really need it for this change? Can we simply generate an un-vectorized loop?

"x86 and aarch64 are unaffected". Which platforms are affected? Should we really take on extra code complexity for platforms we don't support?

Another question is what deoptimization `Action` is taken when the predicate fails. I saw the comment in the code: "We only want to use the auto-vectorization check as a trap once per bci." Does it mean you immediately deoptimize the code? Can we hit the uncommon trap a few times before deoptimization? Deoptimizing after one trap assumes we will process the same unaligned data again. That may be true in a test, but is it true in reality too? 

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666176147

From epeter at openjdk.org Tue Feb 18 19:26:07 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 18 Feb 2025 19:26:07 GMT
Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
In-Reply-To: 
References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com>
Message-ID: 

On Tue, 18 Feb 2025 16:10:20 GMT, Vladimir Kozlov wrote:

>>> > That one is more tricky. Because what if the loop somehow gets folded away? How would we catch that?
>>
>>> There is code that removes the `OuterStripMinedLoop` if the `CountedLoop` goes away and also, if I recall correctly, logic that verifies no `OuterStripMinedLoop` is left behind without a `CountedLoop` so it's probably possible. Question is whether we want that or not. Seems like quite a bit of extra complexity.
>>
>> Hmm ok, I see. I wonder how bad it is to leave the slow-loop there until after loop-opts. I mean it was already created, and it now has no loop-opts performed on it (it is stalled), so it just sits there like dead code. So I'm not sure there is really a performance benefit to kill it already a little earlier. Maybe a very small one?
>
> @eme64, my main concern is that the loop multiversioning code will blow up inlining decisions. Our benchmarks may not be affected because we may never trigger the multiversioning code on our hardware (as Roland pointed out). Maybe you can force its generation and then compare performance. Do we really need it for this change? Can we simply generate an un-vectorized loop?
>
> "x86 and aarch64 are unaffected". Which platforms are affected? Should we really take on extra code complexity for platforms we don't support?
>
> Another question is what deoptimization `Action` is taken when the predicate fails. I saw the comment in the code: "We only want to use the auto-vectorization check as a trap once per bci." 
Does it mean you immediately deoptimize the code? Can we hit the uncommon trap a few times before deoptimization? Deoptimizing after one trap assumes we will process the same unaligned data again. That may be true in a test, but is it true in reality too?

@vnkozlov

> "x86 and aarch64 are unaffected". Which platforms are affected? Should we really take on extra code complexity for platforms we don't support?

I would say most of the code here, i.e. the predicate and multi-version parts, is also relevant for the upcoming patch for aliasing analysis runtime-checks. These are especially important for `MemorySegment` cases where there could basically always be aliasing and only runtime-checks can help us vectorize. There is really only a small part that is specific to this fix, which is emitting the actual alignment-check.

> Do we really need it for this change? Can we simply generate an un-vectorized loop?

The alternatives on architectures that are actually affected by this bug:
- Not fix the bug, and risk possible `SIGBUS`. And on our platforms, that just means living with the HALT caused by `VerifyAlignVector`.
- Disable ALL vectorization of cases where we cannot guarantee statically that accesses are aligned. That would certainly disable all uses of `MemorySegment`, and that is probably not preferable.

> my main concern is that the loop multiversioning code will blow up inlining decisions. Our benchmarks may not be affected because we may never trigger the multiversioning code on our hardware (as Roland pointed out). Maybe you can force its generation and then compare performance.

Right. I suppose code size might be slightly affected. But I only multi-version if we are already going to pre-main-post the loop. And that means that the loop is already copied 3x, and doing 4x is not that noticeable I would suspect. Also, with OSR we already currently don't generate predicates, and so it is generating the multi-versioning for those. And I really could not measure any difference in the performance benchmarking. 
I doubt it is even noticeable on compile-time. > Another question is what deoptimization Action is taken when the predicate fails? I saw comment in code "We only want to use the auto-vectorization check as a trap once per bci." Does it mean you immediately deoptimize code? Can we hit uncommon trap few times before deoptimization? Deoptimization after one trap assumes we will process the same un-aligned data again. In a test it could be true but in reality is it true too? Yes, when we deopt for the bci, we recompile immediately. The alternative is to make the check per method, but then the risk is that one loop deopting causes other loops to be multi-versioned instead of using predicates too. Counting deopts per bci is currently not done at all. But I suppose we could make it a bit more "forgiving"... but is that worth it? I suppose if in reality we do see non-aligned cases (or in the future cases where we have problematic aliasing), then it will probably repeat, and is worth recompiling to handle both cases. But that is speculation, and we can discuss :) TLDR: @vnkozlov I would not have fixed the bug with such a heavy mechanism if I did not intend to use it for runtime-check for aliasing analysis. And 90% of the code here is reusable for that. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666357998 From kvn at openjdk.org Tue Feb 18 19:26:16 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:26:16 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Mon, 11 Nov 2024 14:40:09 GMT, Emanuel Peter wrote: > Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. > > **Background** > > With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. 
One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. > > **Problem** > > So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. > > > MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); > MemorySegment nativeUnaligned = nativeAligned.asSlice(1); > test3(nativeUnaligned); > > > When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! > > static void test3(MemorySegment ms) { > for (int i = 0; i < RANGE; i++) { > long adr = i * 4L; > int v = ms.get(ELEMENT_LAYOUT, adr); > ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); > } > } > > > **Solution: Runtime Checks - Predicate and Multiversioning** > > Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. > > I came up with 2 options where to place the runtime checks: > - A new "auto vectorization" Parse Predicate: > - This only works when predicates are available. > - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. > - Multiversion the loop: > - Create 2 copies of the loop (fast and slow loops). 
> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take > - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code. > - We "stall" the `... What probabilities for multi-version loops branches? Did non-vectorized version is move out of hot path in generated code? About actual probability value. I was thinking PROB_LIKELY_MAG(3). PROB_LIKELY_MAG(1) will only guarantee that vectorized loop will be first but it could be enough without moving other loop from hot path. Needs testing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666554240 PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666710345 From epeter at openjdk.org Tue Feb 18 19:26:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 19:26:18 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 18:29:42 GMT, Vladimir Kozlov wrote: > What probabilities for multi-version loops branches? Did non-vectorized version is move out of hot path in generated code? I'm not sure what you are asking. Are you asking what probability I'm setting for the multi-version branch? This is the loop selector, which later gets copied for each of the checks. `const LoopSelector loop_selector(lpt, opaque, PROB_FAIR, COUNT_UNKNOWN);` So 50%. But maybe you are suggesting it should really be biased towards the fast-path, right? What probability would you suggest? It should probably be fairly low, since there can be multiple checks added, and each one lowers the probability of arriving at the true-loop. So for scheduling, we should keep the probability high, so the true-loop is scheduled closer, right? Is that what you meant? 
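The runtime alignment check and the fast/slow loop selection discussed in this thread can be modelled outside the compiler. The following is only a minimal sketch under stated assumptions: `MultiversionSketch`, `isAligned` and `selectLoop` are hypothetical names, addresses are plain `long` values rather than real `MemorySegment` addresses, and the returned strings merely label which loop version would run. It is not the C2 implementation.

```java
// Hypothetical model of a multiversion check; not HotSpot code.
final class MultiversionSketch {
    // True if `address` is a multiple of `alignment`.
    // Assumption: `alignment` is a power of two, so a bit mask is sufficient.
    static boolean isAligned(long address, int alignment) {
        return (address & (alignment - 1)) == 0;
    }

    // Models the runtime check that selects a loop version:
    // the speculative (vectorizable) loop is taken only if the base is aligned.
    static String selectLoop(long baseAddress, int alignment) {
        return isAligned(baseAddress, alignment) ? "fast_loop" : "slow_loop";
    }

    public static void main(String[] args) {
        long alignedBase = 0x1000L;          // stands in for an aligned native allocation
        long unalignedBase = alignedBase + 1; // stands in for a slice taken at offset 1
        System.out.println(selectLoop(alignedBase, 4));
        System.out.println(selectLoop(unalignedBase, 4));
    }
}
```

In this model the aligned base takes the fast version and the base shifted by one byte falls back to the slow version, which mirrors the `asSlice(1)` case from the PR description.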
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666602599 From kvn at openjdk.org Tue Feb 18 19:26:19 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:26:19 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 18:45:34 GMT, Emanuel Peter wrote: > > What probabilities for multi-version loops branches? Did non-vectorized version is move out of hot path in generated code? > > I'm not sure what you are asking. Are you asking what probability I'm setting for the multi-version branch? > > This is the loop selector, which later gets copied for each of the checks. `const LoopSelector loop_selector(lpt, opaque, PROB_FAIR, COUNT_UNKNOWN);` > > So 50%. But maybe you are suggesting it should really be biased towards the fast-path, right? What probability would you suggest? It should probably be fairly low, since there can be multiple checks added, and each one lowers the probability of arriving at the true-loop. So for scheduling, we should keep the probability high, so the true-loop is scheduled closer, right? > > Is that what you meant? Yes. I want prioritize fast path assuming it is vectorized loop and that we get aligned data more frequently. It is actually difficult to judge without statistic from real applications. It should be reversed if an application works mostly on unaligned data. Can we profile alignment in Interpreter (and C1)? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666635167 From kvn at openjdk.org Tue Feb 18 19:26:11 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:26:11 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> Message-ID: <7CUvxR76ROhB7TB2qqbF2nQB5RNIj4GpRvKqZSw-dDM=.8917fc6a-3e84-4a9b-8df7-2eec07cfa768@github.com> On Tue, 18 Feb 2025 17:20:23 GMT, Emanuel Peter wrote: > Do we really need it for this change? Can we simply generate un-vectorized loop? To clarify. This question was about the second phase after we deoptimize and recompile when we hit a predicate check failure. I am fine with predicate change. > And I really could not measure any difference in the performance benchmarking. I doubt it is even noticeable on compile-time. Right. If a method has a vectorizable loop, it most likely has big generated code and is not inlined already. So adding a 4th loop may not affect it significantly. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666506254 PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666525354 From dlunden at openjdk.org Tue Feb 18 19:27:00 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 18 Feb 2025 19:27:00 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v4] In-Reply-To: <2HA3PtprmSP9ymdI0ZmaZvbATXe26DpdICQw0LnZvUY=.67910aa3-44fc-420b-a464-8b00f48ed536@github.com> References: <2HA3PtprmSP9ymdI0ZmaZvbATXe26DpdICQw0LnZvUY=.67910aa3-44fc-420b-a464-8b00f48ed536@github.com> Message-ID: On Tue, 18 Feb 2025 12:13:45 GMT, Roberto Castañeda Lozano wrote: >> This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node. This information can be useful e.g. 
in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). Here is an example of a memory subgraph with the extended "Show types" filter enabled: >> >> ![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717) >> >> #### Testing >> >> - tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). >> >> - Tested IGV manually on a few selected graphs. Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`). > > Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Remove redundant 'alias_field' property Still good! ------------- Marked as reviewed by dlunden (Committer). PR Review: https://git.openjdk.org/jdk/pull/23621#pullrequestreview-2624336330 From chagedorn at openjdk.org Tue Feb 18 19:26:57 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 18 Feb 2025 19:26:57 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v4] In-Reply-To: <2HA3PtprmSP9ymdI0ZmaZvbATXe26DpdICQw0LnZvUY=.67910aa3-44fc-420b-a464-8b00f48ed536@github.com> References: <2HA3PtprmSP9ymdI0ZmaZvbATXe26DpdICQw0LnZvUY=.67910aa3-44fc-420b-a464-8b00f48ed536@github.com> Message-ID: On Tue, 18 Feb 2025 12:13:45 GMT, Roberto Castañeda Lozano wrote: >> This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node. This information can be useful e.g. in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). 
Here is an example of a memory subgraph with the extended "Show types" filter enabled: >> >> ![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717) >> >> #### Testing >> >> - tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). >> >> - Tested IGV manually on a few selected graphs. Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`). > > Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Remove redundant 'alias_field' property Update looks good, thanks! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23621#pullrequestreview-2623954343 From rcastanedalo at openjdk.org Tue Feb 18 19:27:02 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 18 Feb 2025 19:27:02 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v4] In-Reply-To: References: <2HA3PtprmSP9ymdI0ZmaZvbATXe26DpdICQw0LnZvUY=.67910aa3-44fc-420b-a464-8b00f48ed536@github.com> Message-ID: On Tue, 18 Feb 2025 15:21:21 GMT, Christian Hagedorn wrote: > Update looks good, thanks! Thanks, Christian! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2666336473 From rcastanedalo at openjdk.org Tue Feb 18 19:27:03 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 18 Feb 2025 19:27:03 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v4] In-Reply-To: References: <2HA3PtprmSP9ymdI0ZmaZvbATXe26DpdICQw0LnZvUY=.67910aa3-44fc-420b-a464-8b00f48ed536@github.com> Message-ID: On Tue, 18 Feb 2025 17:16:37 GMT, Daniel Lundén wrote: > Still good! 
Thanks, Daniel! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2666357749 From kxu at openjdk.org Tue Feb 18 19:27:22 2025 From: kxu at openjdk.org (Kangcheng Xu) Date: Tue, 18 Feb 2025 19:27:22 GMT Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value [v2] In-Reply-To: References: Message-ID: > [JDK-8347555](https://bugs.openjdk.org/browse/JDK-8347555) is a redo of [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495), which was [first merged](https://git.openjdk.org/jdk/pull/20754) and then backed out due to a regression. This patch redoes the feature and fixes the bit shift overflow problem. For more information please refer to the previous PR. > > When constantizing multiplications (possibly in the form of `lshift`s), the multiplier is upgraded to long and then later narrowed to int if needed. However, when a `lshift` operand is exactly `32`, overflowing an int, using long has an unexpected result. (i.e., `(1 << 32) = 1` and `(int) (1L << 32) = 0`) > > The following was implemented to address this issue. > > if (UseNewCode2) { > *multiplier = bt == T_INT > ? (jlong) (1 << con->get_int()) // loss of precision is expected for int as it overflows > : ((jlong) 1) << con->get_int(); > } else { > *multiplier = ((jlong) 1 << con->get_int()); > } > > > Two new bitshift overflow tests were added. 
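The shift-by-32 discrepancy described above can be reproduced in plain Java. This is a small sketch, not the patch itself: `ShiftOverflowSketch` and its method names are hypothetical, and it relies only on the Java language rule that an `int` shift uses the low five bits of the shift distance while a `long` shift uses the low six bits.

```java
// Hypothetical illustration of int vs. long shift semantics in Java.
final class ShiftOverflowSketch {
    // int semantics: the shift distance is masked to 5 bits, so a distance of 32 acts like 0.
    static long multiplierAsInt(int shift) {
        return (long) (1 << shift);
    }

    // long semantics: the shift distance is masked to 6 bits, so a distance of 32 is a real shift.
    static long multiplierAsLong(int shift) {
        return 1L << shift;
    }

    // Computing in long and narrowing afterwards loses the single set bit for a distance of 32.
    static int narrowedFromLong(int shift) {
        return (int) (1L << shift);
    }

    public static void main(String[] args) {
        System.out.println(multiplierAsInt(32));
        System.out.println(multiplierAsLong(32));
        System.out.println(narrowedFromLong(32));
    }
}
```

For a shift distance of 32 the int computation yields 1 while the narrowed long computation yields 0, which is the mismatch the PR description points out; for distances below 32 the first and third methods agree.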
Kangcheng Xu has updated the pull request incrementally with two additional commits since the last revision: - use explicit argument types for overloaded java_shift_left() - use java_shift_left() ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23506/files - new: https://git.openjdk.org/jdk/pull/23506/files/92100991..92411fbe Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23506&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23506&range=00-01 Stats: 6 lines in 1 file changed: 0 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/23506.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23506/head:pull/23506 PR: https://git.openjdk.org/jdk/pull/23506 From haosun at openjdk.org Tue Feb 18 19:27:37 2025 From: haosun at openjdk.org (Hao Sun) Date: Tue, 18 Feb 2025 19:27:37 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:20:54 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. 
Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. > > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below -
>
> Benchmark                                   (size)  Mode   Cnt   Gain
> SelectFromBenchmark.selectFromByteVector      1024  thrpt    9   1.43
> SelectFromBenchmark.selectFromByteVector      2048  thrpt    9   1.48
> SelectFromBenchmark.selectFromDoubleVector    1024  thrpt    9  68.55
> SelectFromBenchmark.selectFromDoubleVector    2048  thrpt    9  72.07
> SelectFromBenchmark.selectFromFloatVector     1024  thrpt    9   1.69
> SelectFromBenchmark.selectFromFloatVector     2048  thrpt    9   1.52
> SelectFromBenchmark.selectFromIntVector       1024  thrpt    9   1.50
> SelectFromBenchmark.selectFromIntVector       2048  thrpt    9   1.52
> SelectFromBenchmark.selectFromLongVector      1024  thrpt    9  85.38
> SelectFromBenchmark.selectFromLongVector      2048  thrpt    9  80.93
> SelectFromBenchmark.selectFromShortVector     1024  thrpt    9   1.48
> SelectFromBenchmark.selectFromShortVector     2048  thrpt    9   1.49
>
> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Hi, here is my performance data on Nvidia Grace CPU with 128-bit SVE2. 
### data-1: UseSVE=0

                                                                                                                        Before                      After                        Gain
Benchmark                                                                          Mode   Threads  Samples  Unit    Score       Score Error (99.9%)  Score         Score Error (99.9%)  Ratio  Param: size
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector    thrpt  1        30       ops/ms  400.850304  1.109497             35229.489297  62.602965            87.88  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector    thrpt  1        30       ops/ms  201.425559  0.478769             18457.865560  21.655711            91.63  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector  thrpt  1        30       ops/ms  55.623907   0.238778             55.479367     0.259319             0.99   1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector  thrpt  1        30       ops/ms  27.700079   0.073881             27.782368     0.125652             1.00   2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector   thrpt  1        30       ops/ms  108.179064  0.490253             5137.062026   22.341864            47.48  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector   thrpt  1        30       ops/ms  54.354705   0.235878             2600.296050   11.659880            47.83  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector     thrpt  1        30       ops/ms  107.876699  0.362950             6092.072276   26.235411            56.47  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector     thrpt  1        30       ops/ms  54.173753   0.137934             3083.301351   23.996634            56.91  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector    thrpt  1        30       ops/ms  55.828919   0.197490             55.278519     0.543387             0.99   1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector    thrpt  1        30       ops/ms  27.841811   0.197133             27.701294     0.170357             0.99   2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector   thrpt  1        30       ops/ms  212.256878  0.610474             12284.067528  22.269728            57.87  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector   thrpt  1        30       ops/ms  106.237899  0.292940             6195.468269   10.163818            58.31  2048

Since "double" and "long" type are not supported on Neon, no obvious performance change is observed for `selectFromDoubleVector` or `selectFromLongVector`. It's as expected.

### data-2: UseSVE=2

                                                                                                                        Before                      After                        Gain
Benchmark                                                                          Mode   Threads  Samples  Unit    Score       Score Error (99.9%)  Score         Score Error (99.9%)  Ratio  Param: size
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector    thrpt  1        30       ops/ms  401.283626  1.185346             35212.914922  48.146517            87.75  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector    thrpt  1        30       ops/ms  200.442895  0.354457             18484.335484  31.659515            92.21  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector  thrpt  1        30       ops/ms  56.093979   0.259369             3870.627049   15.037254            69.00  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector  thrpt  1        30       ops/ms  27.761792   0.150907             1981.828293   2.749076             71.38  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector   thrpt  1        30       ops/ms  108.203125  0.593284             5791.568827   14.214889            53.52  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector   thrpt  1        30       ops/ms  54.388489   0.238700             2956.726043   10.504617            54.36  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector     thrpt  1        30       ops/ms  108.362433  0.290180             9389.915021   84.968822            86.65  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector     thrpt  1        30       ops/ms  53.982112   0.210067             4790.062993   2.123039             88.73  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector    thrpt  1        30       ops/ms  55.583716   0.222332             4725.276744   6.347278             85.01  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector    thrpt  1        30       ops/ms  27.967713   0.143626             2328.371821   15.504931            83.25  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector   thrpt  1        30       ops/ms  212.137873  0.586753             18484.651452  8.215293             87.13  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector   thrpt  1        30       ops/ms  105.692702  0.641425             9386.506869   80.958276            88.80  2048

note-1: "double" and "long" are supported on SVE2, hence we observed obvious performance uplifts for `selectFromDoubleVector` and `selectFromLongVector` now. It's as expected. note-2: I observed much difference between data-2 and your data listed in the commit msg. In your data, `1.4~1.7x` is gained for "byte|float|int|short" types. However, my data is much bigger, i.e. `53~92x`. It's a bit weird. src/hotspot/cpu/aarch64/aarch64.ad line 889: > 887: ); > 888: > 889: // Class for vector register v18 nit: use upper case Suggestion: // Class for vector register V18 src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 4225: > 4223: > 4224: // SVE2 programmable table lookup in two vector table > 4225: void sve2_tbl(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn1, I suggest using `sve_tbl` here. 1. the SVE1 insn is `sve_tbl` as well, but we can distinguish them thanks to function overloading 2. following the same naming style of other sve2 instructions ------------- PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-2623674691 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1959810281 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1959820604 From bkilambi at openjdk.org Tue Feb 18 19:27:38 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 18 Feb 2025 19:27:38 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 15:06:17 GMT, Hao Sun wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. 
>> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Hi, here is my performance data on Nvidia Grace CPU with 128-bit SVE2. 
> > > ### data-1: UseSVE=0 > > > Before After Gain > Benchmark Mode Threads Samples Unit Score Score Error (99.9%) Score Score Error (99.9%) Ratio Param: size > org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector thrpt 1 30 ops/ms 400.850304 1.109497 35229.489297 62.602965 87.88 1024 > org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector thrpt 1 30 ops/ms 201.425559 0.478769 18457.865560 21.655711 91.63 2048 > org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1 30 ops/ms 55.623907 0.238778 55.479367 0.259319 0.99 1024 > org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1 30 ops/ms 27.700079 0.073881 27.782368 0.125652 1.00 2048 > org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector thrpt 1 30 ops/ms 108.179064 0.490253 5137.062026 22.341864 47.48 1024 > org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector thrpt 1 30 ops/ms 54.354705 0.235878 2600.296050 11.659880 47.83 2048 > org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector thrpt 1 30 ops/ms 107.876699 0.362950 6092.072276 26.235411 56.47 1024 > org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector thrpt 1 30 ops/ms 54.173753 0.137934 3083.301351 23.996634 56.91 2048 > org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector thrpt 1 30 ops/ms 55.828919 0.197490 55.278519 0.543387 0.99 1024 > org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector thrpt 1 30 ops/ms 27.841811 0.197133 27.701294 0.170357 0.99 2048 > org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector thrpt 1 30 ops/ms 212.256878 0.610474 12284.067528 22.26... 
Thanks for your review comments @shqking This commit added mid-end support for SelectFromTwoVector operation - https://github.com/openjdk/jdk/commit/709914fc92dd180c8f081ff70ef476554a04f4ce. It adds intrinsics for SelectFromTwoVector operation and on machines that do not support this operation, a lowering vector operation (VectorRearrange + VectorBlend combination) is generated. On aarch64 after the above commit, we expect the lowering operations to be generated as we have support for both of these operations but in the inline expander for SelectFromTwoVector, it did not consider targets that do not need to generate VectorLoadShuffle node (like aarch64) for the Lowering operation - https://github.com/openjdk/jdk/blob/e1d0a9c832ef3e92faaed7f290ff56c0ed8a9d94/src/hotspot/share/opto/vectorIntrinsics.cpp#L2736. As a result, the compiler was not generating the VectorRearrange + VectorBlend operation on aarch64 as it is supposed to when SelectFromTwoVector is not supported. The default java impl was being executed which is too slow. So after my small change in vectorIntrinsics.cpp file, the Lowered vector operations are being correctly generated. I felt it would be right to compare the numbers after the change I made in vectorIntrinsics.cpp file with this patch that adds support for SelectFromTwoVector so that we are comparing performance with (VectorRearrange + VectorBlend) vs SelectFromTwoVector rather than compare it with default java implementation. If we compare the performance of this patch with the master branch then the numbers you have shown are correct. 
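The relationship between a two-table lookup and the lowered form (two rearranges followed by one blend) can be illustrated with a scalar model. This is a hedged sketch only: `SelectFromTwoSketch` and its methods are hypothetical, lanes are modelled as `int[]` elements, and the wrapping of indexes into the range `[0, 2 * n)` is an assumption of this model rather than a statement about the generated IR.

```java
// Hypothetical scalar model; it does not use the Vector API or any HotSpot code.
final class SelectFromTwoSketch {
    // Direct two-table lookup: indexes below n pick from `a`, the rest pick from `b`.
    static int[] selectFromTwo(int[] idx, int[] a, int[] b) {
        int n = a.length;
        int[] result = new int[n];
        for (int k = 0; k < n; k++) {
            int i = Math.floorMod(idx[k], 2 * n); // assumed wrapping into [0, 2n)
            result[k] = i < n ? a[i] : b[i - n];
        }
        return result;
    }

    // Lowered form: rearrange both inputs with the index wrapped to [0, n),
    // then blend the two results on the condition "index selects the second table".
    static int[] rearrangeAndBlend(int[] idx, int[] a, int[] b) {
        int n = a.length;
        int[] result = new int[n];
        for (int k = 0; k < n; k++) {
            int i = Math.floorMod(idx[k], 2 * n);
            int fromA = a[i % n];                  // rearrange of the first input
            int fromB = b[i % n];                  // rearrange of the second input
            result[k] = i >= n ? fromB : fromA;    // blend
        }
        return result;
    }

    public static void main(String[] args) {
        int[] idx = {0, 5, 2, 7};
        int[] a = {10, 11, 12, 13};
        int[] b = {20, 21, 22, 23};
        System.out.println(java.util.Arrays.toString(selectFromTwo(idx, a, b)));
        System.out.println(java.util.Arrays.toString(rearrangeAndBlend(idx, a, b)));
    }
}
```

Both methods compute the same result in this model; the difference discussed in the thread is purely about which machine instructions are generated for it.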
Hope this explanation helps :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23570#issuecomment-2666070199 From kvn at openjdk.org Tue Feb 18 20:11:04 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 20:11:04 GMT Subject: Integrated: 8349088: De-virtualize Codeblob and nmethod In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 17:45:30 GMT, Vladimir Kozlov wrote: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp This pull request has now been integrated. 
Changeset: 46d4a601 Author: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/46d4a601e04f90b11d4ccc97a49f4e7010b4fd83 Stats: 529 lines in 23 files changed: 262 ins; 152 del; 115 mod 8349088: De-virtualize Codeblob and nmethod Co-authored-by: Stefan Karlsson Co-authored-by: Chris Plummer Reviewed-by: cjplummer, aboldtch, dlong ------------- PR: https://git.openjdk.org/jdk/pull/23533 From dnsimon at openjdk.org Tue Feb 18 20:21:01 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 18 Feb 2025 20:21:01 GMT Subject: Integrated: 8350263: JvmciNotifyBootstrapFinishedEventTest intermittently times out In-Reply-To: References: Message-ID: <82moTqja32LQDlF_pT43HHmzGr906Tkc9SSsw8WxX54=.2649edff-fcfd-4934-8d59-627990620e31@github.com> On Tue, 18 Feb 2025 14:11:04 GMT, Doug Simon wrote: > [JDK-8346781](https://bugs.openjdk.org/browse/JDK-8346781) changed `JVMCIServiceLocator` such that set of providers is computed eagerly in `JVMCIServiceLocator.`. There are some JVMCI test classes that directly subclassed `JVMCIServiceLocator` which meant deadlock could occur between the main thread running the test and a JVMCI compiler thread. This PR fixes this by ensuring that `JVMCIServiceLocator` providers are separate classes from the main test class. This pull request has now been integrated. 
Changeset: f2b4e12a Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/f2b4e12afe67086a2ae08081fd545e5ce4d731fd Stats: 60 lines in 4 files changed: 21 ins; 12 del; 27 mod 8350263: JvmciNotifyBootstrapFinishedEventTest intermittently times out Reviewed-by: yzheng, never ------------- PR: https://git.openjdk.org/jdk/pull/23679 From vlivanov at openjdk.org Tue Feb 18 21:46:53 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 18 Feb 2025 21:46:53 GMT Subject: RFR: 8350211: CTW: Attempt to preload all classes in constant pool In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 11:09:40 GMT, Aleksey Shipilev wrote: > CTW runners do preloading for constant pools ahead of time. I believe this is done to expose more loaded classes to the compilations, so to extend the compilation scope. > > Unfortunately, current code catches the first exception when loading the constant pool and stops preloading. This routinely happens when CTW runner processes a 3rd party JAR, where dependencies might normally be in other JARs. > > I believe we should attempt to resolve all constant pool entries when preloading is requested. This would likely expand the scope of CTW testing. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` still passes Nice finding, Aleksey. Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23673#pullrequestreview-2625102449 From dlong at openjdk.org Wed Feb 19 00:37:14 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Feb 2025 00:37:14 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v3] In-Reply-To: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: > When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. > > In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. > > Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. I need help testing those asserts for arm32, ppc, riscv, and s390. 
Dean Long has updated the pull request incrementally with one additional commit since the last revision: Stricter assertion on ppc64 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23557/files - new: https://git.openjdk.org/jdk/pull/23557/files/a7a0ed7a..ebf10dae Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23557&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23557&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23557.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23557/head:pull/23557 PR: https://git.openjdk.org/jdk/pull/23557 From dlong at openjdk.org Wed Feb 19 00:39:54 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Feb 2025 00:39:54 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v2] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Mon, 17 Feb 2025 11:27:17 GMT, Richard Reingruber wrote: >>> I think you can make the assertion a little stricter like this [reinrich at 9c3c8a3](https://github.com/reinrich/jdk/commit/9c3c8a33a29b9ae6c4c703992b306dc0cbbcd2f0). >> >> Regarding this stricter version, why are you using is_bottom_frame instead of is_top_frame? The deoptimization code seems to name the most recent leaf frame "top". That sounds like what frame::top_ijava_frame_abi_size is for too. > >> > I think you can make the assertion a little stricter like this [reinrich at 9c3c8a3](https://github.com/reinrich/jdk/commit/9c3c8a33a29b9ae6c4c703992b306dc0cbbcd2f0). >> >> Regarding this stricter version, why are you using is_bottom_frame instead of is_top_frame? The deoptimization code seems to name the most recent leaf frame "top". That sounds like what frame::top_ijava_frame_abi_size is for too. 
>
> Correct, the top frame has a frame::top_ijava_frame_abi, but the assertion is about the abi section in the current frame's caller, and the bottom frame's caller also has a top_ijava_frame_abi because i2c doesn't modify it.
>
> Continue reading if you're interested in more details...
>
> As said, the i2c adapter does *not* trim the caller frame as the interpreter would,
> replacing its large `top_ijava_frame_abi` with a smaller
> `parent_ijava_frame_abi`.
>
> Example: compiled frame DEOPTEE is replaced with 3 interpreted frames
>
> Stack before deoptimization
>
> |                        |
> | Interpreted CALLER     |
> | of DEOPTEE frame       |
> |                        |
> +------------------------+
> |                        |
> | top_ijava_frame_abi    |
> |                        |
> +========================+
> |                        |
> | Compiled               |
> | DEOPTEE                |
> |                        |
> +------------------------+
> | java_abi               |
> +========================+
>
> Stack when assertion is checked
> (i.e. after DEOPTEE was replaced by corresponding inter. frames)
>
> |                        |
> | Interpreted CALLER     |
> | of DEOPTEE frame       |
> |                        |
> +------------------------+
> |                        |
> | top_ijava_frame_abi    | <- i2c keeps large abi
> |                        |
> +========================+
> |                        | <- bottom frame
> | Interpreted Frame 0    |
> | corresp. to DEOPTEE    |
> |                        |
> +------------------------+
> | parent_ijava_frame_abi |
> +========================+
> |                        |
> | Interpreted Frame 1    |
> | (inlined by DEOPTEE)   |
> |                        |
> +------------------------+
> | parent_ijava_frame_abi |
> +========================+
> |                        | <- top frame
> | Interpreted Frame 2    |
> | (inlined by DEOPTEE)   |
> |                        |
> +------------------------+
> |                        |
> | top_ijava_frame_abi    |
> |                        |
> +========================+
>
> Notes:
> (referring to the frame sections rather than the C++ types)
>
> - top_ijava_frame_abi comp...

@reinrich OK, got it! I pushed your change. Could you also comment on whether we could use the value of sender_sp here instead?
------------- PR Comment: https://git.openjdk.org/jdk/pull/23557#issuecomment-2667234850 From fyang at openjdk.org Wed Feb 19 00:51:59 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 19 Feb 2025 00:51:59 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v5] In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 14:30:28 GMT, Hamlin Li wrote: >> src/hotspot/cpu/riscv/riscv_v.ad line 2494: >> >>> 2492: >>> 2493: instruct reduce_mulF(fRegF dst, fRegF fsrc, vReg vsrc, >>> 2494: vReg tmp1, vReg tmp2) %{ >> >> Seems to me that we need a `predicate(!n->as_Reduction()->requires_strict_order())` for all these newly-added floating-point reduce multiply matched rules. The reason is that your `reduce_mul_fp_v` doesn't respect the order of the operands. And we should better rename the name of the match rules to something like `unordered_reduce_mulF`. >> >> Reference: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L5281 > > Yes, you're right, fixed. Thanks! Thanks for the update. So now the FP reduce multiply will only apply to the Vector-API use case. Did you check the auto-vectorization use case? I suppose the two tests `ProdRed_Double.java` & `ProdRed_Float.java` which are enabled for riscv64 by this PR won't work with this update. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1960789839 From duke at openjdk.org Wed Feb 19 01:23:58 2025 From: duke at openjdk.org (simon) Date: Wed, 19 Feb 2025 01:23:58 GMT Subject: RFR: 8349180: Remove redundant initialization in ciField constructor In-Reply-To: References: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> Message-ID: On Tue, 18 Feb 2025 15:19:26 GMT, Christian Hagedorn wrote: >>> Hello @marc-chevalier! I have already open a PR for this matter. PR is #23480. >> >> Hi @gustavosimon, the JBS issue was already assigned to @marc-chevalier. 
If you intend to work on an issue, please check the following: >> - The issue is already assigned in JBS? >> - Reach out to the assignee and ask if the person is currently working on the issue or has intentions to do so. If not, they can reassign it to you or someone else on your behalf (if you don't have a JBS account). >> - The issue is unassigned in JBS? >> - Assign the issue to yourself. >> - If you don't have a JBS account: Reach out to someone who can assign it to him/herself on your behalf. >> >> >> This avoids "stealing" work that was in progress or planned to do later or even worse doing completely duplicated work which is unfortunate. > >> @chhagedorn Got it. Actually, when I started working on this, the issue was unassigned. > > Oh, I see - looks like an unfortunate timing! > >> I will ask to @RealCLanger to assign it to me next times. > > Sounds good :-) > >> Can you review my OCA verification? > > We pinged @robilad to review it. @chhagedorn Thanks! Looking forward! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23637#issuecomment-2667285976 From haosun at openjdk.org Wed Feb 19 01:31:54 2025 From: haosun at openjdk.org (Hao Sun) Date: Wed, 19 Feb 2025 01:31:54 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 15:34:11 GMT, Bhavana Kilambi wrote: >> Hi, here is my performance data on Nvidia Grace CPU with 128-bit SVE2. 
>>
>>
>> ### data-1: UseSVE=0
>>
>>
>> | Benchmark | Mode | Threads | Samples | Unit | Before: Score | Before: Score Error (99.9%) | After: Score | After: Score Error (99.9%) | Gain: Ratio | Param: size |
>> |---|---|---|---|---|---|---|---|---|---|---|
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector | thrpt | 1 | 30 | ops/ms | 400.850304 | 1.109497 | 35229.489297 | 62.602965 | 87.88 | 1024 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector | thrpt | 1 | 30 | ops/ms | 201.425559 | 0.478769 | 18457.865560 | 21.655711 | 91.63 | 2048 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector | thrpt | 1 | 30 | ops/ms | 55.623907 | 0.238778 | 55.479367 | 0.259319 | 0.99 | 1024 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector | thrpt | 1 | 30 | ops/ms | 27.700079 | 0.073881 | 27.782368 | 0.125652 | 1.00 | 2048 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector | thrpt | 1 | 30 | ops/ms | 108.179064 | 0.490253 | 5137.062026 | 22.341864 | 47.48 | 1024 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector | thrpt | 1 | 30 | ops/ms | 54.354705 | 0.235878 | 2600.296050 | 11.659880 | 47.83 | 2048 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector | thrpt | 1 | 30 | ops/ms | 107.876699 | 0.362950 | 6092.072276 | 26.235411 | 56.47 | 1024 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector | thrpt | 1 | 30 | ops/ms | 54.173753 | 0.137934 | 3083.301351 | 23.996634 | 56.91 | 2048 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector | thrpt | 1 | 30 | ops/ms | 55.828919 | 0.197490 | 55.278519 | 0.543387 | 0.99 | 1024 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector | thrpt | 1 | 30 | ops/ms | 27.841811 | 0.197133 | 27.701294 | 0.170357 | 0.99 | 2048 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector | thrpt | 1 | 30 | ops/ms | 212.256878 | ...
> > Thanks for your review comments @shqking > This commit added mid-end support for SelectFromTwoVector operation - https://github.com/openjdk/jdk/commit/709914fc92dd180c8f081ff70ef476554a04f4ce. It adds intrinsics for SelectFromTwoVector operation and on machines that do not support this operation, a lowering vector operation (VectorRearrange + VectorBlend combination) is generated. > > On aarch64 after the above commit, we expect the lowering operations to be generated as we have support for both of these operations but in the inline expander for SelectFromTwoVector, it did not consider targets that do not need to generate VectorLoadShuffle node (like aarch64) for the Lowering operation - https://github.com/openjdk/jdk/blob/e1d0a9c832ef3e92faaed7f290ff56c0ed8a9d94/src/hotspot/share/opto/vectorIntrinsics.cpp#L2736. > As a result, the compiler was not generating the VectorRearrange + VectorBlend operation on aarch64 as it is supposed to when SelectFromTwoVector is not supported. The default java impl was being executed which is too slow. So after my small change in vectorIntrinsics.cpp file, the Lowered vector operations are being correctly generated. > > I felt it would be right to compare the numbers after the change I made in vectorIntrinsics.cpp file with this patch that adds support for SelectFromTwoVector so that we are comparing performance with (VectorRearrange + VectorBlend) vs SelectFromTwoVector rather than compare it with default java implementation. If we compare the performance of this patch with the master branch then the numbers you have shown are correct. Hope this explanation helps :) Thanks for your explanation. Sounds reasonable to me. 
@Bhavana-Kilambi

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23570#issuecomment-2667296885

From syan at openjdk.org Wed Feb 19 01:32:56 2025
From: syan at openjdk.org (SendaoYan)
Date: Wed, 19 Feb 2025 01:32:56 GMT
Subject: RFR: 8350197: [UBSAN] Node::dump_idx reported float-cast-overflow
In-Reply-To:
References:
Message-ID:

On Mon, 17 Feb 2025 13:04:34 GMT, SendaoYan wrote:

> Hi all,
>
> The function 'Node::dump_idx(bool, outputStream*, Node::DumpConfig*)' in file src/hotspot/share/opto/node.cpp:2430 was reported by clang17's UndefinedBehaviorSanitizer with "runtime error: -inf is outside the range of representable values of type 'unsigned int'".
>
> This PR adds an extra check on the argument before passing it to `log10`. Risk is low.
>
> Additional testing:
>
> - [x] Jtreg tests (including tier1/2/3 etc.) on linux-x64 with release build
> - [x] Jtreg tests (including tier1/2/3 etc.) on linux-aarch64 with release build
> - [x] Jtreg tests (including tier1/2/3 etc.) on linux-x64 with fastdebug build
> - [x] Jtreg tests (including tier1/2/3 etc.) on linux-aarch64 with fastdebug build
>
> The code snippet below demonstrates the undefined behaviour of float-cast-overflow:
>
>
> #include <stdio.h>
> #include <math.h>
> int input = 0;
> int main()
> {
>   printf("result = %lf\n", log10((double)input));
>   printf("result = %u\n", (unsigned int)log10((double)input));
>   printf("result = %u\n", input==0 ? 0 : (unsigned int)log10((double)input));
>   return 0;
> }
>
>
>
> > clang -fsanitize=undefined log10.c -lm && ./a.out
> result = -inf
> log10.c:9:27: runtime error: -inf is outside the range of representable values of type 'unsigned int'
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior log10.c:9:27 in
> result = 0
> result = 0

Thanks for the reviews. The test runs have finished with no new failures.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23662#issuecomment-2667298076

From syan at openjdk.org Wed Feb 19 01:32:56 2025
From: syan at openjdk.org (SendaoYan)
Date: Wed, 19 Feb 2025 01:32:56 GMT
Subject: Integrated: 8350197: [UBSAN] Node::dump_idx reported float-cast-overflow
In-Reply-To:
References:
Message-ID:

On Mon, 17 Feb 2025 13:04:34 GMT, SendaoYan wrote:

> Hi all,
>
> The function 'Node::dump_idx(bool, outputStream*, Node::DumpConfig*)' in file src/hotspot/share/opto/node.cpp:2430 was reported by clang17's UndefinedBehaviorSanitizer with "runtime error: -inf is outside the range of representable values of type 'unsigned int'".
>
> This PR adds an extra check on the argument before passing it to `log10`. Risk is low.
>
> Additional testing:
>
> - [x] Jtreg tests (including tier1/2/3 etc.) on linux-x64 with release build
> - [x] Jtreg tests (including tier1/2/3 etc.) on linux-aarch64 with release build
> - [x] Jtreg tests (including tier1/2/3 etc.) on linux-x64 with fastdebug build
> - [x] Jtreg tests (including tier1/2/3 etc.) on linux-aarch64 with fastdebug build
>
> The code snippet below demonstrates the undefined behaviour of float-cast-overflow:
>
>
> #include <stdio.h>
> #include <math.h>
> int input = 0;
> int main()
> {
>   printf("result = %lf\n", log10((double)input));
>   printf("result = %u\n", (unsigned int)log10((double)input));
>   printf("result = %u\n", input==0 ? 0 : (unsigned int)log10((double)input));
>   return 0;
> }
>
>
>
> > clang -fsanitize=undefined log10.c -lm && ./a.out
> result = -inf
> log10.c:9:27: runtime error: -inf is outside the range of representable values of type 'unsigned int'
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior log10.c:9:27 in
> result = 0
> result = 0

This pull request has now been integrated.
Changeset: 04659a40 Author: SendaoYan URL: https://git.openjdk.org/jdk/commit/04659a40736610164855ac161120e63fcd46fe31 Stats: 4 lines in 2 files changed: 0 ins; 0 del; 4 mod 8350197: [UBSAN] Node::dump_idx reported float-cast-overflow Reviewed-by: chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/23662 From fyang at openjdk.org Wed Feb 19 02:09:58 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 19 Feb 2025 02:09:58 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v2] In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 11:15:48 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch? >> >> Currently, `string_compare` code is a bit complicated, main reasons include: >> 1. it has 2 piece of code respectively for LU and UL case, this is not necessary, basically LU and UL behaviour quite similar. >> 2. it mixed LL/UU and LU/UL case together, better to separate them, as they are quite different from each other. >> >> This is not good for code reading and maintaining. >> >> >> So, this patch does following refactoring: >> 1. merge LU and UL code into one, i.e. remove UL code. >> 2. seperate the code into 2 methods: LL/UU and LU/UL. >> 3. some other misc improvement. >> >> I could do the following refactoring in another following pr, as in this pr I'm just moving code and removing code, it's easier to do it and review it. In particular the first one, as it needs to rewrite the existing code for UL/LU case. >> 1. move alignment code of `generate_compare_long_string_different_encoding` upwards to `string_compare_long_LU`. >> 2. make `SHORT_STRING` case simpler. 
>>
>>
>>
>> Thanks
>
> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision:
>
>   fix temp registers; move code

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1386:

> 1384:
> 1385: // Compare longwords
> 1386: void C2_MacroAssembler::string_compare_long_LL_UU(Register result, Register str1, Register str2,

Do you mind renaming this to `C2_MacroAssembler::string_compare_long_same_encoding`?

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1452:

> 1450:   beq(tmp1, tmp2, *DONE);
> 1451:
> 1452:

A single empty line will do I think.

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1474:

> 1472:
> 1473: // Compare longwords
> 1474: void C2_MacroAssembler::string_compare_long_LU(Register result, Register strL, Register strU,

And rename this to `C2_MacroAssembler::string_compare_long_different_encoding`. We can pass one extra param (say `const bool isLU`) to distinguish the two different cases.

Also, I think we need to pass `str1` and `str2` from the callsite directly, as the final difference calculation needs to respect the order. The current approach doesn't seem correct: it can only distinguish L and U from the two strings, but it doesn't know the order of the two strings at all.

A Java program that hopefully helps demonstrate the effect of the order of the two strings:

    String author = "author";
    String book = "book";
    String duplicateBook = "book";

    assertThat(author.compareTo(book))
      .isEqualTo(-1);
    assertThat(book.compareTo(author))
      .isEqualTo(1);
    assertThat(duplicateBook.compareTo(book))
      .isEqualTo(0);

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1522:

> 1520:   beq(tmpL, tmpU, *DONE);
> 1521:
> 1522:

Similar here. Let's keep a single empty line.
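
The ordering concern raised in this review for the LU/UL case can also be observed at the Java level. The following is a minimal sketch, not part of the PR: it assumes compact strings are enabled (the JDK default), so that an all-Latin-1 string and a string containing U+0100 use different internal encodings. The class and method names are illustrative only.

```java
public class CompareOrderDemo {
    // Compares the pair in both argument orders so the sign flip is visible.
    static int[] compareBothWays(String a, String b) {
        return new int[] { a.compareTo(b), b.compareTo(a) };
    }

    public static void main(String[] args) {
        String latin1 = "abc";      // every char fits in one byte
        String utf16 = "ab\u0100";  // U+0100 does not fit in Latin-1

        // The first mismatch is at index 2: 'c' (99) versus U+0100 (256).
        // String.compareTo returns the difference of the two char values,
        // so swapping the operands must negate the result.
        int[] r = compareBothWays(latin1, utf16);
        System.out.println("L.compareTo(U) = " + r[0]);
        System.out.println("U.compareTo(L) = " + r[1]);
    }
}
```

An intrinsic that only knows which operand is Latin-1 and which is UTF-16, but not which one was the receiver, could not tell these two calls apart, which is the point made in the review comment.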
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1960822148 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1960821131 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1960826250 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1960821485 From kvn at openjdk.org Wed Feb 19 03:17:54 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Feb 2025 03:17:54 GMT Subject: RFR: 8350287: Cleanup SA's support for CodeBlob subclasses In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 02:08:28 GMT, Chris Plummer wrote: > Basically for these 3 CodeBlob types getName() will no longer include the CodeBlob type. I think this is fine: output is informative enough to figure out type of blob. Please check SA tests if they look for old output. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23684#issuecomment-2667409688 From kvn at openjdk.org Wed Feb 19 03:25:52 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Feb 2025 03:25:52 GMT Subject: RFR: 8350287: Cleanup SA's support for CodeBlob subclasses In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 02:08:28 GMT, Chris Plummer wrote: > There is a lot of subclassing of CodeBlob types done in SA to mimic hotspot, but most of it is unnecessary. The generic CodeBlob class can handle all support needed by most of the subclasses. The only subclasses we need to keep around around NMethod, RuntimeStub, and UpcallStub, since they all have special support in SA. I also kept around RuntimeBlob so RuntimeStub can continue to inherit from it and be consistent with hotspot, but it's not actually necessary, and I'm more than happy to remove it also. > > I also cleaned up the PStack support for CodeBlobs. It can just use CodeBlob.getName() rather than trying to figure out the type of the CodeBlob instance to print out type name. This allows us to get rid of most isXXX() APIs. 
It also provides more useful output in some cases. > > There is some minor loss of functionality in some of the CodeBlob subtypes I removed. For example this is what AdapterBlob.getName() looked like (it is now gone): > > > public String getName() { > return "AdapterBlob: " + super.getName(); > } > > > So now we just use the default CodeBlob.getName(), which is what super.getName() would up execute. I think for AdapterBlob this always returns "I2C/C2I adapters", so now you only get this rather than "AdapterBlob: I2C/C2I adapters". We have a similar loss of getName() detail with MethodHandlesAdapterBlob (now returns "MethodHandles adapters") and VtableBlob (now returns "vtable chunks"). Basically for these 3 CodeBlob types getName() will no longer include the CodeBlob type. I could special case them in CodeBlob.getName() by fetching the kind to determine what the proper name should be. Let me know if you think it is worth it. src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/RuntimeStub.java line 47: > 45: private static void initialize(TypeDataBase db) { > 46: Type type = db.lookupType("RuntimeStub"); > 47: callerMustGCArgumentsField = type.getCIntegerField("_caller_must_gc_arguments"); This field is in `CodeBlob` since JDK 23 [JDK-8329433](https://bugs.openjdk.org/browse/JDK-8329433) https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/codeBlob.hpp#L126 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23684#discussion_r1960893284 From cjplummer at openjdk.org Wed Feb 19 04:34:53 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Wed, 19 Feb 2025 04:34:53 GMT Subject: RFR: 8350287: Cleanup SA's support for CodeBlob subclasses In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 03:14:49 GMT, Vladimir Kozlov wrote: > > Basically for these 3 CodeBlob types getName() will no longer include the CodeBlob type. > > I think this is fine: output is informative enough to figure out type of blob. 
Please check SA tests if they look for old output. The tests are all passing. > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/RuntimeStub.java line 47: > >> 45: private static void initialize(TypeDataBase db) { >> 46: Type type = db.lookupType("RuntimeStub"); >> 47: callerMustGCArgumentsField = type.getCIntegerField("_caller_must_gc_arguments"); > > This field is in `CodeBlob` since JDK 23 [JDK-8329433](https://bugs.openjdk.org/browse/JDK-8329433) https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/codeBlob.hpp#L126 Ah, right. I forgot you had mentioned that. I think that means I can rid of RuntimeStub and RuntimeBlob once this code is moved to CodeBlob. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23684#issuecomment-2667484513 PR Review Comment: https://git.openjdk.org/jdk/pull/23684#discussion_r1960935263 From jkarthikeyan at openjdk.org Wed Feb 19 05:14:31 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 19 Feb 2025 05:14:31 GMT Subject: RFR: 8349563: Improve AbsNode::Value() for integer types Message-ID: Hi all, This is a small patch that improves the implementation of Value() for `AbsINode` and `AbsLNode` by returning the absolute value of the input range. Most of the logic is trivial except for the special case where `_lo == jint_min/jlong_min` which must return the entire type range when encountered, for which I've added a small proof in the comments. I've also added some unit tests and updated the file to limit IR check platforms with more granularity. Thoughts and reviews would be appreciated! 
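
The range rule described in this RFR can be sketched in plain Java. This is only an illustration of the idea under stated assumptions, not the C2 code from the PR: `absRange` is a hypothetical helper that models an integer type as a `{lo, hi}` pair, and the fallback to the full `int` range when `lo` is `Integer.MIN_VALUE` follows the description in the message above.

```java
import java.util.Arrays;

public class AbsRangeDemo {
    /** Returns {lo, hi} bounding Math.abs(x) for every x in [lo, hi]. */
    static int[] absRange(int lo, int hi) {
        if (lo == Integer.MIN_VALUE) {
            // Math.abs(Integer.MIN_VALUE) overflows back to Integer.MIN_VALUE,
            // so the sketch gives up and returns the whole int range.
            return new int[] { Integer.MIN_VALUE, Integer.MAX_VALUE };
        }
        if (lo >= 0) {
            return new int[] { lo, hi };            // already non-negative
        }
        if (hi <= 0) {
            return new int[] { -hi, -lo };          // all non-positive: mirror
        }
        return new int[] { 0, Math.max(-lo, hi) };  // range straddles zero
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(absRange(-5, 3)));
        System.out.println(Arrays.toString(absRange(-7, -2)));
        System.out.println(Arrays.toString(absRange(Integer.MIN_VALUE, 0)));
    }
}
```

Negating `lo` is safe in the last two branches because the `Integer.MIN_VALUE` case has already been handled.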
------------- Commit messages: - Improve AbsNode::Value Changes: https://git.openjdk.org/jdk/pull/23685/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23685&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349563 Stats: 145 lines in 2 files changed: 136 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23685.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23685/head:pull/23685 PR: https://git.openjdk.org/jdk/pull/23685 From cjplummer at openjdk.org Wed Feb 19 05:49:56 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Wed, 19 Feb 2025 05:49:56 GMT Subject: RFR: 8350287: Cleanup SA's support for CodeBlob subclasses [v2] In-Reply-To: References: Message-ID: > There is a lot of subclassing of CodeBlob types done in SA to mimic hotspot, but most of it is unnecessary. The generic CodeBlob class can handle all support needed by most of the subclasses. The only subclasses we need to keep around around NMethod, RuntimeStub, and UpcallStub, since they all have special support in SA. I also kept around RuntimeBlob so RuntimeStub can continue to inherit from it and be consistent with hotspot, but it's not actually necessary, and I'm more than happy to remove it also. > > I also cleaned up the PStack support for CodeBlobs. It can just use CodeBlob.getName() rather than trying to figure out the type of the CodeBlob instance to print out type name. This allows us to get rid of most isXXX() APIs. It also provides more useful output in some cases. > > There is some minor loss of functionality in some of the CodeBlob subtypes I removed. For example this is what AdapterBlob.getName() looked like (it is now gone): > > > public String getName() { > return "AdapterBlob: " + super.getName(); > } > > > So now we just use the default CodeBlob.getName(), which is what super.getName() would up execute. I think for AdapterBlob this always returns "I2C/C2I adapters", so now you only get this rather than "AdapterBlob: I2C/C2I adapters". 
We have a similar loss of getName() detail with MethodHandlesAdapterBlob (now returns "MethodHandles adapters") and VtableBlob (now returns "vtable chunks"). Basically for these 3 CodeBlob types getName() will no longer include the CodeBlob type. I could special case them in CodeBlob.getName() by fetching the kind to determine what the proper name should be. Let me know if you think it is worth it. Chris Plummer has updated the pull request incrementally with one additional commit since the last revision: Minor improvements. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23684/files - new: https://git.openjdk.org/jdk/pull/23684/files/138fed50..e965c310 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23684&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23684&range=00-01 Stats: 110 lines in 4 files changed: 4 ins; 104 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23684.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23684/head:pull/23684 PR: https://git.openjdk.org/jdk/pull/23684 From cjplummer at openjdk.org Wed Feb 19 06:26:53 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Wed, 19 Feb 2025 06:26:53 GMT Subject: RFR: 8350287: Cleanup SA's support for CodeBlob subclasses [v2] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 04:31:11 GMT, Chris Plummer wrote: >> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/RuntimeStub.java line 47: >> >>> 45: private static void initialize(TypeDataBase db) { >>> 46: Type type = db.lookupType("RuntimeStub"); >>> 47: callerMustGCArgumentsField = type.getCIntegerField("_caller_must_gc_arguments"); >> >> This field is in `CodeBlob` since JDK 23 [JDK-8329433](https://bugs.openjdk.org/browse/JDK-8329433) https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/codeBlob.hpp#L126 > > Ah, right. I forgot you had mentioned that. I think that means I can rid of RuntimeStub and RuntimeBlob once this code is moved to CodeBlob. Ready for another review. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23684#discussion_r1961019487 From epeter at openjdk.org Wed Feb 19 07:19:56 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 07:19:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 19:18:34 GMT, Vladimir Kozlov wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! >> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. 
But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > About actual probability value. I was thinking PROB_LIKELY_MAG(3). PROB_LIKELY_MAG(1) will only guarantee that vectorized loop will be first but it could be enough without moving other loop from hot path. Needs testing. @vnkozlov I suggest that I change the probability to something quite low now, just to make sure that the fast-loop is placed nicely. When I do the experiments for aliasing-analysis runtime-checks, then I will be able to benchmark much better for both cases, since it is much easier to create many different cases. At that point, I could still adapt the probabilities to a different constant. Or maybe I can somehow adjust the probabilities in the chain such that they are balanced. Like if there is 1 condition, give it `0.5`, if there are 2 give them each `sqrt(0.5)`, if there are `n` then `pow(0.5, 1/n)`, so that once you multiply them you get `pow(pow(0.5, 1/n),n) = 0.5`. We could also set another "target" probability than `0.5`. 
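
The balancing idea in the preceding paragraph can be checked numerically. The sketch below is only arithmetic, not HotSpot code: `perCheckProbability` and `combinedProbability` are hypothetical helper names, and the target of `0.5` is the value suggested in the message.

```java
public class CheckProbabilityDemo {
    // Probability to give each of n chained checks so that their product
    // equals the target, i.e. pow(target, 1/n).
    static double perCheckProbability(double target, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("need at least one check");
        }
        return Math.pow(target, 1.0 / n);
    }

    // Probability that all n independent checks pass.
    static double combinedProbability(double perCheck, int n) {
        double product = 1.0;
        for (int i = 0; i < n; i++) {
            product *= perCheck;
        }
        return product;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 4; n++) {
            double p = perCheckProbability(0.5, n);
            System.out.printf("n=%d per-check=%.6f combined=%.6f%n",
                              n, p, combinedProbability(p, n));
        }
    }
}
```

For every `n`, the combined value comes back to the target up to floating-point rounding, which is the property the proposal relies on.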
The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesting benchmark results there. Does that sound ok? > Can we profile alignment in Interpreter (and C1)? It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. What do you think? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2667703955 From rehn at openjdk.org Wed Feb 19 07:26:53 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Wed, 19 Feb 2025 07:26:53 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v5] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 00:48:50 GMT, Fei Yang wrote: >> Yes, you're right, fixed. Thanks! > > Thanks for the update. So now the FP reduce multiply will only apply to the Vector-API use case. Did you check the auto-vectorization use case? I suppose the two tests `ProdRed_Double.java` & `ProdRed_Float.java` which are enabled for riscv64 by this PR won't work with this update. It would be nice to get auto-vector also. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1961096917 From epeter at openjdk.org Wed Feb 19 07:42:52 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 07:42:52 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: Message-ID: > Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. > > **Background** > > With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. > > **Problem** > > So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. > > > MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); > MemorySegment nativeUnaligned = nativeAligned.asSlice(1); > test3(nativeUnaligned); > > > When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! > > static void test3(MemorySegment ms) { > for (int i = 0; i < RANGE; i++) { > long adr = i * 4L; > int v = ms.get(ELEMENT_LAYOUT, adr); > ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); > } > } > > > **Solution: Runtime Checks - Predicate and Multiversioning** > > Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. 
We cannot statically determine if the `base` is aligned, we need a runtime check. > > I came up with 2 options where to place the runtime checks: > - A new "auto vectorization" Parse Predicate: > - This only works when predicates are available. > - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. > - Multiversion the loop: > - Create 2 copies of the loop (fast and slow loops). > - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take > - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code. > - We "stall" the `... Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 63 commits: - Merge branch 'master' into JDK-8323582-SW-native-alignment - remove multiversion mark if we break the structure - register opaque with igvn - copyright and rm CFG check - IR rules for all cases - 3 test versions - test changed to unaligned ints - stub for slicing - add Verify/AlignVector runs to test - refactor verify - ... 
and 53 more: https://git.openjdk.org/jdk/compare/9042aa82...a98ffabf ------------- Changes: https://git.openjdk.org/jdk/pull/22016/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22016&range=01 Stats: 1074 lines in 27 files changed: 951 ins; 28 del; 95 mod Patch: https://git.openjdk.org/jdk/pull/22016.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22016/head:pull/22016 PR: https://git.openjdk.org/jdk/pull/22016 From xgong at openjdk.org Wed Feb 19 07:42:55 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 19 Feb 2025 07:42:55 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:20:54 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
> > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2735: > 2733: ShouldNotReachHere(); > 2734: } > 2735: mulv(dst, size2, index, tmp1); Can we use vector `lsl` instead of `mul` here, so that we can also support D types for NEON/SVE1 ? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1961121512 From bkilambi at openjdk.org Wed Feb 19 08:16:52 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 19 Feb 2025 08:16:52 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 07:40:23 GMT, Xiaohong Gong wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. 
>> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. 
> > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2735: > >> 2733: ShouldNotReachHere(); >> 2734: } >> 2735: mulv(dst, size2, index, tmp1); > > Can we use vector `lsl` instead of `mul` here, so that we can also support D types for NEON/SVE1 ? @XiaohongGong , thanks I'll give it a try and get back. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1961173012 From shade at openjdk.org Wed Feb 19 08:33:54 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 19 Feb 2025 08:33:54 GMT Subject: RFR: 8350210: CTW: Use stackless exceptions In-Reply-To: <9CmFDkEZMlMOZStFRYB-Czn6kpC4TWWgR-z_FGUIIKg=.b9d68ce9-bf31-4150-b86a-dbfd89139384@github.com> References: <9CmFDkEZMlMOZStFRYB-Czn6kpC4TWWgR-z_FGUIIKg=.b9d68ce9-bf31-4150-b86a-dbfd89139384@github.com> Message-ID: On Tue, 18 Feb 2025 09:05:36 GMT, Aleksey Shipilev wrote: > Looking at reducing CTW costs in our infra, there are a few simple improvements we can take. > > CTW runners compiling 3rd party JARs normally catch lots of stray exceptions when trying to load non-existing classes, e.g. for resolving the static final fields, or preloading the constant pool. Generating stack traces for these take considerable time, and stack traces for those exceptions are not essential to debug CTW runs. So, we can summarily disable them. > > This has no effect on `applications/ctw/modules`. Compiling a large 3rd party JAR like `solr-core-7.4.0.jar`, for example, improves from ~15.3s to ~12.5s. Need another Reviewer here; maybe @iwanowww, @chhagedorn, @TobiHartmann? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23671#issuecomment-2667904768 From shade at openjdk.org Wed Feb 19 08:33:54 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 19 Feb 2025 08:33:54 GMT Subject: RFR: 8350159: compiler/tiered/Level2RecompilationTest.java fails after JDK-8349915 In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 17:04:41 GMT, Aleksey Shipilev wrote: > Recently added hunk in `CompilationPolicy::selected_task` was supposed to target CTW runs, that wanted to omit any level changes. But there are tests that _do test_ level changes, and they submit `Whitebox` requests. One of those tests is `compiler/tiered/Level2RecompilationTest.java`. So it looks like we need to disambiguate the "CTW" uses and "general Whitebox" uses. > > Looks like checking for `-Xbatch` does the trick for CTW. It is not super-clean, but it works, and it matches other exceptions around compilation policy, e.g. when checking for `-Xcomp`, etc. > > Additional testing: > - [x] Linux AArch64 server fastdebug, `compiler/tiered/Level2RecompilationTest.java` now passes > - [x] Linux AArch64 server fastdebug, CTW tests still work fine > - [x] Linux AArch64 server fastdebug, `all` Need another Reviewer here; maybe @iwanowww, @chhagedorn, @TobiHartmann? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23668#issuecomment-2667904562 From shade at openjdk.org Wed Feb 19 08:33:54 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 19 Feb 2025 08:33:54 GMT Subject: RFR: 8350211: CTW: Attempt to preload all classes in constant pool In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 11:09:40 GMT, Aleksey Shipilev wrote: > CTW runners do preloading for constant pools ahead of time. I believe this is done to expose more loaded classes to the compilations, so to extend the compilation scope. > > Unfortunately, current code catches the first exception when loading the constant pool and stops preloading.
This routinely happens when CTW runner processes a 3rd party JAR, where dependencies might normally be in other JARs. > > I believe we should attempt to resolve all constant pool entries when preloading is requested. This would likely expand the scope of CTW testing. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` still passes Need another Reviewer here; maybe @vnkozlov, @chhagedorn, @TobiHartmann? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23673#issuecomment-2667905089 From dfenacci at openjdk.org Wed Feb 19 09:08:57 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Wed, 19 Feb 2025 09:08:57 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v4] In-Reply-To: <2HA3PtprmSP9ymdI0ZmaZvbATXe26DpdICQw0LnZvUY=.67910aa3-44fc-420b-a464-8b00f48ed536@github.com> References: <2HA3PtprmSP9ymdI0ZmaZvbATXe26DpdICQw0LnZvUY=.67910aa3-44fc-420b-a464-8b00f48ed536@github.com> Message-ID: On Tue, 18 Feb 2025 12:13:45 GMT, Roberto Castañeda Lozano wrote: >> This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node. This information can be useful e.g. in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). Here is an example of a memory subgraph with the extended "Show types" filter enabled: >> >> ![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717) >> >> #### Testing >> >> - tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). >> >> - Tested IGV manually on a few selected graphs. Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`).
> > Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Remove redundant 'alias_field' property Thanks a lot for the improvement @robcasloz. Like @dlunde I had a few concerns about `dump_spec` being parsed (but testing it a bit didn't seem to reveal any issue). BTW this looks like a good improvement in the direction of making `dump_spec` a bit more understandable (or not needing `dump_spec` in the first place) ------------- Marked as reviewed by dfenacci (Committer). PR Review: https://git.openjdk.org/jdk/pull/23621#pullrequestreview-2626098856 From rcastanedalo at openjdk.org Wed Feb 19 09:20:00 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 19 Feb 2025 09:20:00 GMT Subject: Integrated: 8350006: IGV: show memory slices as type information In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 19:47:25 GMT, Roberto Castañeda Lozano wrote: > This changeset extends the "Show types" filter in IGV to show the memory slice corresponding to each memory node. This information can be useful e.g. in the ongoing investigation of [JDK-8333393](https://bugs.openjdk.org/browse/JDK-8333393). Here is an example of a memory subgraph with the extended "Show types" filter enabled: > > ![example](https://github.com/user-attachments/assets/1810a257-c6d6-4b6e-9638-5bbef1c48717) > > #### Testing > > - tier1 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). > > - Tested IGV manually on a few selected graphs. Tested automatically that displaying thousands of graphs with the "Show types" filter enabled does not trigger any assertion failure (by enabling assertions, instrumenting IGV to display parsed graphs eagerly, and running `java -Xbatch -XX:-TieredCompilation -XX:PrintIdealGraphLevel=3`). This pull request has now been integrated.
Changeset: 0ef1c409 Author: Roberto Castañeda Lozano URL: https://git.openjdk.org/jdk/commit/0ef1c40991e703592fc79325bda1a6d2fc6caf4e Stats: 40 lines in 3 files changed: 33 ins; 1 del; 6 mod 8350006: IGV: show memory slices as type information Reviewed-by: dlunden, chagedorn, dfenacci ------------- PR: https://git.openjdk.org/jdk/pull/23621 From rcastanedalo at openjdk.org Wed Feb 19 09:19:59 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 19 Feb 2025 09:19:59 GMT Subject: RFR: 8350006: IGV: show memory slices as type information [v4] In-Reply-To: References: <2HA3PtprmSP9ymdI0ZmaZvbATXe26DpdICQw0LnZvUY=.67910aa3-44fc-420b-a464-8b00f48ed536@github.com> Message-ID: On Tue, 18 Feb 2025 17:20:17 GMT, Roberto Castañeda Lozano wrote: >> Still good! > >> Still good! > > Thanks, Daniel! > Thanks a lot for the improvement @robcasloz. Like @dlunde I had a few concerns about `dump_spec` being parsed (but testing it a bit didn't seem to reveal any issue). BTW this looks like a good improvement in the direction of making `dump_spec` a bit more understandable (or not needing `dump_spec` in the first place) Thanks, Damon! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23621#issuecomment-2668010340 From rrich at openjdk.org Wed Feb 19 09:25:55 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 19 Feb 2025 09:25:55 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v2] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Mon, 17 Feb 2025 11:27:17 GMT, Richard Reingruber wrote: >>> I think you can make the assertion a little stricter like this [reinrich at 9c3c8a3](https://github.com/reinrich/jdk/commit/9c3c8a33a29b9ae6c4c703992b306dc0cbbcd2f0). >> >> Regarding this stricter version, why are you using is_bottom_frame instead of is_top_frame?
The deoptimization code seems to name the most recent leaf frame "top". That sounds like what frame::top_ijava_frame_abi_size is for too. > >> > I think you can make the assertion a little stricter like this [reinrich at 9c3c8a3](https://github.com/reinrich/jdk/commit/9c3c8a33a29b9ae6c4c703992b306dc0cbbcd2f0). >> >> Regarding this stricter version, why are you using is_bottom_frame instead of is_top_frame? The deoptimization code seems to name the most recent leaf frame "top". That sounds like what frame::top_ijava_frame_abi_size is for too. > > Correct, the top frame has a frame::top_ijava_frame_abi but the assertion is about the abi section in the current frame's caller and the bottom frame's caller also has a top_ijava_frame_abi because i2c doesn't modify it. > > Continue reading if you're interested in more details... > > As said, the i2c adapter does *not* trim the caller frame as the interpreter would, > replacing its large `top_ijava_frame_abi` with a smaller > `parent_ijava_frame_abi`. > > > > Example: compiled frame DEOPTEE is replaced with 3 interpreted frames > > Stack before deoptimization > > | | > | Interpreted CALLER | > | of DEOPTEE frame | > | | > +------------------------+ > | | > | top_ijava_frame_abi | > | | > +========================+ > | | > | Compiled | > | DEOPTEE | > | | > +------------------------+ > | java_abi | > +========================+ > > > Stack when assertion is checked > (i.e. after DEOPTEE was replaced by corresponding inter. frames) > > | | > | Interpreted CALLER | > | of DEOPTEE frame | > | | > +------------------------+ > | | > | top_ijava_frame_abi | <- i2c keeps large abi > | | > +========================+ > | | <- bottom frame > | Interpreted Frame 0 | > | corresp.
to DEOPTEE | > | | > +------------------------+ > | parent_ijava_frame_abi | > +========================+ > | | > | Interpreted Frame 1 | > | (inlined by DEOPTEE) | > | | > +------------------------+ > | parent_ijava_frame_abi | > +========================+ > | | <- top frame > | Interpreted Frame 2 | > | (inlined by DEOPTEE) | > | | > +------------------------+ > | | > | top_ijava_frame_abi | > | | > +========================+ > > Notes: > (referring to the frame sections rather than the C++ types) > > - top_ijava_frame_abi comp... > @reinrich OK, got it! I pushed your change. Thanks. > Could you also comment on if we could use the value of sender_sp here instead? You mean for the calculation of `l2` at L135? sender_sp has room for `Method::max_stack()`. Using it would be less strict. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23557#issuecomment-2668028520 From aph at openjdk.org Wed Feb 19 09:30:55 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 19 Feb 2025 09:30:55 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 03:00:13 GMT, Xiaohong Gong wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation.
>> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > src/hotspot/cpu/aarch64/aarch64_vector.ad line 6748: > >> 6746: %} >> 6747: >> 6748: instruct vselect_from_two_vectors(vReg dst, vReg_V17 src1, vReg_V18 src2, vReg index) %{ > > Could you please add comment before the rule why `v17` and `v18` are used explicitly here? I'm still curious. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1961293400 From bkilambi at openjdk.org Wed Feb 19 09:55:55 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 19 Feb 2025 09:55:55 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 09:28:15 GMT, Andrew Haley wrote: >> src/hotspot/cpu/aarch64/aarch64_vector.ad line 6748: >> >>> 6746: %} >>> 6747: >>> 6748: instruct vselect_from_two_vectors(vReg dst, vReg_V17 src1, vReg_V18 src2, vReg index) %{ >> >> Could you please add comment before the rule why `v17` and `v18` are used explicitly here? > > I'm still curious. Hi @theRealAph , apologies for the late response. The tbl instruction needs both the source registers to be consecutive and I could not find a way to make the register allocator choose two consecutive registers for this operation and decided to hard code them. As v0-v7 are used for function arguments, v8-v15 are non-volatile which are not needed for this purpose (as we don't want to be preserving these values across function calls), I chose two of the volatile registers from v16-v31 for the source registers. Please let me know if this is the right way to approach. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1961337539 From dfenacci at openjdk.org Wed Feb 19 10:21:56 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Wed, 19 Feb 2025 10:21:56 GMT Subject: RFR: 8348645: IGV: visualize live ranges [v3] In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 10:31:26 GMT, Roberto Castañeda Lozano wrote: >> This changeset extends IGV with live range visualization. It introduces live ranges as first-class IGV entities and displays them along with the control-flow graph in the CFG view. Visualizing liveness information should hopefully make C2's register allocator easier to understand, diagnose, debug, and enhance.
>> >> Live ranges are visible in C2 phases where liveness information is available, that is, phases `Initial liveness` to `Fix up spills` at IGV print level 4 or greater. For example, running a debug build of the JVM as follows: >> >> >> java -Xbatch -XX:CompileCommand=IGVPrintLevel,java.util.HashMap::newNode,4 >> >> >> produces the following visualization for the `Initial spilling` phase: >> >> ![initial-spilling](https://github.com/user-attachments/assets/1ecf74f5-92a8-4866-b1ec-2323bb0c428e) >> >> Live ranges are first-class IGV entities, meaning that the user can: >> >> - search, select, and extract them; >> >> ![search-extract](https://github.com/user-attachments/assets/8e0dfa59-457f-49cb-b2b5-1d202301c79d) >> >> - examine their properties in the `Properties` window or via tooltips; >> >> ![properties](https://github.com/user-attachments/assets/68d2d23b-b986-4d2e-835c-b661bce0de23) >> >> - navigate to related IGV entities via a pop-up menu; and >> >> ![popup](https://github.com/user-attachments/assets/21de2fef-d36a-42d5-b828-2696d87a18ea) >> >> - program filters that act on them according to their properties. >> >> ![filters](https://github.com/user-attachments/assets/e993b067-d0b8-452c-a885-c4e601e31e1c) >> >> Live ranges are connected to nodes by a use-def relation: a node can define zero or one live ranges, and use multiple live ranges; a live range can be defined and used by multiple nodes. Consequently, a live range in IGV is visible if and only if all its related nodes are visible (fully or semi-transparently). Generally, the start and end of a live range are vertically aligned with the nodes that first define and last use the live range. To reflect accurately the semantics of Phi nodes w.r.t. liveness, the visualization treats live ranges related by Phi nodes specially: live ranges used by a Phi node end at the bottom of the corresponding predecessor basic blocks, whereas live ranges defined by a Phi node start at the top of the node's basic block.
The following screenshot shows an example of a Phi node (`48 Phi`) joining live ranges `L8` and `L13` into `L15`: >> >> ![phi](https://github.com/user-attachments/assets/0ef8aa1d-523d-4391-982e-6b74c2016a3c... > > Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Remove unnecessary whitespace Thanks for the fix Roberto. I noticed that the live ranges are not saved when saving the graph into an xml file (`LIVE_RANGES_ELEMENT` and related tags don't seem to be exported in `Printer.java`). Is this perhaps something you did intentionally (maybe to be added in the future)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23558#issuecomment-2668185963 From chagedorn at openjdk.org Wed Feb 19 10:34:53 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Feb 2025 10:34:53 GMT Subject: RFR: 8350211: CTW: Attempt to preload all classes in constant pool In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 11:09:40 GMT, Aleksey Shipilev wrote: > CTW runners do preloading for constant pools ahead of time. I believe this is done to expose more loaded classes to the compilations, so to extend the compilation scope. > > Unfortunately, current code catches the first exception when loading the constant pool and stops preloading. This routinely happens when CTW runner processes a 3rd party JAR, where dependencies might normally be in other JARs. > > I believe we should attempt to resolve all constant pool entries when preloading is requested. This would likely expand the scope of CTW testing. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` still passes Looks good to me, too! ------------- Marked as reviewed by chagedorn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23673#pullrequestreview-2626339929 From shade at openjdk.org Wed Feb 19 11:04:57 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 19 Feb 2025 11:04:57 GMT Subject: RFR: 8350211: CTW: Attempt to preload all classes in constant pool In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 11:09:40 GMT, Aleksey Shipilev wrote: > CTW runners do preloading for constant pools ahead of time. I believe this is done to expose more loaded classes to the compilations, so to extend the compilation scope. > > Unfortunately, current code catches the first exception when loading the constant pool and stops preloading. This routinely happens when CTW runner processes a 3rd party JAR, where dependencies might normally be in other JARs. > > I believe we should attempt to resolve all constant pool entries when preloading is requested. This would likely expand the scope of CTW testing. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` still passes Thank you! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23673#issuecomment-2668300914 From shade at openjdk.org Wed Feb 19 11:04:58 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 19 Feb 2025 11:04:58 GMT Subject: Integrated: 8350211: CTW: Attempt to preload all classes in constant pool In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 11:09:40 GMT, Aleksey Shipilev wrote: > CTW runners do preloading for constant pools ahead of time. I believe this is done to expose more loaded classes to the compilations, so to extend the compilation scope. > > Unfortunately, current code catches the first exception when loading the constant pool and stops preloading. This routinely happens when CTW runner processes a 3rd party JAR, where dependencies might normally be in other JARs. > > I believe we should attempt to resolve all constant pool entries when preloading is requested. 
This would likely expand the scope of CTW testing. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` still passes This pull request has now been integrated. Changeset: d13fd573 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/d13fd5738f8a3d4b4009c2e15cfd967332d97bbd Stats: 12 lines in 1 file changed: 4 ins; 5 del; 3 mod 8350211: CTW: Attempt to preload all classes in constant pool Reviewed-by: vlivanov, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/23673 From chagedorn at openjdk.org Wed Feb 19 11:29:53 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Feb 2025 11:29:53 GMT Subject: RFR: 8350159: compiler/tiered/Level2RecompilationTest.java fails after JDK-8349915 In-Reply-To: References: Message-ID: <7U9HFLFIsCGzILlHSGe1O_Qe21SIlyiKRwBpx8ZcSDw=.5ef80c59-39b6-4c22-8ceb-1b0c42a6bd9a@github.com> On Mon, 17 Feb 2025 17:04:41 GMT, Aleksey Shipilev wrote: > Recently added hunk in `CompilationPolicy::selected_task` was supposed to target CTW runs, that wanted to omit any level changes. But there are tests that _do test_ level changes, and they submit `Whitebox` requests. One of those tests is `compiler/tiered/Level2RecompilationTest.java`. So it looks like we need to disambiguate the "CTW" uses and "general Whitebox" uses. > > Looks like checking for `-Xbatch` does the trick for CTW. It is not super-clean, but it works, and it matches other exceptions in around compilation policy, e.g. when checking for `-Xcomp`, etc. > > Additional testing: > - [x] Linux AArch64 server fastdebug, `compiler/tiered/Level2RecompilationTest.java` now passes > - [x] Linux AArch64 server fastdebug, CTW tests still work fine > - [x] Linux AArch64 server fastdebug, `all` Looks good to me, too! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23668#pullrequestreview-2626503209 From chagedorn at openjdk.org Wed Feb 19 11:31:53 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 19 Feb 2025 11:31:53 GMT Subject: RFR: 8350210: CTW: Use stackless exceptions In-Reply-To: <9CmFDkEZMlMOZStFRYB-Czn6kpC4TWWgR-z_FGUIIKg=.b9d68ce9-bf31-4150-b86a-dbfd89139384@github.com> References: <9CmFDkEZMlMOZStFRYB-Czn6kpC4TWWgR-z_FGUIIKg=.b9d68ce9-bf31-4150-b86a-dbfd89139384@github.com> Message-ID: <3vYf99C--1wwWp8K5WmWTuOObAUV0wk1Vx_0MGg1ZeY=.0b08ea8c-a5ea-4b7b-a198-e09a9b5fe7e5@github.com> On Tue, 18 Feb 2025 09:05:36 GMT, Aleksey Shipilev wrote: > Looking at reducing CTW costs in our infra, there are a few simple improvements we can take. > > CTW runners compiling 3rd party JARs normally catch lots of stray exceptions when trying to load non-existing classes, e.g. for resolving the static final fields, or preloading the constant pool. Generating stack traces for these take considerable time, and stack traces for those exceptions are not essential to debug CTW runs. So, we can summarily disable them. > > This has no effect on `applications/ctw/modules`. Compiling a large 3rd party JAR like `solr-core-7.4.0.jar`, for example, improves from ~15.3s to ~12.5s. That's a good idea! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23671#pullrequestreview-2626507167 From shade at openjdk.org Wed Feb 19 11:36:59 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 19 Feb 2025 11:36:59 GMT Subject: RFR: 8350159: compiler/tiered/Level2RecompilationTest.java fails after JDK-8349915 In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 17:04:41 GMT, Aleksey Shipilev wrote: > Recently added hunk in `CompilationPolicy::selected_task` was supposed to target CTW runs, that wanted to omit any level changes. But there are tests that _do test_ level changes, and they submit `Whitebox` requests. 
One of those tests is `compiler/tiered/Level2RecompilationTest.java`. So it looks like we need to disambiguate the "CTW" uses and "general Whitebox" uses. > > Looks like checking for `-Xbatch` does the trick for CTW. It is not super-clean, but it works, and it matches other exceptions in around compilation policy, e.g. when checking for `-Xcomp`, etc. > > Additional testing: > - [x] Linux AArch64 server fastdebug, `compiler/tiered/Level2RecompilationTest.java` now passes > - [x] Linux AArch64 server fastdebug, CTW tests still work fine > - [x] Linux AArch64 server fastdebug, `all` Thanks! Let's see if there are more surprises... ------------- PR Comment: https://git.openjdk.org/jdk/pull/23668#issuecomment-2668374411 From shade at openjdk.org Wed Feb 19 11:36:59 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 19 Feb 2025 11:36:59 GMT Subject: Integrated: 8350159: compiler/tiered/Level2RecompilationTest.java fails after JDK-8349915 In-Reply-To: References: Message-ID: <4acc4aoz0XK8NO8FlUWNlAYsUDKTbh7ULo_cukWbv24=.9c288d74-fc2b-49be-8f6e-32f14d44ef09@github.com> On Mon, 17 Feb 2025 17:04:41 GMT, Aleksey Shipilev wrote: > Recently added hunk in `CompilationPolicy::selected_task` was supposed to target CTW runs, that wanted to omit any level changes. But there are tests that _do test_ level changes, and they submit `Whitebox` requests. One of those tests is `compiler/tiered/Level2RecompilationTest.java`. So it looks like we need to disambiguate the "CTW" uses and "general Whitebox" uses. > > Looks like checking for `-Xbatch` does the trick for CTW. It is not super-clean, but it works, and it matches other exceptions in around compilation policy, e.g. when checking for `-Xcomp`, etc. > > Additional testing: > - [x] Linux AArch64 server fastdebug, `compiler/tiered/Level2RecompilationTest.java` now passes > - [x] Linux AArch64 server fastdebug, CTW tests still work fine > - [x] Linux AArch64 server fastdebug, `all` This pull request has now been integrated. 
Changeset: 79db2d41 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/79db2d4186eb2af827295581464be8602ac95f98 Stats: 5 lines in 2 files changed: 0 ins; 2 del; 3 mod 8350159: compiler/tiered/Level2RecompilationTest.java fails after JDK-8349915 Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/23668 From shade at openjdk.org Wed Feb 19 11:37:56 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 19 Feb 2025 11:37:56 GMT Subject: RFR: 8350210: CTW: Use stackless exceptions In-Reply-To: <9CmFDkEZMlMOZStFRYB-Czn6kpC4TWWgR-z_FGUIIKg=.b9d68ce9-bf31-4150-b86a-dbfd89139384@github.com> References: <9CmFDkEZMlMOZStFRYB-Czn6kpC4TWWgR-z_FGUIIKg=.b9d68ce9-bf31-4150-b86a-dbfd89139384@github.com> Message-ID: On Tue, 18 Feb 2025 09:05:36 GMT, Aleksey Shipilev wrote: > Looking at reducing CTW costs in our infra, there are a few simple improvements we can take. > > CTW runners compiling 3rd party JARs normally catch lots of stray exceptions when trying to load non-existing classes, e.g. for resolving the static final fields, or preloading the constant pool. Generating stack traces for these take considerable time, and stack traces for those exceptions are not essential to debug CTW runs. So, we can summarily disable them. > > This has no effect on `applications/ctw/modules`. Compiling a large 3rd party JAR like `solr-core-7.4.0.jar`, for example, improves from ~15.3s to ~12.5s. Thanks! 
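As a hedged illustration of why stackless exceptions are cheaper (a sketch of the general idea, not the patch itself): the Java code below uses the protected four-argument `RuntimeException` constructor with `writableStackTrace = false`, so no stack trace is captured when the exception is created. The names `StacklessDemo` and `StacklessException` are hypothetical. HotSpot also offers the `-XX:-StackTraceInThrowable` flag to turn off stack trace collection VM-wide; whether the CTW change relies on that flag is not stated in this thread.

```java
class StacklessDemo {
    // Hypothetical exception type that skips stack trace capture.
    static final class StacklessException extends RuntimeException {
        StacklessException(String message) {
            // cause = null, enableSuppression = false, writableStackTrace = false
            super(message, null, false, false);
        }
    }

    // Number of frames recorded for an ordinary exception.
    static int regularTraceLength() {
        return new RuntimeException("regular").getStackTrace().length;
    }

    // Number of frames recorded for the stackless variant.
    static int stacklessTraceLength() {
        return new StacklessException("stackless").getStackTrace().length;
    }

    public static void main(String[] args) {
        System.out.println("regular frames:   " + regularTraceLength());
        System.out.println("stackless frames: " + stacklessTraceLength());
    }
}
```

The stackless variant avoids walking the stack at construction time, which is where the cost comes from when many exceptions are thrown and then discarded.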
------------- PR Comment: https://git.openjdk.org/jdk/pull/23671#issuecomment-2668375659 From shade at openjdk.org Wed Feb 19 11:37:56 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 19 Feb 2025 11:37:56 GMT Subject: Integrated: 8350210: CTW: Use stackless exceptions In-Reply-To: <9CmFDkEZMlMOZStFRYB-Czn6kpC4TWWgR-z_FGUIIKg=.b9d68ce9-bf31-4150-b86a-dbfd89139384@github.com> References: <9CmFDkEZMlMOZStFRYB-Czn6kpC4TWWgR-z_FGUIIKg=.b9d68ce9-bf31-4150-b86a-dbfd89139384@github.com> Message-ID: On Tue, 18 Feb 2025 09:05:36 GMT, Aleksey Shipilev wrote: > Looking at reducing CTW costs in our infra, there are a few simple improvements we can take. > > CTW runners compiling 3rd party JARs normally catch lots of stray exceptions when trying to load non-existing classes, e.g. for resolving the static final fields, or preloading the constant pool. Generating stack traces for these take considerable time, and stack traces for those exceptions are not essential to debug CTW runs. So, we can summarily disable them. > > This has no effect on `applications/ctw/modules`. Compiling a large 3rd party JAR like `solr-core-7.4.0.jar`, for example, improves from ~15.3s to ~12.5s. This pull request has now been integrated. 
Changeset: 2353f3e2 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/2353f3e2f18ccaa972ee7a292d5a45035c647881 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod 8350210: CTW: Use stackless exceptions Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/23671 From jbhateja at openjdk.org Wed Feb 19 11:46:00 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 19 Feb 2025 11:46:00 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v18] In-Reply-To: References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> <90MwDac7Q83dK8KDagHOst15xV-quGZKVE8n2tP9dsk=.351ed042-9a69-4186-b134-8c3cb6fef6cd@github.com> Message-ID: <8RKUecB8fTeQDC2-m6lzhoFEM58_RKohrp7QQ-a-dQ8=.51dc4b00-c30c-4e2d-b2cb-0d0e960a1bca@github.com> On Thu, 13 Feb 2025 09:23:54 GMT, Emanuel Peter wrote: >> Hi @eme64, all comments addressed, looking forward to your approval > > @jatin-bhateja Perfect, it looks good now. Let me run testing one more time just to be sure. Please ping me in a day or so for the results! Hi @eme64, let us know if it's good to land. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22863#issuecomment-2668396938 From dfenacci at openjdk.org Wed Feb 19 12:06:53 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Wed, 19 Feb 2025 12:06:53 GMT Subject: RFR: 8348645: IGV: visualize live ranges In-Reply-To: <8ViA6x7l9mMjBEEfKR3LSICyAh4AANl_mq6wP5TEt9Y=.33b9b461-814f-48df-97da-214d0d44e4c3@github.com> References: <8ViA6x7l9mMjBEEfKR3LSICyAh4AANl_mq6wP5TEt9Y=.33b9b461-814f-48df-97da-214d0d44e4c3@github.com> Message-ID: On Tue, 18 Feb 2025 08:52:56 GMT, Roberto Castañeda Lozano wrote: >>> @robcasloz thanks a lot for this amazing improvement!
>>> >>> Just a quick one: I noticed that, with your `java -Xbatch -XX:CompileCommand=IGVPrintLevel,java.util.HashMap::newNode,4` example with the `Initial spilling` phase selected, if I click on the live range icon to make it disappear, I get a `NullPointerException` (no further details apart from the exception) >> >> Thanks for the report Damon, will investigate! > >> Thanks for the report Damon, will investigate! > > Commit 00169223 should fix the issue, thanks again. @robcasloz, I was a bit puzzled by live ranges with Phi nodes but then I noticed that in the description you mention that they are treated somewhat in a special way: > To reflect accurately the semantics of Phi nodes w.r.t. liveness, the visualization treats live ranges related by Phi nodes specially: live ranges used by a Phi node end at the bottom of the corresponding predecessor basic blocks, whereas live ranges defined by a Phi node start at the top of the node's basic block. I thought that variables that are joined by the Phi node are still live at the Phi node. Is this not the case? Or possibly you meant that it is better not to consider them live there (e.g. to reduce the number of live ranges in the block with the Phi node)? Irrespective of that, would it be feasible to add a "termination dash" at the bottom of the line (e.g. at the bottom of `L8`)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23558#issuecomment-2668456896 From roland at openjdk.org Wed Feb 19 12:14:56 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 19 Feb 2025 12:14:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> Message-ID: <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> On Tue, 18 Feb 2025 17:20:23 GMT, Emanuel Peter wrote: > Right.
I suppose code size might be slightly affected. But I only multi-version if we are already going to pre-main-post the loop. And that means that the loop is already copied 3x, and doing 4x is not that noticeable I would suspect. Wouldn't usual optimizations be applied to the slow loop as well (pre/main/post, unrolling)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668476997 From duke at openjdk.org Wed Feb 19 12:52:57 2025 From: duke at openjdk.org (Marc Chevalier) Date: Wed, 19 Feb 2025 12:52:57 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed Message-ID: Remove frem and drem macro nodes when the result is not used. These nodes have other outputs (such as memory) that are not meaningful but prevent the nodes from being dropped easily. This patch removes the useless frem/drem nodes by rewiring their inputs to their outputs. Thanks, Marc ------------- Commit messages: - fix test - i - format - reformat - comment - fix test - Remove frem/drem in Ideal Changes: https://git.openjdk.org/jdk/pull/23694/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23694&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349523 Stats: 72 lines in 4 files changed: 70 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23694.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23694/head:pull/23694 PR: https://git.openjdk.org/jdk/pull/23694 From epeter at openjdk.org Wed Feb 19 13:08:56 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 13:08:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19
Feb 2025 12:12:27 GMT, Roland Westrelin wrote: > > Right. I suppose code size might be slightly affected. But I only multi-version if we are already going to pre-main-post the loop. And that means that the loop is already copied 3x, and doing 4x is not that noticeable I would suspect. > > Wouldn't usual optimizations be applied to the slow loop as well (pre/main/post, unrolling)? That is what I'm avoiding by `stalling` the slow-loop ;) I only `un-stall` the slow-loop if we actually add a check to the multiversion-if, and at that point we do care about the slow-loop. Does that make sense? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668601537 From roland at openjdk.org Wed Feb 19 13:20:56 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 19 Feb 2025 13:20:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19 Feb 2025 13:06:02 GMT, Emanuel Peter wrote: > That is what I'm avoiding by `stalling` the slow-loop ;) I only `un-stall` the slow-loop if we actually add a check to the multiversion-if, and at that point we do care about the slow-loop. So if the slow loop is kept, it's fully optimized (other than what misaligned accesses prevent)?
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668625485 From epeter at openjdk.org Wed Feb 19 13:20:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 13:20:57 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19 Feb 2025 13:15:46 GMT, Roland Westrelin wrote: > > That is what I'm avoiding by `stalling` the slow-loop ;) I only `un-stall` the slow-loop if we actually add a check to the multiversion-if, and at that point we do care about the slow-loop. > > So if the slow loop is kept, it's fully optimized (other than what misaligned accesses prevent)? Exactly. In a sense that would give you similar results as with unswitching, where we also possibly optimize both branches / loops. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668632094 From mdoerr at openjdk.org Wed Feb 19 13:25:09 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 19 Feb 2025 13:25:09 GMT Subject: RFR: 8349727: [PPC] C1: Improve Class.isInstance intrinsic [v3] In-Reply-To: References: Message-ID: > PPC64 implementation of [JDK-8337251](https://bugs.openjdk.org/browse/JDK-8337251). > The new runtime stub is called like a C function. The initial version therefore used a `FunctionDescriptor` with relocation on PPC64 with ABIv1. I've changed that with the 3rd Commit. `rt_call` jumps directly to the entry point, now. > > Performance measured on Power10: `make run-test TEST="micro:SecondarySupersLookup" MICRO="VM_OPTIONS=-XX:TieredStopAtLevel=1"` > > Before this patch (C code) > > Benchmark Mode Cnt Score Error Units > SecondarySupersLookup.testNegative00 avgt 15 18.570 ± 0.009 ns/op > ...
> SecondarySupersLookup.testNegative30 avgt 15 18.566 ± 0.002 ns/op > SecondarySupersLookup.testNegative32 avgt 15 19.177 ± 1.347 ns/op > SecondarySupersLookup.testNegative40 avgt 15 18.569 ± 0.006 ns/op > SecondarySupersLookup.testNegative50 avgt 15 19.207 ± 1.334 ns/op > SecondarySupersLookup.testNegative55 avgt 15 19.708 ± 1.338 ns/op > SecondarySupersLookup.testNegative56 avgt 15 19.132 ± 0.137 ns/op > SecondarySupersLookup.testNegative57 avgt 15 19.133 ± 0.134 ns/op > SecondarySupersLookup.testNegative58 avgt 15 19.772 ± 1.316 ns/op > SecondarySupersLookup.testNegative59 avgt 15 19.109 ± 0.014 ns/op > SecondarySupersLookup.testNegative60 avgt 15 22.381 ± 0.016 ns/op > SecondarySupersLookup.testNegative61 avgt 15 22.331 ± 0.011 ns/op > SecondarySupersLookup.testNegative62 avgt 15 22.352 ± 0.029 ns/op > SecondarySupersLookup.testNegative63 avgt 15 30.371 ± 0.031 ns/op > SecondarySupersLookup.testNegative64 avgt 15 29.927 ± 0.221 ns/op > SecondarySupersLookup.testPositive01 avgt 15 18.571 ± 0.006 ns/op > ... > SecondarySupersLookup.testPositive09 avgt 15 18.599 ± 0.140 ns/op > SecondarySupersLookup.testPositive10 avgt 15 19.210 ± 1.332 ns/op > SecondarySupersLookup.testPositive16 avgt 15 18.603 ± 0.142 ns/op > SecondarySupersLookup.testPositive20 avgt 15 19.210 ± 1.333 ns/op > SecondarySupersLookup.testPositive30 avgt 15 18.600 ± 0.140 ns/op > SecondarySupersLookup.testPositive32 avgt 15 18.637 ± 0.189 ns/op > SecondarySupersLookup.testPositive40 avgt 15 19.137 ± 0.190 ns/op > SecondarySupersLookup.testPositive50 avgt 15 18.567 ± 0.002 ns/op > SecondarySupersLookup.testPositive60 avgt 15 19.069 ± 0.004 ns/op > SecondarySupersLookup.testPositive63 avgt 15 26.024 ± 0.017 ns/op > SecondarySupersLookup.testPositive64 avgt 15 29.932 ± 1.002 ns/op > > > After this patch (assemble... Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Unroll repne_scan loop for better performance.
------------- Changes: - all: https://git.openjdk.org/jdk/pull/23602/files - new: https://git.openjdk.org/jdk/pull/23602/files/84c7aacc..ffd2b2a9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23602&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23602&range=01-02 Stats: 24 lines in 1 file changed: 9 ins; 8 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/23602.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23602/head:pull/23602 PR: https://git.openjdk.org/jdk/pull/23602 From roland at openjdk.org Wed Feb 19 13:28:55 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 19 Feb 2025 13:28:55 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19 Feb 2025 13:18:18 GMT, Emanuel Peter wrote: > > > That is what I'm avoiding by `stalling` the slow-loop ;) I only `un-stall` the slow-loop if we actually add a check to the multiversion-if, and at that point we do care about the slow-loop. > > > > > > So if the slow loop is kept, it's fully optimized (other than what misaligned accesses prevent)? > > Exactly. In a sense that would give you similar results as with unswitching, where we also possibly optimize both branches / loops. So the overhead in the final code is 2x: we can expect the fast and slow paths to be about the same size so the section of code for the loop would see its size grow by 2x.
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668653066 From mli at openjdk.org Wed Feb 19 14:02:55 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 19 Feb 2025 14:02:55 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v5] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 00:48:50 GMT, Fei Yang wrote: >> Yes, you're right, fixed. Thanks! > > Thanks for the update. So now the FP reduce multiply will only apply to the Vector-API use case. Did you check the auto-vectorization use case? I suppose the two tests `ProdRed_Double.java` & `ProdRed_Float.java` which are enabled for riscv64 by this PR won't work with this update. @RealFYang I think in this PR I need to remove the Reduction intrinsics for V & D, as it seems to me there is no way to supply an intrinsic for the unordered one but not for the ordered one; doing so leads to an AD file error. Or can you suggest how to do it? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1961733548 From mli at openjdk.org Wed Feb 19 14:02:55 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 19 Feb 2025 14:02:55 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v5] In-Reply-To: References: Message-ID: <1CEg8-e6CM-QU3IGymCVJu3AuhMcMjK80m40a5gA16Y=.eb107a1f-b4eb-4eea-bded-a258e4528fda@github.com> On Wed, 19 Feb 2025 07:24:31 GMT, Robbin Ehn wrote: >> Thanks for the update. So now the FP reduce multiply will only apply to the Vector-API use case. Did you check the auto-vectorization use case? I suppose the two tests `ProdRed_Double.java` & `ProdRed_Float.java` which are enabled for riscv64 by this PR won't work with this update. > > It would be nice to get auto-vector also. @robehn Yes, it would be, but currently there is no easy way to do it. Maybe we can figure out how to do it in the future, but it would not be in this PR.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1961734922 From rcastanedalo at openjdk.org Wed Feb 19 15:17:54 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 19 Feb 2025 15:17:54 GMT Subject: RFR: 8348645: IGV: visualize live ranges [v3] In-Reply-To: References: Message-ID: <2qylkM_X00_j1T2n41H0Cpuu9G05mCS3CnJAbygO-x8=.19332e68-be26-4dbd-9dcc-5a69d4054284@github.com> On Wed, 19 Feb 2025 10:19:33 GMT, Damon Fenacci wrote: > I noticed that the live ranges are not saved when saving the graph into an xml file (`LIVE_RANGES_ELEMENT` and related tags don't seem to be exported in `Printer.java`). Is this perhaps something you did intentionally (maybe to be added in the future)? Good catch, thanks! No, I just overlooked this use case. Will fix. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23558#issuecomment-2668952783 From dlunden at openjdk.org Wed Feb 19 15:23:11 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Wed, 19 Feb 2025 15:23:11 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v2] In-Reply-To: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> References: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> Message-ID: > When searching for load anti-dependences in GCM, the memory state for the load is sometimes represented not only by the memory node input of the load, but also other memory nodes. Because PhaseCFG::insert_anti_dependences searches for anti-dependences only from the load's memory input, it is, therefore, possible to sometimes overlook anti-dependences. The result is that loads are potentially scheduled too late, after stores that redefine the memory states of the loads.
> > ### Changeset > > It is not yet clear why multiple nodes sometimes represent the memory state of a load, nor if this is expected. We can, however, resolve all the miscompiled test cases seen in this issue by improving the idealization of Phi nodes. Specifically, there is an idealization where we split Phis through input MergeMems, that we, prior to this changeset, applied too conservatively. > > To illustrate the idealization and how it resolves this issue, consider the example below. > > ![failure-graph-1](https://github.com/user-attachments/assets/ecbd204f-bdf0-49cb-a62e-8081d08cfe0c) > > `64 membar_release` is a critical anti-dependence for `183 loadI`. The anti-dependence search starts at the load's direct memory input, `107 Phi`, and stops immediately at Phis. Therefore, the search ends at `106 Phi` and we never find `64 membar_release`. > > We can apply the split-through-MergeMem Phi idealization to `119 Phi`. This idealization pushes `119 Phi` through `120 MergeMem` and `121 MergeMem`, splitting it into the individual inputs of the MergeMems in the process. As a result, we replace `119 Phi` with two new Phis. One of these generated Phis has identical inputs to `107 Phi` (`106 Phi` and `104 Phi`), and further idealizations will merge this new Phi and `107 Phi`. As a result, `107 Phi` then has a Phi-free path to `64 membar_release` and we correctly discover the anti-dependence. > > The changeset consists of the following changes. > - Add an analysis that allows applying the split-through-MergeMem idealization in more cases than before (including in the above example) while still ensuring termination. > - Add a missing `ResourceMark` in `PhiNode::split_out_instance`. > - Add multiple new regression tests in `TestGCMLoadPlacement.java`. > > For reference, [here](https://github.com/openjdk/jdk/pull/22852) is a previous PR with an alternative fix that we decided to discard in favor of the fix in this PR. 
> > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/13394882532) > - `tier1` to `tier4` (an... Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Remove test that no longer reproduces the issue ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23691/files - new: https://git.openjdk.org/jdk/pull/23691/files/893bfb30..7f702a68 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23691&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23691&range=00-01 Stats: 40 lines in 1 file changed: 0 ins; 34 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/23691.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23691/head:pull/23691 PR: https://git.openjdk.org/jdk/pull/23691 From epeter at openjdk.org Wed Feb 19 15:25:56 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 15:25:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19 Feb 2025 13:26:37 GMT, Roland Westrelin wrote: > So the overhead in the final code is 2x: we can expect the fast and slow paths to be about the same size so the section of code for the loop would see its size grow by 2x. Yes, if you get to the point where you add a multi-version-if condition, i.e. where SuperWord has decided it needs a speculative assumption (here for alignment, later for aliasing), then we get the whole loop 2x. I suppose we could try to make the pre-main-post loop more complicated and just multi-version the main-loop, but that sounds much more complicated. Do you see any better way than having the 2x code size if we need both a slow and fast loop?
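To make the fast/slow trade-off concrete, here is a source-level Java sketch of loop multi-versioning under stated assumptions: C2 performs this transformation on its IR rather than in Java source, and the names `MultiVersionSketch`, `ALIGNMENT_BYTES`, `sumFast`, and `sumSlow` are hypothetical. A runtime check on a simulated base address selects either the version that may rely on the alignment assumption or the fallback. Both loop bodies are present in the final code, which is the 2x code size discussed in this thread.

```java
class MultiVersionSketch {
    // Assumed alignment requirement, in bytes (hypothetical value).
    static final long ALIGNMENT_BYTES = 8;

    // The speculative assumption: the base address is aligned.
    static boolean isAligned(long baseAddress) {
        return (baseAddress & (ALIGNMENT_BYTES - 1)) == 0;
    }

    // Stand-in for the fast loop: may rely on the alignment assumption.
    // Here it is simply unrolled by four, followed by a scalar tail.
    static long sumFast(int[] data) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        int limit = data.length - (data.length % 4);
        for (; i < limit; i += 4) {
            s0 += data[i];
            s1 += data[i + 1];
            s2 += data[i + 2];
            s3 += data[i + 3];
        }
        long sum = s0 + s1 + s2 + s3;
        for (; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Stand-in for the slow loop: makes no alignment assumption.
    static long sumSlow(int[] data) {
        long sum = 0;
        for (int value : data) {
            sum += value;
        }
        return sum;
    }

    // The "multiversion-if": one runtime check picks the loop version.
    static long sum(long baseAddress, int[] data) {
        return isAligned(baseAddress) ? sumFast(data) : sumSlow(data);
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(sum(16, data)); // aligned address: fast version
        System.out.println(sum(12, data)); // misaligned address: slow version
    }
}
```

Both versions must compute the same result; only the assumptions available to the optimizer differ, which is why the check can be added speculatively and only when needed.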
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668974247 From rrich at openjdk.org Wed Feb 19 15:26:53 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 19 Feb 2025 15:26:53 GMT Subject: RFR: 8349727: [PPC] C1: Improve Class.isInstance intrinsic [v3] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 13:25:09 GMT, Martin Doerr wrote: >> PPC64 implementation of [JDK-8337251](https://bugs.openjdk.org/browse/JDK-8337251). >> The new runtime stub is called like a C function. The initial version therefore used a `FunctionDescriptor` with relocation on PPC64 with ABIv1. I've changed that with the 3rd Commit. `rt_call` jumps directly to the entry point, now. >> >> Performance measured on Power10: `make run-test TEST="micro:SecondarySupersLookup" MICRO="VM_OPTIONS=-XX:TieredStopAtLevel=1"` >> >> Before this patch (C code) >> >> Benchmark Mode Cnt Score Error Units >> SecondarySupersLookup.testNegative00 avgt 15 18.570 ± 0.009 ns/op >> ... >> SecondarySupersLookup.testNegative30 avgt 15 18.566 ± 0.002 ns/op >> SecondarySupersLookup.testNegative32 avgt 15 19.177 ± 1.347 ns/op >> SecondarySupersLookup.testNegative40 avgt 15 18.569 ± 0.006 ns/op >> SecondarySupersLookup.testNegative50 avgt 15 19.207 ± 1.334 ns/op >> SecondarySupersLookup.testNegative55 avgt 15 19.708 ± 1.338 ns/op >> SecondarySupersLookup.testNegative56 avgt 15 19.132 ± 0.137 ns/op >> SecondarySupersLookup.testNegative57 avgt 15 19.133 ± 0.134 ns/op >> SecondarySupersLookup.testNegative58 avgt 15 19.772 ± 1.316 ns/op >> SecondarySupersLookup.testNegative59 avgt 15 19.109 ± 0.014 ns/op >> SecondarySupersLookup.testNegative60 avgt 15 22.381 ± 0.016 ns/op >> SecondarySupersLookup.testNegative61 avgt 15 22.331 ± 0.011 ns/op >> SecondarySupersLookup.testNegative62 avgt 15 22.352 ± 0.029 ns/op >> SecondarySupersLookup.testNegative63 avgt 15 30.371 ± 0.031 ns/op >> SecondarySupersLookup.testNegative64 avgt 15 29.927 ±
0.221 ns/op >> SecondarySupersLookup.testPositive01 avgt 15 18.571 ± 0.006 ns/op >> ... >> SecondarySupersLookup.testPositive09 avgt 15 18.599 ± 0.140 ns/op >> SecondarySupersLookup.testPositive10 avgt 15 19.210 ± 1.332 ns/op >> SecondarySupersLookup.testPositive16 avgt 15 18.603 ± 0.142 ns/op >> SecondarySupersLookup.testPositive20 avgt 15 19.210 ± 1.333 ns/op >> SecondarySupersLookup.testPositive30 avgt 15 18.600 ± 0.140 ns/op >> SecondarySupersLookup.testPositive32 avgt 15 18.637 ± 0.189 ns/op >> SecondarySupersLookup.testPositive40 avgt 15 19.137 ± 0.190 ns/op >> SecondarySupersLookup.testPositive50 avgt 15 18.567 ± 0.002 ns/op >> SecondarySupersLookup.testPositive60 avgt 15 19.069 ± 0.004 ns/op >> SecondarySupersLookup.testPositive63 avgt 15 26.024 ± 0.017 ns/op >> SecondarySupersLookup.tes... > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Unroll repne_scan loop for better performance. src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp line 2845: > 2843: if (dest == Runtime1::entry_for(C1StubId::register_finalizer_id) || > 2844: dest == Runtime1::entry_for(C1StubId::new_multi_array_id ) || > 2845: dest == Runtime1::entry_for(C1StubId::is_instance_of_id )) { Should there be an assertion that `dest` is in the CodeCache? Or even use that check as condition to emit the optimized call? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23602#discussion_r1961888058 From rcastanedalo at openjdk.org Wed Feb 19 15:27:55 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 19 Feb 2025 15:27:55 GMT Subject: RFR: 8348645: IGV: visualize live ranges In-Reply-To: References: <8ViA6x7l9mMjBEEfKR3LSICyAh4AANl_mq6wP5TEt9Y=.33b9b461-814f-48df-97da-214d0d44e4c3@github.com> Message-ID: On Wed, 19 Feb 2025 12:03:49 GMT, Damon Fenacci wrote: > I thought that variables that are joined by the Phi node are still live at the Phi node. Is this not the case?
No, the usual "multiplex-like" liveness semantics for Phi instructions is to consider the joined variables live-out of their corresponding predecessor blocks and the resulting variable live-in in its block (and defined in parallel with other Phi definitions in the block), see e.g. Definition 4 in Ch. 21.2 in [the SSA book draft](https://pfalcon.github.io/ssabook/latest/book-full.pdf). This is also in line with [C2's handling of Phi nodes in liveness analysis](https://github.com/openjdk/jdk/blob/efbad00c4d7931177ccc5e9bce3b30dfbac94010/src/hotspot/share/opto/live.cpp#L128-L147). > Irrespective of that, would it be feasible to add a "termination dash" at the bottom of the line (e.g. at the bottom of `L8`)? Yes, that is a good idea, will do, thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23558#issuecomment-2668981015 From kvn at openjdk.org Wed Feb 19 15:50:53 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Feb 2025 15:50:53 GMT Subject: RFR: 8350287: Cleanup SA's support for CodeBlob subclasses [v2] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 05:49:56 GMT, Chris Plummer wrote: >> There is a lot of subclassing of CodeBlob types done in SA to mimic hotspot, but most of it is unnecessary. The generic CodeBlob class can handle all support needed by most of the subclasses. The only subclasses we need to keep around are NMethod, RuntimeStub, and UpcallStub, since they all have special support in SA. I also kept around RuntimeBlob so RuntimeStub can continue to inherit from it and be consistent with hotspot, but it's not actually necessary, and I'm more than happy to remove it also. >> >> I also cleaned up the PStack support for CodeBlobs. It can just use CodeBlob.getName() rather than trying to figure out the type of the CodeBlob instance to print out type name. This allows us to get rid of most isXXX() APIs. It also provides more useful output in some cases.
>> >> There is some minor loss of functionality in some of the CodeBlob subtypes I removed. For example this is what AdapterBlob.getName() looked like (it is now gone): >> >> >> public String getName() { >> return "AdapterBlob: " + super.getName(); >> } >> >> >> So now we just use the default CodeBlob.getName(), which is what super.getName() would end up executing. I think for AdapterBlob this always returns "I2C/C2I adapters", so now you only get this rather than "AdapterBlob: I2C/C2I adapters". We have a similar loss of getName() detail with MethodHandlesAdapterBlob (now returns "MethodHandles adapters") and VtableBlob (now returns "vtable chunks"). Basically for these 3 CodeBlob types getName() will no longer include the CodeBlob type. I could special case them in CodeBlob.getName() by fetching the kind to determine what the proper name should be. Let me know if you think it is worth it. > > Chris Plummer has updated the pull request incrementally with one additional commit since the last revision: > > Minor improvements. LGTM ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23684#pullrequestreview-2627250925 From aph at openjdk.org Wed Feb 19 16:04:55 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 19 Feb 2025 16:04:55 GMT Subject: RFR: 8349727: [PPC] C1: Improve Class.isInstance intrinsic [v3] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 13:25:09 GMT, Martin Doerr wrote: >> PPC64 implementation of [JDK-8337251](https://bugs.openjdk.org/browse/JDK-8337251). >> The new runtime stub is called like a C function. The initial version therefore used a `FunctionDescriptor` with relocation on PPC64 with ABIv1. I've changed that with the 3rd Commit. `rt_call` jumps directly to the entry point, now.
>> >> Performance measured on Power10: `make run-test TEST="micro:SecondarySupersLookup" MICRO="VM_OPTIONS=-XX:TieredStopAtLevel=1"` >> >> Before this patch (C code) >> >> Benchmark Mode Cnt Score Error Units >> SecondarySupersLookup.testNegative00 avgt 15 18.570 ± 0.009 ns/op >> ... >> SecondarySupersLookup.testNegative30 avgt 15 18.566 ± 0.002 ns/op >> SecondarySupersLookup.testNegative32 avgt 15 19.177 ± 1.347 ns/op >> SecondarySupersLookup.testNegative40 avgt 15 18.569 ± 0.006 ns/op >> SecondarySupersLookup.testNegative50 avgt 15 19.207 ± 1.334 ns/op >> SecondarySupersLookup.testNegative55 avgt 15 19.708 ± 1.338 ns/op >> SecondarySupersLookup.testNegative56 avgt 15 19.132 ± 0.137 ns/op >> SecondarySupersLookup.testNegative57 avgt 15 19.133 ± 0.134 ns/op >> SecondarySupersLookup.testNegative58 avgt 15 19.772 ± 1.316 ns/op >> SecondarySupersLookup.testNegative59 avgt 15 19.109 ± 0.014 ns/op >> SecondarySupersLookup.testNegative60 avgt 15 22.381 ± 0.016 ns/op >> SecondarySupersLookup.testNegative61 avgt 15 22.331 ± 0.011 ns/op >> SecondarySupersLookup.testNegative62 avgt 15 22.352 ± 0.029 ns/op >> SecondarySupersLookup.testNegative63 avgt 15 30.371 ± 0.031 ns/op >> SecondarySupersLookup.testNegative64 avgt 15 29.927 ± 0.221 ns/op >> SecondarySupersLookup.testPositive01 avgt 15 18.571 ± 0.006 ns/op >> ... >> SecondarySupersLookup.testPositive09 avgt 15 18.599 ± 0.140 ns/op >> SecondarySupersLookup.testPositive10 avgt 15 19.210 ± 1.332 ns/op >> SecondarySupersLookup.testPositive16 avgt 15 18.603 ± 0.142 ns/op >> SecondarySupersLookup.testPositive20 avgt 15 19.210 ± 1.333 ns/op >> SecondarySupersLookup.testPositive30 avgt 15 18.600 ± 0.140 ns/op >> SecondarySupersLookup.testPositive32 avgt 15 18.637 ± 0.189 ns/op >> SecondarySupersLookup.testPositive40 avgt 15 19.137 ± 0.190 ns/op >> SecondarySupersLookup.testPositive50 avgt 15 18.567 ± 0.002 ns/op >> SecondarySupersLookup.testPositive60 avgt 15 19.069 ±
0.004 ns/op >> SecondarySupersLookup.testPositive63 avgt 15 26.024 ± 0.017 ns/op >> SecondarySupersLookup.tes... > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Unroll repne_scan loop for better performance. Is this change actually worthwhile on PPC? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23602#issuecomment-2669083789 From mdoerr at openjdk.org Wed Feb 19 16:08:54 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 19 Feb 2025 16:08:54 GMT Subject: RFR: 8349727: [PPC] C1: Improve Class.isInstance intrinsic [v3] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 16:02:30 GMT, Andrew Haley wrote: > Is this change actually worthwhile on PPC? It's not a big gain. I implemented it mainly for parity with the other platforms. Maybe we want to remove the C++ version at some point in time? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23602#issuecomment-2669094773 From kvn at openjdk.org Wed Feb 19 16:08:56 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Feb 2025 16:08:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 07:17:30 GMT, Emanuel Peter wrote: > The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesting benchmark results there. > > Does that sound ok? Yes, it is a good plan.
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2669094347 From liach at openjdk.org Wed Feb 19 16:16:55 2025 From: liach at openjdk.org (Chen Liang) Date: Wed, 19 Feb 2025 16:16:55 GMT Subject: RFR: 8349563: Improve AbsNode::Value() for integer types In-Reply-To: References: Message-ID: <4BGh_S5KBKdofXCOmj6e7HYCR4GUSi9-ShxqW-h4oNQ=.2cb7d760-4f42-4e95-b993-39f99931c1d9@github.com> On Wed, 19 Feb 2025 05:10:04 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a small patch that improves the implementation of Value() for `AbsINode` and `AbsLNode` by returning the absolute value of the input range. Most of the logic is trivial except for the special case where `_lo == jint_min/jlong_min` which must return the entire type range when encountered, for which I've added a small proof in the comments. I've also added some unit tests and updated the file to limit IR check platforms with more granularity. > > Thoughts and reviews would be appreciated! src/hotspot/share/opto/subnode.cpp line 1941: > 1939: > 1940: if (lo_abs < 0) { > 1941: assert(lo_abs == std::numeric_limits::min(), "uabs(t->_lo) must be min value if negative!"); I think asserting `t->_lo` to be min is more straightforward, and also indicates `(t->_lo) + 1`, which yields max, is in the type. We can simplify the comment below too. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23685#discussion_r1961978914 From kvn at openjdk.org Wed Feb 19 16:18:57 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Feb 2025 16:18:57 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 07:17:30 GMT, Emanuel Peter wrote: > > Can we profile alignment in Interpreter (and C1)? > > It would be nice if we could profile alignment or aliasing. Maybe that is possible. 
But I suppose there are always cases where profiling is not available (Xcomp?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. > > What do you think? You should not worry about `-Xcomp`; it is a testing flag, and we can use some default there. I am fine if you think profiling will not bring us much benefit. Note, I am not asking you to create counters, just a bit to indicate whether we had an unaligned access to native memory in a method. In such a case we may skip the predicate and generate the multi-version loop during compilation. On the other hand, we may have unaligned accesses only during startup and not later when we compile the method. Anyway, it does not affect these changes. I will look at the changes more later. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2669115673 From epeter at openjdk.org Wed Feb 19 16:18:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 16:18:57 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 16:14:09 GMT, Vladimir Kozlov wrote: > I am fine if you think profiling will not bring us much benefit Yeah, I think it is a good assumption that we will always get aligned and non-aliasing inputs. And if that is not the case, then this is a rare case, and it should be ok to pay the price of recompilation, I think. > I will look at the changes more later.
Thank you :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2669122452 From bulasevich at openjdk.org Wed Feb 19 16:21:27 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 19 Feb 2025 16:21:27 GMT Subject: RFR: 8350344: Cross-build failure: _vptr conflicts with internal virtual virtual table field Message-ID: With this change, I aim to address a build issue caused by the recent #23533 update, which introduced the CodeBlob _vptr field. This naming has led to build failures in HotSpot, reproduced on the Linaro GCC cross-toolchain for AArch64 and ARM32. src/hotspot/share/code/codeBlob.hpp:349:21: error: member '_vptr' conflicts with virtual function table field name static const Vptr _vptr; ^~~~~ src/hotspot/share/code/codeBlob.hpp:437:21: error: member '_vptr' conflicts with virtual function table field name static const Vptr _vptr; ^~~~~ src/hotspot/share/code/codeBlob.hpp:477:21: error: member '_vptr' conflicts with virtual function table field name static const Vptr _vptr; ^~~~~ src/hotspot/share/code/codeBlob.hpp:560:21: error: member '_vptr' conflicts with virtual function table field name static const Vptr _vptr; ------------- Commit messages: - 8350344: Cross-build failure: _vptr conflicts with internal virtual table field Changes: https://git.openjdk.org/jdk/pull/23703/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23703&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350344 Stats: 23 lines in 3 files changed: 0 ins; 0 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/23703.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23703/head:pull/23703 PR: https://git.openjdk.org/jdk/pull/23703 From bulasevich at openjdk.org Wed Feb 19 16:21:27 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 19 Feb 2025 16:21:27 GMT Subject: RFR: 8350344: Cross-build failure: _vptr conflicts with internal virtual virtual table field In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025
16:13:16 GMT, Boris Ulasevich wrote: > With this change, I aim to address a build issue caused by the recent #23533 update, which introduced the CodeBlob _vptr field. This naming has led to build failures in HotSpot, reproduced on the Linaro GCC cross-toolchain for AArch64 and ARM32. > > src/hotspot/share/code/codeBlob.hpp:349:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr; > ^~~~~ > src/hotspot/share/code/codeBlob.hpp:437:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr; > ^~~~~ > src/hotspot/share/code/codeBlob.hpp:477:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr; > ^~~~~ > src/hotspot/share/code/codeBlob.hpp:560:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr; @vnkozlov It seems that this is a toolchain bug rather than an issue in the HotSpot sources. However, do you think we could rename _vptr to resolve the issue? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23703#issuecomment-2669125748 From mdoerr at openjdk.org Wed Feb 19 16:24:44 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 19 Feb 2025 16:24:44 GMT Subject: RFR: 8349727: [PPC] C1: Improve Class.isInstance intrinsic [v4] In-Reply-To: References: Message-ID: > PPC64 implementation of [JDK-8337251](https://bugs.openjdk.org/browse/JDK-8337251). > The new runtime stub is called like a C function. The initial version therefore used a `FunctionDescriptor` with relocation on PPC64 with ABIv1. I've changed that with the 3rd Commit. `rt_call` jumps directly to the entry point, now. > > Performance measured on Power10: `make run-test TEST="micro:SecondarySupersLookup" MICRO="VM_OPTIONS=-XX:TieredStopAtLevel=1"` > > Before this patch (C code) > > Benchmark Mode Cnt Score Error Units > SecondarySupersLookup.testNegative00 avgt 15 18.570 ± 0.009 ns/op > ...
> SecondarySupersLookup.testNegative30 avgt 15 18.566 ± 0.002 ns/op > SecondarySupersLookup.testNegative32 avgt 15 19.177 ± 1.347 ns/op > SecondarySupersLookup.testNegative40 avgt 15 18.569 ± 0.006 ns/op > SecondarySupersLookup.testNegative50 avgt 15 19.207 ± 1.334 ns/op > SecondarySupersLookup.testNegative55 avgt 15 19.708 ± 1.338 ns/op > SecondarySupersLookup.testNegative56 avgt 15 19.132 ± 0.137 ns/op > SecondarySupersLookup.testNegative57 avgt 15 19.133 ± 0.134 ns/op > SecondarySupersLookup.testNegative58 avgt 15 19.772 ± 1.316 ns/op > SecondarySupersLookup.testNegative59 avgt 15 19.109 ± 0.014 ns/op > SecondarySupersLookup.testNegative60 avgt 15 22.381 ± 0.016 ns/op > SecondarySupersLookup.testNegative61 avgt 15 22.331 ± 0.011 ns/op > SecondarySupersLookup.testNegative62 avgt 15 22.352 ± 0.029 ns/op > SecondarySupersLookup.testNegative63 avgt 15 30.371 ± 0.031 ns/op > SecondarySupersLookup.testNegative64 avgt 15 29.927 ± 0.221 ns/op > SecondarySupersLookup.testPositive01 avgt 15 18.571 ± 0.006 ns/op > ... > SecondarySupersLookup.testPositive09 avgt 15 18.599 ± 0.140 ns/op > SecondarySupersLookup.testPositive10 avgt 15 19.210 ± 1.332 ns/op > SecondarySupersLookup.testPositive16 avgt 15 18.603 ± 0.142 ns/op > SecondarySupersLookup.testPositive20 avgt 15 19.210 ± 1.333 ns/op > SecondarySupersLookup.testPositive30 avgt 15 18.600 ± 0.140 ns/op > SecondarySupersLookup.testPositive32 avgt 15 18.637 ± 0.189 ns/op > SecondarySupersLookup.testPositive40 avgt 15 19.137 ± 0.190 ns/op > SecondarySupersLookup.testPositive50 avgt 15 18.567 ± 0.002 ns/op > SecondarySupersLookup.testPositive60 avgt 15 19.069 ± 0.004 ns/op > SecondarySupersLookup.testPositive63 avgt 15 26.024 ± 0.017 ns/op > SecondarySupersLookup.testPositive64 avgt 15 29.932 ± 1.002 ns/op > > > After this patch (assemble... Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Add assertion.
------------- Changes: - all: https://git.openjdk.org/jdk/pull/23602/files - new: https://git.openjdk.org/jdk/pull/23602/files/ffd2b2a9..8b273448 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23602&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23602&range=02-03 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23602.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23602/head:pull/23602 PR: https://git.openjdk.org/jdk/pull/23602 From mdoerr at openjdk.org Wed Feb 19 16:27:55 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 19 Feb 2025 16:27:55 GMT Subject: RFR: 8349727: [PPC] C1: Improve Class.isInstance intrinsic [v3] In-Reply-To: References: Message-ID: <6OgJdG6W07V8BYErKC8Nmje6UYkfiWCujW5v308zd9w=.491380bb-b7de-4549-bd66-911163176d1f@github.com> On Wed, 19 Feb 2025 15:24:02 GMT, Richard Reingruber wrote: >> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: >> >> Unroll repne_scan loop for better performance. > > src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp line 2845: > >> 2843: if (dest == Runtime1::entry_for(C1StubId::register_finalizer_id) || >> 2844: dest == Runtime1::entry_for(C1StubId::new_multi_array_id ) || >> 2845: dest == Runtime1::entry_for(C1StubId::is_instance_of_id )) { > > Should there be an assertion that `dest` is in the CodeCache? > Or even use that check as condition to emit the optimized call? I've added the assertion. I prefer having the special stubs listed explicitly. I hope that there will not be many more. Otherwise, we could still change it. 
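The two options discussed in this review exchange (an explicit list of stubs with an assertion, versus using the code-cache check itself as the condition) can be contrasted in a small standalone sketch. Everything below is a hypothetical stand-in written for illustration: `StubId`, `entry_for`, `in_code_cache` and the address constants are not HotSpot APIs, and the real check in `c1_LIRAssembler_ppc.cpp` may look different.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins; none of these names are HotSpot APIs.
enum class StubId { register_finalizer, new_multi_array, is_instance_of, other };

// Pretend code cache occupying a fixed address range (assumed values).
constexpr std::uintptr_t kCodeCacheLow  = 0x1000;
constexpr std::uintptr_t kCodeCacheHigh = 0x2000;

inline bool in_code_cache(std::uintptr_t addr) {
  return addr >= kCodeCacheLow && addr < kCodeCacheHigh;
}

// Pretend entry points; `other` deliberately lives outside the cache,
// like a plain C++ runtime function would.
inline std::uintptr_t entry_for(StubId id) {
  switch (id) {
    case StubId::register_finalizer: return 0x1100;
    case StubId::new_multi_array:    return 0x1200;
    case StubId::is_instance_of:     return 0x1300;
    default:                         return 0x9000;
  }
}

// Option kept in the thread: list the special stubs explicitly and
// assert that each listed destination really is in the code cache.
inline bool use_direct_call(std::uintptr_t dest) {
  bool listed = dest == entry_for(StubId::register_finalizer) ||
                dest == entry_for(StubId::new_multi_array) ||
                dest == entry_for(StubId::is_instance_of);
  if (listed) {
    assert(in_code_cache(dest) && "listed stubs must live in the code cache");
  }
  return listed;
}

// Alternative raised in review: use the range check as the condition.
inline bool use_direct_call_by_range(std::uintptr_t dest) {
  return in_code_cache(dest);
}
```

In this toy model the two predicates differ only for an address that is inside the cache range but not on the list: the explicit list rejects it, the range check accepts it.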
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23602#discussion_r1961998128 From rrich at openjdk.org Wed Feb 19 17:01:58 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 19 Feb 2025 17:01:58 GMT Subject: RFR: 8349727: [PPC] C1: Improve Class.isInstance intrinsic [v4] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 16:24:44 GMT, Martin Doerr wrote: >> PPC64 implementation of [JDK-8337251](https://bugs.openjdk.org/browse/JDK-8337251). >> The new runtime stub is called like a C function. The initial version therefore used a `FunctionDescriptor` with relocation on PPC64 with ABIv1. I've changed that with the 3rd Commit. `rt_call` jumps directly to the entry point, now. >> >> Performance measured on Power10: `make run-test TEST="micro:SecondarySupersLookup" MICRO="VM_OPTIONS=-XX:TieredStopAtLevel=1"` >> >> Before this patch (C code) >> >> Benchmark Mode Cnt Score Error Units >> SecondarySupersLookup.testNegative00 avgt 15 18.570 ± 0.009 ns/op >> ... >> SecondarySupersLookup.testNegative30 avgt 15 18.566 ± 0.002 ns/op >> SecondarySupersLookup.testNegative32 avgt 15 19.177 ± 1.347 ns/op >> SecondarySupersLookup.testNegative40 avgt 15 18.569 ± 0.006 ns/op >> SecondarySupersLookup.testNegative50 avgt 15 19.207 ± 1.334 ns/op >> SecondarySupersLookup.testNegative55 avgt 15 19.708 ± 1.338 ns/op >> SecondarySupersLookup.testNegative56 avgt 15 19.132 ± 0.137 ns/op >> SecondarySupersLookup.testNegative57 avgt 15 19.133 ± 0.134 ns/op >> SecondarySupersLookup.testNegative58 avgt 15 19.772 ± 1.316 ns/op >> SecondarySupersLookup.testNegative59 avgt 15 19.109 ± 0.014 ns/op >> SecondarySupersLookup.testNegative60 avgt 15 22.381 ± 0.016 ns/op >> SecondarySupersLookup.testNegative61 avgt 15 22.331 ± 0.011 ns/op >> SecondarySupersLookup.testNegative62 avgt 15 22.352 ± 0.029 ns/op >> SecondarySupersLookup.testNegative63 avgt 15 30.371 ± 0.031 ns/op >> SecondarySupersLookup.testNegative64 avgt 15 29.927 ±
0.221 ns/op >> SecondarySupersLookup.testPositive01 avgt 15 18.571 ± 0.006 ns/op >> ... >> SecondarySupersLookup.testPositive09 avgt 15 18.599 ± 0.140 ns/op >> SecondarySupersLookup.testPositive10 avgt 15 19.210 ± 1.332 ns/op >> SecondarySupersLookup.testPositive16 avgt 15 18.603 ± 0.142 ns/op >> SecondarySupersLookup.testPositive20 avgt 15 19.210 ± 1.333 ns/op >> SecondarySupersLookup.testPositive30 avgt 15 18.600 ± 0.140 ns/op >> SecondarySupersLookup.testPositive32 avgt 15 18.637 ± 0.189 ns/op >> SecondarySupersLookup.testPositive40 avgt 15 19.137 ± 0.190 ns/op >> SecondarySupersLookup.testPositive50 avgt 15 18.567 ± 0.002 ns/op >> SecondarySupersLookup.testPositive60 avgt 15 19.069 ± 0.004 ns/op >> SecondarySupersLookup.testPositive63 avgt 15 26.024 ± 0.017 ns/op >> SecondarySupersLookup.tes... > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Add assertion. Looks good! Cheers, Richard. ------------- Marked as reviewed by rrich (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23602#pullrequestreview-2627444921 From rrich at openjdk.org Wed Feb 19 17:01:59 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 19 Feb 2025 17:01:59 GMT Subject: RFR: 8349727: [PPC] C1: Improve Class.isInstance intrinsic [v3] In-Reply-To: <6OgJdG6W07V8BYErKC8Nmje6UYkfiWCujW5v308zd9w=.491380bb-b7de-4549-bd66-911163176d1f@github.com> References: <6OgJdG6W07V8BYErKC8Nmje6UYkfiWCujW5v308zd9w=.491380bb-b7de-4549-bd66-911163176d1f@github.com> Message-ID: On Wed, 19 Feb 2025 16:24:57 GMT, Martin Doerr wrote: >> src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp line 2845: >> >>> 2843: if (dest == Runtime1::entry_for(C1StubId::register_finalizer_id) || >>> 2844: dest == Runtime1::entry_for(C1StubId::new_multi_array_id ) || >>> 2845: dest == Runtime1::entry_for(C1StubId::is_instance_of_id )) { >> >> Should there be an assertion that `dest` is in the CodeCache?
>> Or even use that check as condition to emit the optimized call? > > I've added the assertion. I prefer having the special stubs listed explicitly. I hope that there will not be many more. Otherwise, we could still change it. Yes, maybe that's better. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23602#discussion_r1962053749 From mdoerr at openjdk.org Wed Feb 19 17:04:52 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 19 Feb 2025 17:04:52 GMT Subject: RFR: 8349727: [PPC] C1: Improve Class.isInstance intrinsic [v4] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 16:24:44 GMT, Martin Doerr wrote: >> PPC64 implementation of [JDK-8337251](https://bugs.openjdk.org/browse/JDK-8337251). >> The new runtime stub is called like a C function. The initial version therefore used a `FunctionDescriptor` with relocation on PPC64 with ABIv1. I've changed that with the 3rd Commit. `rt_call` jumps directly to the entry point, now. >> >> Performance measured on Power10: `make run-test TEST="micro:SecondarySupersLookup" MICRO="VM_OPTIONS=-XX:TieredStopAtLevel=1"` >> >> Before this patch (C code) >> >> Benchmark Mode Cnt Score Error Units >> SecondarySupersLookup.testNegative00 avgt 15 18.570 ± 0.009 ns/op >> ... >> SecondarySupersLookup.testNegative30 avgt 15 18.566 ± 0.002 ns/op >> SecondarySupersLookup.testNegative32 avgt 15 19.177 ± 1.347 ns/op >> SecondarySupersLookup.testNegative40 avgt 15 18.569 ± 0.006 ns/op >> SecondarySupersLookup.testNegative50 avgt 15 19.207 ± 1.334 ns/op >> SecondarySupersLookup.testNegative55 avgt 15 19.708 ± 1.338 ns/op >> SecondarySupersLookup.testNegative56 avgt 15 19.132 ± 0.137 ns/op >> SecondarySupersLookup.testNegative57 avgt 15 19.133 ± 0.134 ns/op >> SecondarySupersLookup.testNegative58 avgt 15 19.772 ± 1.316 ns/op >> SecondarySupersLookup.testNegative59 avgt 15 19.109 ± 0.014 ns/op >> SecondarySupersLookup.testNegative60 avgt 15 22.381 ± 0.016 ns/op >> SecondarySupersLookup.testNegative61 avgt 15 22.331 ±
0.011 ns/op >> SecondarySupersLookup.testNegative62 avgt 15 22.352 ± 0.029 ns/op >> SecondarySupersLookup.testNegative63 avgt 15 30.371 ± 0.031 ns/op >> SecondarySupersLookup.testNegative64 avgt 15 29.927 ± 0.221 ns/op >> SecondarySupersLookup.testPositive01 avgt 15 18.571 ± 0.006 ns/op >> ... >> SecondarySupersLookup.testPositive09 avgt 15 18.599 ± 0.140 ns/op >> SecondarySupersLookup.testPositive10 avgt 15 19.210 ± 1.332 ns/op >> SecondarySupersLookup.testPositive16 avgt 15 18.603 ± 0.142 ns/op >> SecondarySupersLookup.testPositive20 avgt 15 19.210 ± 1.333 ns/op >> SecondarySupersLookup.testPositive30 avgt 15 18.600 ± 0.140 ns/op >> SecondarySupersLookup.testPositive32 avgt 15 18.637 ± 0.189 ns/op >> SecondarySupersLookup.testPositive40 avgt 15 19.137 ± 0.190 ns/op >> SecondarySupersLookup.testPositive50 avgt 15 18.567 ± 0.002 ns/op >> SecondarySupersLookup.testPositive60 avgt 15 19.069 ± 0.004 ns/op >> SecondarySupersLookup.testPositive63 avgt 15 26.024 ± 0.017 ns/op >> SecondarySupersLookup.tes... > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Add assertion. Thanks for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23602#issuecomment-2669240261 From kvn at openjdk.org Wed Feb 19 19:07:53 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Feb 2025 19:07:53 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 12:47:03 GMT, Marc Chevalier wrote: > Remove frem and drem macro nodes when the result is not used. These nodes have other outputs (like memory) that are not meaningful but prevent the nodes from being dropped easily. This patch removes the useless frem/drem nodes by rewiring the inputs to the outputs. > > Thanks, > Marc What if you simply return the `Top` node from `Ideal()` if `proj_out_or_null(TypeFunc::Parms) == nullptr`?
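As a rough illustration of the rewiring that the quoted PR description refers to, the following toy model treats a side-effect-free runtime call as a node with a control input, a memory input, and an optional user of its result. It is only a sketch under assumed names (`ToyCall`, `Rewired`, `remove_if_unused`); it does not use C2's node classes and does not model the `Top`-from-`Ideal()` alternative asked about above.

```cpp
#include <cassert>

// Toy model, not C2: a pure runtime call (think frem/drem) with
// control and memory inputs and a flag for whether its value is used.
struct ToyCall {
  int  ctrl_in;      // id of the incoming control
  int  mem_in;       // id of the incoming memory state
  bool result_used;  // is there a user of the returned value?
};

// What users of the call's control/memory projections see afterwards.
struct Rewired {
  bool removed;   // true if the call could be dropped
  int  ctrl_out;  // replacement for the control projection
  int  mem_out;   // replacement for the memory projection
};

// If nobody uses the result, the call itself is useless: its control and
// memory projections are replaced by the call's own inputs.
inline Rewired remove_if_unused(const ToyCall& call) {
  if (call.result_used) {
    return Rewired{false, -1, -1};  // keep the call; nothing to rewire
  }
  return Rewired{true, call.ctrl_in, call.mem_in};
}
```

The point of the sketch is only that dropping such a call is not a plain deletion: its remaining outputs still have users, and those users have to be pointed at something.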
------------- PR Review: https://git.openjdk.org/jdk/pull/23694#pullrequestreview-2627731450 From kvn at openjdk.org Wed Feb 19 19:11:52 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Feb 2025 19:11:52 GMT Subject: RFR: 8350344: Cross-build failure: _vptr conflicts with internal virtual virtual table field In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 16:13:16 GMT, Boris Ulasevich wrote: > With this change, I aim to address a build issue caused by the recent #23533 update, which introduced the CodeBlob _vptr field. This naming has led to build failures in HotSpot, reproduced on the Linaro GCC cross-toolchain for AArch64 and ARM32. > > src/hotspot/share/code/codeBlob.hpp:349:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr; > ^~~~~ > src/hotspot/share/code/codeBlob.hpp:437:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr; > ^~~~~ > src/hotspot/share/code/codeBlob.hpp:477:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr; > ^~~~~ > src/hotspot/share/code/codeBlob.hpp:560:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr; I am fine with renaming if it solves the issue. And it is trivial fix. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23703#pullrequestreview-2627737837 PR Comment: https://git.openjdk.org/jdk/pull/23703#issuecomment-2669530280 From mli at openjdk.org Wed Feb 19 20:09:34 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 19 Feb 2025 20:09:34 GMT Subject: RFR: 8350383: Test: add more test case for string compare (UL case) Message-ID: <8njXwM5PksWMNTNd2N_cDZ-kvTB6wAlzGRU3MxNIAqM=.adc0ee3d-9aa7-45cf-a017-5ddf70ad75bc@github.com> Hi, Can you help to review this simple test case improvement? 
Compared to the LL/UU/LU string compare cases, the UL case does not seem to have enough tests to cover all the code paths in the intrinsics. This patch adds test cases for UL string compare. Thanks! ------------- Commit messages: - initial commit Changes: https://git.openjdk.org/jdk/pull/23705/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23705&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350383 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23705.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23705/head:pull/23705 PR: https://git.openjdk.org/jdk/pull/23705 From bulasevich at openjdk.org Wed Feb 19 20:15:54 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 19 Feb 2025 20:15:54 GMT Subject: RFR: 8350344: Cross-build failure: _vptr name conflict In-Reply-To: References: Message-ID: <30BX3r3wEdrnVcxICpVQcBg_4SlzPxLPMvA4LpBC0Vc=.b4e831e0-5437-403b-91ff-0c270fc7f20c@github.com> On Wed, 19 Feb 2025 16:13:16 GMT, Boris Ulasevich wrote: > With this change, I aim to address a build issue caused by the recent #23533 update, which introduced the CodeBlob _vptr field. This naming has led to build failures in HotSpot, reproduced on the Linaro GCC cross-toolchain for AArch64 and ARM32. > > src/hotspot/share/code/codeBlob.hpp:349:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr; > ^~~~~ > src/hotspot/share/code/codeBlob.hpp:437:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr; > ^~~~~ > src/hotspot/share/code/codeBlob.hpp:477:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr; > ^~~~~ > src/hotspot/share/code/codeBlob.hpp:560:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr;
 Good. Thank you!
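For readers unfamiliar with the pattern behind the quoted error, here is a minimal sketch of a class that keeps its virtual operations in a separate static table object. It is a toy written for illustration, not the HotSpot sources: `ToyBlob`, `ToyStub` and especially the member name `dispatch_table` are hypothetical, and the name actually chosen in the integrated fix is not shown in this thread.

```cpp
#include <cassert>
#include <cstring>

// Toy sketch: virtual behaviour lives in a separate table object that
// each blob points to, instead of in virtual functions on the blob itself.
struct ToyBlob {
  struct Vptr {
    virtual ~Vptr() = default;
    virtual const char* kind_name() const { return "blob"; }
  };

  const Vptr* table;  // per-object pointer to the shared table

  // Hypothetical name. The failing code spelled this member `_vptr`,
  // which the cross-toolchain rejected as clashing with its own
  // virtual function table field name.
  static const Vptr dispatch_table;

  ToyBlob() : table(&dispatch_table) {}
  explicit ToyBlob(const Vptr* t) : table(t) {}

  const char* kind_name() const { return table->kind_name(); }
};
const ToyBlob::Vptr ToyBlob::dispatch_table{};

// A subclass supplies its own table type and its own static instance.
struct ToyStub : ToyBlob {
  struct Vptr : ToyBlob::Vptr {
    const char* kind_name() const override { return "stub"; }
  };
  static const Vptr dispatch_table;
  ToyStub() : ToyBlob(&dispatch_table) {}
};
const ToyStub::Vptr ToyStub::dispatch_table{};
```

Since the static member is only referred to by name, renaming it is a mechanical change, which fits the "23 lines in 3 files changed: 0 ins; 0 del; 23 mod" statistics reported for the fix.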
------------- PR Comment: https://git.openjdk.org/jdk/pull/23703#issuecomment-2669661427 From bulasevich at openjdk.org Wed Feb 19 21:04:57 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 19 Feb 2025 21:04:57 GMT Subject: Integrated: 8350344: Cross-build failure: _vptr name conflict In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 16:13:16 GMT, Boris Ulasevich wrote: > With this change, I aim to address a build issue caused by the recent #23533 update, which introduced the CodeBlob _vptr field. This naming has led to build failures in HotSpot, reproduced on the Linaro GCC cross-toolchain for AArch64 and ARM32. > > src/hotspot/share/code/codeBlob.hpp:349:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr; > ^~~~~ > src/hotspot/share/code/codeBlob.hpp:437:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr; > ^~~~~ > src/hotspot/share/code/codeBlob.hpp:477:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr; > ^~~~~ > src/hotspot/share/code/codeBlob.hpp:560:21: error: member '_vptr' conflicts with virtual function table field name > static const Vptr _vptr; This pull request has now been integrated. Changeset: 92efab90 Author: Boris Ulasevich URL: https://git.openjdk.org/jdk/commit/92efab90db24a76cc28fc1ae1db870a0dd670266 Stats: 23 lines in 3 files changed: 0 ins; 0 del; 23 mod 8350344: Cross-build failure: _vptr name conflict Reviewed-by: kvn ------------- PR: https://git.openjdk.org/jdk/pull/23703 From dlong at openjdk.org Wed Feb 19 22:10:52 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Feb 2025 22:10:52 GMT Subject: RFR: 8350258: AArch64: Client build fails after JDK-8347917 In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 22:42:18 GMT, Dmitry Chuyko wrote: > The location for rfp should be set in the register map.
In particular, it wasn't set in frame::sender_for_interpreter_frame() if neither C2 nor JVMCI were included. > > COMPILER1_OR_COMPILER2 condition is used instead of COMPILER2_OR_JVMCI, which also covers INCLUDE_JVMCI case. LGTM ------------- Marked as reviewed by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23682#pullrequestreview-2628087307 From kvn at openjdk.org Wed Feb 19 22:44:51 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Feb 2025 22:44:51 GMT Subject: RFR: 8350258: AArch64: Client build fails after JDK-8347917 In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 22:42:18 GMT, Dmitry Chuyko wrote: > COMPILER1_OR_COMPILER2 condition is used instead of COMPILER2_OR_JVMCI, which also covers INCLUDE_JVMCI case. Where did you get this? A Client VM will have only C1, and your guard will pass. ------------- Changes requested by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23682#pullrequestreview-2628137848 From kvn at openjdk.org Wed Feb 19 22:48:53 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Feb 2025 22:48:53 GMT Subject: RFR: 8350258: AArch64: Client build fails after JDK-8347917 In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 22:42:18 GMT, Dmitry Chuyko wrote: > The location for rfp should be set in the register map. In particular, it wasn't set in frame::sender_for_interpreter_frame() if neither C2 nor JVMCI were included. > > COMPILER1_OR_COMPILER2 condition is used instead of COMPILER2_OR_JVMCI, which also covers INCLUDE_JVMCI case.
Is it from here?: https://github.com/openjdk/jdk/blob/master/src/hotspot/share/prims/jvm.cpp#L379 ------------- PR Comment: https://git.openjdk.org/jdk/pull/23682#issuecomment-2669931783 From sviswanathan at openjdk.org Wed Feb 19 23:21:07 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 19 Feb 2025 23:21:07 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID: <2OIYkOt8CJ-CqnQIK8sgMDtvLxJUyD5r_mKj5QT7_a8=.10b1d382-d9ae-40a1-b895-09086c80dee6@github.com> On Tue, 18 Feb 2025 02:36:13 GMT, Julian Waters wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Review comments resolutions > > Is anyone else getting compile failures after this was integrated? This weirdly seems to only happen on Linux > > * For target hotspot_variant-server_libjvm_objs_mulnode.o: > /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp: In member function ‘virtual const Type* FmaHFNode::Value(PhaseGVN*) const’: > /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:1944:37: error: call of overloaded ‘make(double)’ is ambiguous > 1944 | return TypeH::make(fma(f1, f2, f3)); > | ^ > In file included from /home/runner/work/jdk/jdk/src/hotspot/share/opto/node.hpp:31, > from /home/runner/work/jdk/jdk/src/hotspot/share/opto/addnode.hpp:28, > from /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:26: > /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:544:23: note: candidate: ‘static const TypeH* TypeH::make(float)’ > 544 | static const TypeH* make(float f); > | ^~~~ > /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:545:23: note: candidate: ‘static const TypeH* TypeH::make(short int)’ > 545 | static const TypeH* make(short f); > | ^~~~ @TheShermanTanker I don't see any compile failures on Linux. Both the fastdebug and release build successfully.
------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2669979058 From dlong at openjdk.org Wed Feb 19 23:59:52 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Feb 2025 23:59:52 GMT Subject: RFR: 8350258: AArch64: Client build fails after JDK-8347917 In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 22:42:18 GMT, Dmitry Chuyko wrote: > The location for rfp should be set in the register map. In particular, it wasn't set in frame::sender_for_interpreter_frame() if neither C2 nor JVMCI were included. > > COMPILER1_OR_COMPILER2 condition is used instead of COMPILER2_OR_JVMCI, which also covers INCLUDE_JVMCI case. I think @vnkozlov is right. I don't see where COMPILER1_OR_COMPILER2 is true for JVMCI. Should we use COMPILER1 || COMPILER2_OR_JVMCI, or remove the #if and instead guard with !PreserveFramePointer? ------------- Changes requested by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23682#pullrequestreview-2628245608 From fyang at openjdk.org Thu Feb 20 01:13:02 2025 From: fyang at openjdk.org (Fei Yang) Date: Thu, 20 Feb 2025 01:13:02 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v3] In-Reply-To: <4fhBovVbAhjQhVhYDSf4XCr-JWoeQTcT6TVk-zM3gdY=.490462ed-5d03-4e17-8bee-e1354aeea250@github.com> References: <4fhBovVbAhjQhVhYDSf4XCr-JWoeQTcT6TVk-zM3gdY=.490462ed-5d03-4e17-8bee-e1354aeea250@github.com> Message-ID: On Tue, 18 Feb 2025 10:18:26 GMT, Hamlin Li wrote: >> Thanks for the update. Several more comments after another look. > >> Thanks for the update. Several more comments after another look. > > Thanks, all fixed. @Hamlin-Li : Hi, I think it's OK to remove the FP part of the change for now. We can do more evaluation for it then.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23580#issuecomment-2670137369 From duke at openjdk.org Thu Feb 20 01:17:54 2025 From: duke at openjdk.org (Nicole Xu) Date: Thu, 20 Feb 2025 01:17:54 GMT Subject: RFR: 8349943: [JMH] Use jvmArgs consistently In-Reply-To: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> References: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> Message-ID: On Thu, 13 Feb 2025 08:35:47 GMT, Nicole Xu wrote: > As is suggested in [JDK-8342958](https://bugs.openjdk.org/browse/JDK-8342958), `jvmArgs` should be used consistently in microbenchmarks to 'align with the intuition that when you use jvmArgsAppend/-Prepend intent is to add to a set of existing flags, while if you supply jvmArgs intent is "run with these and nothing else"'. > > All the previous flags were aligned in https://github.com/openjdk/jdk/pull/21683, while some recent tests use inconsistent `jvmArgs` again. We update them to keep the consistency. Hi @cl4es, here is a follow-up fix of `jvmArgs` flags. Could you please review the changes? Thanks.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23609#issuecomment-2670146567 From redestad at openjdk.org Thu Feb 20 01:25:59 2025 From: redestad at openjdk.org (Claes Redestad) Date: Thu, 20 Feb 2025 01:25:59 GMT Subject: RFR: 8349943: [JMH] Use jvmArgs consistently In-Reply-To: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> References: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> Message-ID: On Thu, 13 Feb 2025 08:35:47 GMT, Nicole Xu wrote: > As is suggested in [JDK-8342958](https://bugs.openjdk.org/browse/JDK-8342958), `jvmArgs` should be used consistently in microbenchmarks to 'align with the intuition that when you use jvmArgsAppend/-Prepend intent is to add to a set of existing flags, while if you supply jvmArgs intent is "run with these and nothing else"'. > > All the previous flags were aligned in https://github.com/openjdk/jdk/pull/21683, while some recent tests use inconsistent `jvmArgs` again. We update them to keep the consistency. Marked as reviewed by redestad (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23609#pullrequestreview-2628391999 From duke at openjdk.org Thu Feb 20 01:30:58 2025 From: duke at openjdk.org (Nicole Xu) Date: Thu, 20 Feb 2025 01:30:58 GMT Subject: RFR: 8349943: [JMH] Use jvmArgs consistently In-Reply-To: References: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> Message-ID: On Thu, 13 Feb 2025 20:04:05 GMT, Chen Liang wrote: >> As is suggested in [JDK-8342958](https://bugs.openjdk.org/browse/JDK-8342958), `jvmArgs` should be used consistently in microbenchmarks to 'align with the intuition that when you use jvmArgsAppend/-Prepend intent is to add to a set of existing flags, while if you supply jvmArgs intent is "run with these and nothing else"'. 
>> >> All the previous flags were aligned in https://github.com/openjdk/jdk/pull/21683, while some recent tests use inconsistent `jvmArgs` again. We update them to keep the consistency. > > The java.lang.foreign arg changes look fine. @liach @sendaoYan @cl4es Thanks for your review. I'm going to integrate the patch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23609#issuecomment-2670174332 From duke at openjdk.org Thu Feb 20 01:30:58 2025 From: duke at openjdk.org (duke) Date: Thu, 20 Feb 2025 01:30:58 GMT Subject: RFR: 8349943: [JMH] Use jvmArgs consistently In-Reply-To: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> References: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> Message-ID: On Thu, 13 Feb 2025 08:35:47 GMT, Nicole Xu wrote: > As is suggested in [JDK-8342958](https://bugs.openjdk.org/browse/JDK-8342958), `jvmArgs` should be used consistently in microbenchmarks to 'align with the intuition that when you use jvmArgsAppend/-Prepend intent is to add to a set of existing flags, while if you supply jvmArgs intent is "run with these and nothing else"'. > > All the previous flags were aligned in https://github.com/openjdk/jdk/pull/21683, while some recent tests use inconsistent `jvmArgs` again. We update them to keep the consistency. @xyyNicole Your change (at version 06f874154e22886d7f1522a70dbdccd87fb4d004) is now ready to be sponsored by a Committer. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23609#issuecomment-2670175001 From haosun at openjdk.org Thu Feb 20 01:36:56 2025 From: haosun at openjdk.org (Hao Sun) Date: Thu, 20 Feb 2025 01:36:56 GMT Subject: RFR: 8349943: [JMH] Use jvmArgs consistently In-Reply-To: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> References: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> Message-ID: On Thu, 13 Feb 2025 08:35:47 GMT, Nicole Xu wrote: > As is suggested in [JDK-8342958](https://bugs.openjdk.org/browse/JDK-8342958), `jvmArgs` should be used consistently in microbenchmarks to 'align with the intuition that when you use jvmArgsAppend/-Prepend intent is to add to a set of existing flags, while if you supply jvmArgs intent is "run with these and nothing else"'. > > All the previous flags were aligned in https://github.com/openjdk/jdk/pull/21683, while some recent tests use inconsistent `jvmArgs` again. We update them to keep the consistency. This patch has been reviewed and tested internally. GHA tests are all green. Let me sponsor it. ------------- Marked as reviewed by haosun (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/23609#pullrequestreview-2628468812 From duke at openjdk.org Thu Feb 20 01:36:56 2025 From: duke at openjdk.org (Nicole Xu) Date: Thu, 20 Feb 2025 01:36:56 GMT Subject: Integrated: 8349943: [JMH] Use jvmArgs consistently In-Reply-To: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> References: <08dxLOf4JEBqkJxDjlJTib4_zmUraTx6-mO9FIblDx0=.61b739e3-4ea4-4be0-a3ec-459376863c5a@github.com> Message-ID: On Thu, 13 Feb 2025 08:35:47 GMT, Nicole Xu wrote: > As is suggested in [JDK-8342958](https://bugs.openjdk.org/browse/JDK-8342958), `jvmArgs` should be used consistently in microbenchmarks to 'align with the intuition that when you use jvmArgsAppend/-Prepend intent is to add to a set of existing flags, while if you supply jvmArgs intent is "run with these and nothing else"'. > > All the previous flags were aligned in https://github.com/openjdk/jdk/pull/21683, while some recent tests use inconsistent `jvmArgs` again. We update them to keep the consistency. This pull request has now been integrated. Changeset: 3ebed783 Author: Nicole Xu Committer: Hao Sun URL: https://git.openjdk.org/jdk/commit/3ebed78328bd64d2e18369d63d6ea323b87a7b24 Stats: 20 lines in 9 files changed: 2 ins; 0 del; 18 mod 8349943: [JMH] Use jvmArgs consistently Reviewed-by: syan, redestad, haosun ------------- PR: https://git.openjdk.org/jdk/pull/23609 From xgong at openjdk.org Thu Feb 20 05:47:53 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 20 Feb 2025 05:47:53 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 08:14:40 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2735: >> >>> 2733: ShouldNotReachHere(); >>> 2734: } >>> 2735: mulv(dst, size2, index, tmp1); >> >> Can we use vector `lsl` instead of `mul` here, so that we can also support D types for NEON/SVE1 ? 
> > @XiaohongGong , thanks I'll give it a try and get back. @Bhavana-Kilambi , a left shift cannot produce the right indexes here, because the values `0x2, 0x4` land in each B lane. Maybe we can just try with `bsl` for D size types, as it has only two lanes for long/double types with 128-bit vector length. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1962894625 From dchuyko at openjdk.org Thu Feb 20 05:55:52 2025 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Thu, 20 Feb 2025 05:55:52 GMT Subject: RFR: 8350258: AArch64: Client build fails after JDK-8347917 In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 22:46:02 GMT, Vladimir Kozlov wrote: > Is it from here?: https://github.com/openjdk/jdk/blob/master/src/hotspot/share/prims/jvm.cpp#L379 > > Yes, I mean this check. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23682#issuecomment-2670531640 From fyang at openjdk.org Thu Feb 20 06:15:53 2025 From: fyang at openjdk.org (Fei Yang) Date: Thu, 20 Feb 2025 06:15:53 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v6] In-Reply-To: References: Message-ID: <0um0rwt97V6Q8Qen419I2t1lFga92zsYm2fOcLKWSTA=.8568e7bb-0c9c-4375-a274-b40ca3328280@github.com> On Tue, 18 Feb 2025 14:33:52 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch to implement MulReductionVI/MulReductionVL/MulReductionVF/MulReductionVD? >> This optimization is mainly for the vector API. >> On riscv, there are no straightforward instructions to do it, but we can do it with a reduction tree, which could reduce the time complexity to lg(N).
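For illustration only, the reduction-tree idea mentioned above can be sketched in scalar Java. This is a sketch under assumptions, not the RISC-V code from the patch: the class `MulReductionTree` and the method `reduceMul` are hypothetical names, and the lane count is assumed to be a non-zero power of two. Each round folds the upper half of the lanes onto the lower half, so N lanes need lg(N) rounds; on vector hardware the inner loop of one round would correspond to a single vector multiply.

```java
import java.util.Arrays;

public class MulReductionTree {
    // Hypothetical helper: multiplies all lanes together using a reduction tree.
    // Assumption: lanes.length is a non-zero power of two.
    static int reduceMul(int[] lanes) {
        int[] work = Arrays.copyOf(lanes, lanes.length); // leave the input untouched
        // Each round halves the number of live lanes: lg(N) rounds in total.
        for (int half = work.length / 2; half >= 1; half /= 2) {
            // One round: lane i absorbs lane i + half.
            for (int i = 0; i < half; i++) {
                work[i] *= work[i + half];
            }
        }
        return work[0];
    }

    public static void main(String[] args) {
        int[] lanes = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(reduceMul(lanes)); // prints 40320 (8 lanes, 3 rounds)
    }
}
```

Reordering the multiplications this way is safe for `int` and `long` because wrapping multiplication is associative. For `float` and `double` the tree order can round differently from a strictly sequential reduction, so the floating-point variants need separate consideration.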
>> >> >> Thanks >> >> ## Test >> >> ### jtreg >> test/jdk/jdk/incubator/vector/ >> >> ### Performance >> >> run on bananapi >> >> master vs patch >>
>> Benchmark | (size) | Mode | Cnt | Score - master | Error - master | Score - patch | Error - patch | Units | Improvement
>> -- | -- | -- | -- | -- | -- | -- | -- | -- | --
>> ByteMaxVector.MULLanes | 1024 | avgt | 10 | 11170.052 | 499.676 | 1294.424 | 8.346 | ns/op | 88.40%
>> ByteMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 12514.587 | 55.06 | 1413.028 | 0.259 | ns/op | 88.70%
>> DoubleMaxVector.MULLanes | 1024 | avgt | 10 | 57672.51 | 1750.417 | 4775.633 | 1.454 | ns/op | 91.70%
>> DoubleMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63656.523 | 1063.048 | 5899.259 | 1.692 | ns/op | 90.70%
>> FloatMaxVector.MULLanes | 1024 | avgt | 10 | 30997.218 | 728.73 | 2473.069 | 5.84 | ns/op | 92.00%
>> FloatMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 35515.329 | 227.873 | 3284.17 | 0.608 | ns/op | 90.80%
>> IntMaxVector.MULLanes | 1024 | avgt | 10 | 31130.453 | 878.261 | 3304.118 | 5.96 | ns/op | 89.40%
>> IntMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 36851.976 | 394.001 | 3969.407 | 0.511 | ns/op | 89.20%
>> LongMaxVector.MULLanes | 1024 | avgt | 10 | 58795.752 | 1030.985 | 6883.995 | 3.15 | ns/op | 88.30%
>> LongMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63642.904 | 386.521 | 7892.735 | 9.359 | ns/op | 87.60%
>> ShortMaxVector.MULLanes | 1024 | avgt | 10 | 16857.441 | 762.428 | 2287.141 | 0.186 | ns/op | 86.40%
>> ShortMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 21171.375 | 74.684 | 2532.913 | 0.274 | ns/op | 88.00%
>> >> > > Hamlin Li has updated the pull request incrementally with two additional commits since the last revision: > > - fix unordered > - fix src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp line 244: > 242: void reduce_mul_integral_v(Register dst, Register src1, VectorRegister src2, > 243: VectorRegister vtmp1, VectorRegister vtmp2, BasicType bt, > 244: uint len, VectorMask vm =
Assembler::unmasked); Do you mind renaming this param `len` to `vector_length`? Then it will be consistent with friends `compare_integral_v`, `compare_fp_v`, etc in param naming. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1962915698 From epeter at openjdk.org Thu Feb 20 06:48:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 20 Feb 2025 06:48:59 GMT Subject: RFR: 8348572: C2 compilation asserts due to unexpected irreducible loop [v3] In-Reply-To: References: <27h3jiEmNB_5e2BIb1AGh-Rq_A2K3qVrj_RGo9RBOC8=.c90b34a0-8a7f-46c6-bee7-037b25cfec85@github.com> Message-ID: <7btoW9Bz7csZ7LneMc7wPBkgFEkSysbB13oyHg02veM=.08ad73b6-6ee8-4c05-8f48-0f3cde5aa2b3@github.com> On Wed, 5 Feb 2025 12:18:07 GMT, Emanuel Peter wrote: >> A quick summary: >> - In [JDK-8280126](https://bugs.openjdk.org/browse/JDK-8280126), we decided that we are only going to allow irreducible loops that were detected at parsing, and we can thus restrict optimizations to reducible loops which would be difficult to do correctly with irreducible loops. That's why we added that assert that checks that no new irreducible loop shows up during compilation. >> - Problem: we use `split_if` for `IfNode::Ideal_common` to split through a Region that is loop-head, and the splitting of the Region introduces a second loop entry -> irreducible loop. >> >> Before `split_if`: >> ![image](https://github.com/user-attachments/assets/01bc78fa-7fed-4a8f-b6f4-078dac9b5dc4) >> >> After `split_if`: >> ![image](https://github.com/user-attachments/assets/1e3bd08e-b76d-4e7f-813e-27a5a22cb2bd) >> >> >> - We have the `split_if` for `IfNode::Ideal_common` to do split-if on straight-line code. But we currently execute this before loop-opts, and so we don't know if the region we split through is actually a loop head. We guard against LoopNode, but a Region only becomes a LoopNode in loop-opts. >> - We also have split-if in loop-opts, which is more careful about splitting through loop-heads.
>> - Just removing the straight-line split-if probably leads to a regression, as the loop-opts version only executes if there are loops for example. >> - We could consider delaying the straight-line split-if until after loop-opts. But I don't know if that could lead to regressions in any way. >> >> I discussed this temporary solution with @TobiHartmann : >> - We would like [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) to be unblocked for @shipilev . >> - Convert the assert into a bailout-check, so we are sure we behave correctly in product. Compiling with irreducible loops behaves correctly in almost all cases, but there could be exceptions. >> - For now, have the assert behind a Verify flag, so that [JDK-8348570](https://bugs.openjdk.org/browse/JDK-8348570) is unblocked. Later, we can remove the Verify flag and always enable the assert again. >> - This fix also looks easier to backport. >> >> ----------------------- >> >> The attached regression test now does **NOT** fail by default, but rather silently bails out of compilation. >> >> With the new debug flag `-XX:+VerifyNoNewIrreducibleLoops`, we still hit the assert, as expected: >> >> # Internal Error (/oracle-work/jdk-fork0/open/src/hotspot/share/opto/loopnode.cpp:5636), pid=3698055, tid=3698072 >> # asser...
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/compiler/loopopts/TestSplitIfNewIrreducibleLoop.java > > Co-authored-by: Tobias Hartmann FYI, I filed: [JDK-8350400](https://bugs.openjdk.org/browse/JDK-8350400) C2: split_if should not create irreducible loops ------------- PR Comment: https://git.openjdk.org/jdk/pull/23363#issuecomment-2670604348 From epeter at openjdk.org Thu Feb 20 07:21:45 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 20 Feb 2025 07:21:45 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: Message-ID: > Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. > > **Background** > > With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. > > **Problem** > > So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. > > > MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); > MemorySegment nativeUnaligned = nativeAligned.asSlice(1); > test3(nativeUnaligned); > > > When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! 
> > static void test3(MemorySegment ms) { > for (int i = 0; i < RANGE; i++) { > long adr = i * 4L; > int v = ms.get(ELEMENT_LAYOUT, adr); > ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); > } > } > > > **Solution: Runtime Checks - Predicate and Multiversioning** > > Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. > > I came up with 2 options where to place the runtime checks: > - A new "auto vectorization" Parse Predicate: > - This only works when predicates are available. > - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. > - Multiversion the loop: > - Create 2 copies of the loop (fast and slow loops). > - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take > - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code. > - We "stall" the `... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: adjust selector if probability ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22016/files - new: https://git.openjdk.org/jdk/pull/22016/files/a98ffabf..b3044bc5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22016&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22016&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/22016.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22016/head:pull/22016 PR: https://git.openjdk.org/jdk/pull/22016 From bkilambi at openjdk.org Thu Feb 20 08:51:51 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 20 Feb 2025 08:51:51 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 05:45:34 GMT, Xiaohong Gong wrote: >> @XiaohongGong , thanks I'll give it a try and get back. > > @Bhavana-Kilambi , left shift can not get right indexes here as values `0x2, 0x4` is landed in each B lane. Maybe we can just try with `bsl` for D size types, as it has only two lanes for long/double types with 128-bit vector length. Hi @XiaohongGong , thanks but bsl instruction only has 8B/16B types. not D type. I'll see how I can do this with bsl. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1963101541 From dchuyko at openjdk.org Thu Feb 20 08:54:59 2025 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Thu, 20 Feb 2025 08:54:59 GMT Subject: RFR: 8350258: AArch64: Client build fails after JDK-8347917 In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 23:57:34 GMT, Dean Long wrote: > remove the #if and instead guard with !PreserveFramePointer? It doesn't seem necessary to change the current behavior for Int->C2, especially only for a single platform. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23682#issuecomment-2670837344 From aph at openjdk.org Thu Feb 20 08:56:53 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 20 Feb 2025 08:56:53 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: <8b5akUdVBzLIzK3gYiqOGmmM1eJ2HK_KDWRs8JZ1ZFo=.585d2e3e-b8cc-484e-996a-b4501f33338b@github.com> On Wed, 19 Feb 2025 09:53:40 GMT, Bhavana Kilambi wrote: >> I'm still curious. > > Hi @theRealAph , apologies for the late response. The tbl instruction needs both the source registers to be consecutive and I could not find a way to make the register allocator choose two consecutive registers for this operation and decided to hard code them. As v0-v7 are used for function arguments, v8-v15 are non-volatile which are not needed for this purpose (as we don't want to be preserving these values across function calls), I chose two of the volatile registers from v16-v31 for the source registers. Please let me know if this is the right way to approach. I suppose it is, yes. Thanks. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1963108465 From duke at openjdk.org Thu Feb 20 09:25:52 2025 From: duke at openjdk.org (Marc Chevalier) Date: Thu, 20 Feb 2025 09:25:52 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 12:47:03 GMT, Marc Chevalier wrote: > Remove frem and drem macro nodes when the result is not used. These nodes have other outputs (like memory), which are not meaningful but prevent the nodes from being dropped easily. This patch removes the useless frem/drem nodes by rewiring the inputs to the outputs. > > Thanks, > Marc Thanks for taking a look. I'm not sure what you mean, so I'll try to cover a couple of interpretations I could have. 1.
Still do the rewiring, but return `Top` instead of `new ConINode(TypeInt::ZERO)` => I guess that doesn't change much. In any way, I need to return a fresh node, or `this` from ideal. So I can't return the `phase->C->top()` top node, or something like that, I'd have to create a new one, unlinked. I guess that would work, but it makes very little difference with my version. 2. Not doing the rewiring, at all, just return a `Top` node from `Ideal` => that would mean to replace the current frem/drem by a `Top`, which would propagate through the control output (at least, that's my current understanding), and just kill the flow (make top/unreachable) for everything under. That would terminate the execution there rather than just skipping computation. 3. If we are speaking about types, we could also return top from `Value()` => I think the same as above would happen. But then again, I'm just getting used to this codebase, so I might be wrong. Or maybe I misunderstood and you meant something else! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23694#issuecomment-2670911675 From jbhateja at openjdk.org Thu Feb 20 09:33:59 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 20 Feb 2025 09:33:59 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 03:50:19 GMT, Nicole Xu wrote: > Sure. Since I am very new to openJDK, I asked my teammate for help to file the follow-up RFE. > > Here is the https://bugs.openjdk.org/browse/JDK-8350215 with description of the discussed issues. Hi @xyyNicole , I have modified the benchmark keeping its essence intact, i.e. to use sufficient number of predicated operations within the vector loop while minimizing the noise due to memory operations. Modified the index computation logic which can now withstand any ARRAYLEN without resulting in an IOOBE. 
Removed redundant vector read/writes to instance fields, thus eliminating significant boxing penalty which translates into throughput gains. Please feel free to include it along with this patch. [MaskedLogicOpts.txt](https://github.com/user-attachments/files/18884093/MaskedLogicOpts.txt) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2670932944 From roland at openjdk.org Thu Feb 20 09:46:58 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 20 Feb 2025 09:46:58 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19 Feb 2025 15:23:13 GMT, Emanuel Peter wrote: > Do you see any better way than having the 2x code size if we need both a slow and fast loop? No but I was confused by your comment about 3x and 4x which is why I asked for clarification. Compiled code size affects inlining decisions: if a callee has compiled code and it's larger than some threshold, then the callee is considered too expensive to inline. With your change, some method that was considered ok to inline could now be considered too big. I think that's what Vladimir is concerned by. I don't see what you can do about it, this said. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2670957288 From roland at openjdk.org Thu Feb 20 09:46:58 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 20 Feb 2025 09:46:58 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Thu, 20 Feb 2025 09:39:59 GMT, Roland Westrelin wrote: >>> So the overhead in the final code is 2x: we can expect the fast and slow paths to be about the same size so the section of code for the loop would see its size grow by 2x. >> >> Yes, if you get to the point where you add a multi-version-if condition, i.e. where SuperWord has decided it needs a speculative assumption (here for alignment, later for aliasing), then we get the whole loop 2x. I suppose we could try to make the pre-main-post loop more complicated and just multi-version the main-loop, but that sounds much more complicated. >> >> Do you see any better way than having the 2x code size if we need both a slow and fast loop? > >> Do you see any better way than having the 2x code size if we need both a slow and fast loop? > > No but I was confused by your comment about 3x and 4x which is why I asked for clarification. > Compiled code size affects inlining decisions: if a callee has compiled code and it's larger than some threshold, then the callee is considered too expensive to inline. With your change, some method that was considered ok to inline could now be considered too big. I think that's what Vladimir is concerned by. I don't see what you can do about it, this said. > @rwestrel I think I had tried some verifications above, but I could not even get it to work in all cases in `SuperWord`. 
> > In `VLoop::check_preconditions_helper`, I try to find either the predicate or the multiversioning if. But I cannot always find it, and I think that one reason was that the pre-loop can be lost. At least that is what I remember from 4+ weeks ago. Do you understand when that happens? It doesn't feel right that the pre loop can be lost. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2670971210 From roland at openjdk.org Thu Feb 20 09:47:01 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 20 Feb 2025 09:47:01 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: <47tXBG3sQGZVEE5Ya2wr46CopmDjy8OClbpqagIsjgA=.6d07b495-4777-4c7e-a3b7-820f100ec2c0@github.com> References: <47tXBG3sQGZVEE5Ya2wr46CopmDjy8OClbpqagIsjgA=.6d07b495-4777-4c7e-a3b7-820f100ec2c0@github.com> Message-ID: On Tue, 18 Feb 2025 09:42:17 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopUnswitch.cpp line 513: >> >>> 511: >>> 512: // Create new Region. >>> 513: RegionNode* region = new RegionNode(1); >> >> So we create a new `Region` every time a new condition is added? > > Yes. Are you ok with that? Or would you prefer if we extended an existing region (is that possible?) and then we'd have 2 cases, one where there is none yet, and one where we'd extend. I think adding one each time is easier, and it would get commoned anyway, right? That sounds ok to me. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1963217281 From roland at openjdk.org Thu Feb 20 09:47:03 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 20 Feb 2025 09:47:03 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> Message-ID: On Tue, 18 Feb 2025 10:26:37 GMT, Roland Westrelin wrote: >> @rwestrel do you consider that a blocking issue for this PR here? > > No I filed: https://bugs.openjdk.org/browse/JDK-8350330 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1963215126 From jbhateja at openjdk.org Thu Feb 20 09:47:50 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 20 Feb 2025 09:47:50 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v26] In-Reply-To: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> Message-ID: > Patch promotes the sharing of commutative vector IR with the same inputs but different input ordering. > Similar to scalar IR where we perform edge swapping by [sorting inputs](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L122) based on node indices during IR idealization. > > Following are the performance stats for JMH micro included with the patch. 
>
>
> Granite Rapids (P-core Xeon Server)
> Baseline:
> Benchmark                                                                (size)   Mode  Cnt      Score   Error   Units
> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing     1024  thrpt    2   8982.549          ops/ms
> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing      1024  thrpt    2   6072.773          ops/ms
> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing     1024  thrpt    2   2368.856          ops/ms
> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing    1024  thrpt    2  15215.087          ops/ms
>
> Withopt:
> Benchmark                                                                (size)   Mode  Cnt      Score   Error   Units
> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing     1024  thrpt    2  11963.554          ops/ms
> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing      1024  thrpt    2   7036.088          ops/ms
> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing     1024  thrpt    2   2906.731          ops/ms
> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing    1024  thrpt    2  17148.131          ops/ms
>
> Sierra Forest (E-core Xeon Server)
> Baseline:
> Benchmark                                                                (size)   Mode  Cnt      Score   Error   Units
> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing     1024  thrpt    2   2444.359          ops/ms
> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing      1024  thrpt    2   1710.256          ops/ms
> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing     1024  thrpt    2    308.766          ops/ms
> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing    1024  thrpt    2   3902.179          ops/ms
>
> Withopt:
> Benchmark                                                                (size)   Mode  Cnt      Score   Error   Units
> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing     1024  thrpt    2   3352.839 ...

Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 20 commits: - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342393 - Safety assertion added - Review resolutions - Lowering feature check to IR annotation level - Adding missed feature check - Review comments resolutions. - Modifed scheme not based over fragile node level flags base solution. - Updating comments for clarity - Adding a missed check to skip over commoning of predicated vector operations - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342393 - ... and 10 more: https://git.openjdk.org/jdk/compare/1e87ff01...acb613da ------------- Changes: https://git.openjdk.org/jdk/pull/22863/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22863&range=25 Stats: 789 lines in 4 files changed: 788 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/22863.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22863/head:pull/22863 PR: https://git.openjdk.org/jdk/pull/22863 From chagedorn at openjdk.org Thu Feb 20 09:58:52 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 20 Feb 2025 09:58:52 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 12:47:03 GMT, Marc Chevalier wrote: > Remove frem and drem macros nodes when the result is not used. These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs. > > Thanks, > Marc What I did in my full fix for Assertion Predicates to remove a no-longer-needed CFG node (i.e. treat it as a `nop`) is to just return the input in `Identity()`: Node* TemplateAssertionPredicateNode::Identity(PhaseGVN* phase) { if (phase->C->post_loop_opts_phase() || _useless) { return in(0); } ... } But since we have a `CallNode` with projections, I'm not sure if we could do the same but might be worth a try? 
We then probably need to add the check in `ProjNode::Identity()` and return `in(0)->in(0)` to skip over the `drem/frem` node - similar to what we do in `IfProjNode::Identity()` to skip over the `If` node that has projections: https://github.com/openjdk/jdk/blob/1e87ff01994df16df7de331040fc5d7a4a85f630/src/hotspot/share/opto/ifnode.cpp#L1819 Not sure if we need to do it only for the control projection or all the outgoing projections of the dead `drem/frem`. If both worked, we can still decide if we want that additional `ProjNode::Identity()` method instead of handling it the way you propose it in this PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23694#issuecomment-2671002605 From chagedorn at openjdk.org Thu Feb 20 10:34:51 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 20 Feb 2025 10:34:51 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 12:47:03 GMT, Marc Chevalier wrote: > Remove frem and drem macros nodes when the result is not used. These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs. > > Thanks, > Marc Scratch that - I think it's not well applicable since `ProjNode` is too general. Would have probably worked if we had a special projection node just for `drem/frem` but that would not be worth the trouble. I think it's better the way you have it inside `ModF/ModD`. This also saves a lot of checks. 
-------------

PR Comment: https://git.openjdk.org/jdk/pull/23694#issuecomment-2671092747

From epeter at openjdk.org Thu Feb 20 10:35:06 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Thu, 20 Feb 2025 10:35:06 GMT
Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3]
In-Reply-To:
References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com>
Message-ID:

On Thu, 20 Feb 2025 09:44:16 GMT, Roland Westrelin wrote:

> > @rwestrel I think I had tried some verifications above, but I could not even get it to work in all cases in `SuperWord`.
> > In `VLoop::check_preconditions_helper`, I try to find either the predicate or the multiversioning if. But I cannot always find it, and I think that one reason was that the pre-loop can be lost. At least that is what I remember from 4+ weeks ago.
>
> Do you understand when that happens? It doesn't feel right that the pre loop can be lost.

`VLoop::check_preconditions_helper` has a check like this:

    // To align vector memory accesses in the main-loop, we will have to adjust
    // the pre-loop limit.
    if (_cl->is_main_loop()) {
      CountedLoopEndNode* pre_end = _cl->find_pre_loop_end();
      if (pre_end == nullptr) {
        return VStatus::make_failure(VLoop::FAILURE_PRE_LOOP_LIMIT);
      }
      Node* pre_opaq1 = pre_end->limit();
      if (pre_opaq1->Opcode() != Op_Opaque1) {
        return VStatus::make_failure(VLoop::FAILURE_PRE_LOOP_LIMIT);
      }
      _pre_loop_end = pre_end;
    }

I don't remember exactly why the pre-loop disappears. They are rare cases. The pre-loop somehow folds away, maybe because it only has a single iteration, or just so few that it would never take the backedge.
-------------

PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2671093141

From qamai at openjdk.org Thu Feb 20 11:00:03 2025
From: qamai at openjdk.org (Quan Anh Mai)
Date: Thu, 20 Feb 2025 11:00:03 GMT
Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed
In-Reply-To:
References:
Message-ID:

On Wed, 19 Feb 2025 12:47:03 GMT, Marc Chevalier wrote:

> Remove frem and drem macros nodes when the result is not used. These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs.
>
> Thanks,
> Marc

I think this is getting increasingly ad-hoc for a pretty niche use-case. Can we have a general solution that works for other pure calls (e.g trigonometric functions), too?

Related: [JDK-8347901](https://bugs.openjdk.org/browse/JDK-8347901)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23694#issuecomment-2671153511

From varadam at openjdk.org Thu Feb 20 11:26:56 2025
From: varadam at openjdk.org (Varada M)
Date: Thu, 20 Feb 2025 11:26:56 GMT
Subject: RFR: 8349727: [PPC] C1: Improve Class.isInstance intrinsic [v4]
In-Reply-To:
References:
Message-ID:

On Wed, 19 Feb 2025 16:24:44 GMT, Martin Doerr wrote:

>> PPC64 implementation of [JDK-8337251](https://bugs.openjdk.org/browse/JDK-8337251).
>> The new runtime stub is called like a C function. The initial version therefore used a `FunctionDescriptor` with relocation on PPC64 with ABIv1. I've changed that with the 3rd Commit. `rt_call` jumps directly to the entry point, now.
>>
>> Performance measured on Power10: `make run-test TEST="micro:SecondarySupersLookup" MICRO="VM_OPTIONS=-XX:TieredStopAtLevel=1"`
>>
>> Before this patch (C code)
>>
>> Benchmark                             Mode  Cnt   Score   Error  Units
>> SecondarySupersLookup.testNegative00  avgt   15  18.570 ± 0.009  ns/op
>> ...
>> SecondarySupersLookup.testNegative30  avgt   15  18.566 ± 0.002  ns/op
>> SecondarySupersLookup.testNegative32  avgt   15  19.177 ± 1.347  ns/op
>> SecondarySupersLookup.testNegative40  avgt   15  18.569 ± 0.006  ns/op
>> SecondarySupersLookup.testNegative50  avgt   15  19.207 ± 1.334  ns/op
>> SecondarySupersLookup.testNegative55  avgt   15  19.708 ± 1.338  ns/op
>> SecondarySupersLookup.testNegative56  avgt   15  19.132 ± 0.137  ns/op
>> SecondarySupersLookup.testNegative57  avgt   15  19.133 ± 0.134  ns/op
>> SecondarySupersLookup.testNegative58  avgt   15  19.772 ± 1.316  ns/op
>> SecondarySupersLookup.testNegative59  avgt   15  19.109 ± 0.014  ns/op
>> SecondarySupersLookup.testNegative60  avgt   15  22.381 ± 0.016  ns/op
>> SecondarySupersLookup.testNegative61  avgt   15  22.331 ± 0.011  ns/op
>> SecondarySupersLookup.testNegative62  avgt   15  22.352 ± 0.029  ns/op
>> SecondarySupersLookup.testNegative63  avgt   15  30.371 ± 0.031  ns/op
>> SecondarySupersLookup.testNegative64  avgt   15  29.927 ± 0.221  ns/op
>> SecondarySupersLookup.testPositive01  avgt   15  18.571 ± 0.006  ns/op
>> ...
>> SecondarySupersLookup.testPositive09  avgt   15  18.599 ± 0.140  ns/op
>> SecondarySupersLookup.testPositive10  avgt   15  19.210 ± 1.332  ns/op
>> SecondarySupersLookup.testPositive16  avgt   15  18.603 ± 0.142  ns/op
>> SecondarySupersLookup.testPositive20  avgt   15  19.210 ± 1.333  ns/op
>> SecondarySupersLookup.testPositive30  avgt   15  18.600 ± 0.140  ns/op
>> SecondarySupersLookup.testPositive32  avgt   15  18.637 ± 0.189  ns/op
>> SecondarySupersLookup.testPositive40  avgt   15  19.137 ± 0.190  ns/op
>> SecondarySupersLookup.testPositive50  avgt   15  18.567 ± 0.002  ns/op
>> SecondarySupersLookup.testPositive60  avgt   15  19.069 ± 0.004  ns/op
>> SecondarySupersLookup.testPositive63  avgt   15  26.024 ± 0.017  ns/op
>> SecondarySupersLookup.tes...
>
> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision:
>
>   Add assertion.

Looks good to me!

-------------

Marked as reviewed by varadam (Committer).

PR Review: https://git.openjdk.org/jdk/pull/23602#pullrequestreview-2629559275

From jbhateja at openjdk.org Thu Feb 20 11:37:08 2025
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Thu, 20 Feb 2025 11:37:08 GMT
Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18]
In-Reply-To:
References:
Message-ID:

On Tue, 18 Feb 2025 02:36:13 GMT, Julian Waters wrote:

> Is anyone else getting compile failures after this was integrated? This weirdly seems to only happen on Linux
>
> ```
> * For target hotspot_variant-server_libjvm_objs_mulnode.o:
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp: In member function ‘virtual const Type* FmaHFNode::Value(PhaseGVN*) const’:
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:1944:37: error: call of overloaded ‘make(double)’ is ambiguous
>  1944 |   return TypeH::make(fma(f1, f2, f3));
>       |                                     ^
> In file included from /home/runner/work/jdk/jdk/src/hotspot/share/opto/node.hpp:31,
>                  from /home/runner/work/jdk/jdk/src/hotspot/share/opto/addnode.hpp:28,
>                  from /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:26:
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:544:23: note: candidate: ‘static const TypeH* TypeH::make(float)’
>   544 |   static const TypeH* make(float f);
>       |                       ^~~~
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:545:23: note: candidate: ‘static const TypeH* TypeH::make(short int)’
>   545 |   static const TypeH* make(short f);
>       |                       ^~~~
> ```

Hi @TheShermanTanker,
Please file a separate JBS issue for the errors you are observing with non-standard build options.
I am also seeing some other build issues with the following configuration
--with-extra-cxxflags=-D__CORRECT_ISO_CPP11_MATH_H_PROTO_FP

Best Regards,
Jatin

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2671231948

From aph at openjdk.org Thu Feb 20 11:37:52 2025
From: aph at openjdk.org (Andrew Haley)
Date: Thu, 20 Feb 2025 11:37:52 GMT
Subject: RFR: 8349727: [PPC] C1: Improve Class.isInstance intrinsic [v3]
In-Reply-To:
References:
Message-ID:

On Wed, 19 Feb 2025 16:06:29 GMT, Martin Doerr wrote:

> > Is this change actually worthwhile on PPC?
>
> It's not a big gain. I've rather implemented it for parity with the other platforms. Maybe we want to remove the C++ version at some point of time?

Maybe, but I see no compelling reason to do so.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23602#issuecomment-2671238080

From aph-open at littlepinkcloud.com Thu Feb 20 11:38:00 2025
From: aph-open at littlepinkcloud.com (Andrew Haley)
Date: Thu, 20 Feb 2025 11:38:00 +0000
Subject: When does C1 use the same register for inputs and temps?
Message-ID: <09ea4591-c9c4-422e-9e95-da95ba9f6a6a@littlepinkcloud.com>

Sometimes C1 allocates the same register for inputs and temps. There are
several workarounds in the back ends for this. For example, here in
x86 LIR_Assembler::emit_typecheck_helper:

    Register obj = op->object()->as_register();
    Register k_RInfo = op->tmp1()->as_register();
    ...
    if (obj == k_RInfo) {
      k_RInfo = dst;
    } else if (obj == klass_RInfo) {
      klass_RInfo = dst;
    }

Is it simply that if you want an input reg not to be reused for a
temp, then you should call do_temp() as well as do_input() on the
input arg, as is done in LIR_Op2?

But if that is the case, would it not have been better always to call
do_temp() as well as do_input() on opTypeCheck->_object, thus avoiding
the code I quoted above?

Was it simply that way back when, people were trying to use the
absolute minimum of registers?

Thanks,

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd.
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From mdoerr at openjdk.org Thu Feb 20 12:06:01 2025
From: mdoerr at openjdk.org (Martin Doerr)
Date: Thu, 20 Feb 2025 12:06:01 GMT
Subject: RFR: 8349727: [PPC] C1: Improve Class.isInstance intrinsic [v4]
In-Reply-To:
References:
Message-ID:

On Wed, 19 Feb 2025 16:24:44 GMT, Martin Doerr wrote:

>> PPC64 implementation of [JDK-8337251](https://bugs.openjdk.org/browse/JDK-8337251).
>> The new runtime stub is called like a C function. The initial version therefore used a `FunctionDescriptor` with relocation on PPC64 with ABIv1. I've changed that with the 3rd Commit. `rt_call` jumps directly to the entry point, now.
>>
>> Performance measured on Power10: `make run-test TEST="micro:SecondarySupersLookup" MICRO="VM_OPTIONS=-XX:TieredStopAtLevel=1"`
>>
>> Before this patch (C code)
>>
>> Benchmark                             Mode  Cnt   Score   Error  Units
>> SecondarySupersLookup.testNegative00  avgt   15  18.570 ± 0.009  ns/op
>> ...
>> SecondarySupersLookup.testNegative30  avgt   15  18.566 ± 0.002  ns/op
>> SecondarySupersLookup.testNegative32  avgt   15  19.177 ± 1.347  ns/op
>> SecondarySupersLookup.testNegative40  avgt   15  18.569 ± 0.006  ns/op
>> SecondarySupersLookup.testNegative50  avgt   15  19.207 ± 1.334  ns/op
>> SecondarySupersLookup.testNegative55  avgt   15  19.708 ± 1.338  ns/op
>> SecondarySupersLookup.testNegative56  avgt   15  19.132 ± 0.137  ns/op
>> SecondarySupersLookup.testNegative57  avgt   15  19.133 ± 0.134  ns/op
>> SecondarySupersLookup.testNegative58  avgt   15  19.772 ± 1.316  ns/op
>> SecondarySupersLookup.testNegative59  avgt   15  19.109 ± 0.014  ns/op
>> SecondarySupersLookup.testNegative60  avgt   15  22.381 ± 0.016  ns/op
>> SecondarySupersLookup.testNegative61  avgt   15  22.331 ± 0.011  ns/op
>> SecondarySupersLookup.testNegative62  avgt   15  22.352 ± 0.029  ns/op
>> SecondarySupersLookup.testNegative63  avgt   15  30.371 ± 0.031  ns/op
>> SecondarySupersLookup.testNegative64  avgt   15  29.927 ± 0.221  ns/op
>> SecondarySupersLookup.testPositive01  avgt   15  18.571 ± 0.006  ns/op
>> ...
>> SecondarySupersLookup.testPositive09  avgt   15  18.599 ± 0.140  ns/op
>> SecondarySupersLookup.testPositive10  avgt   15  19.210 ± 1.332  ns/op
>> SecondarySupersLookup.testPositive16  avgt   15  18.603 ± 0.142  ns/op
>> SecondarySupersLookup.testPositive20  avgt   15  19.210 ± 1.333  ns/op
>> SecondarySupersLookup.testPositive30  avgt   15  18.600 ± 0.140  ns/op
>> SecondarySupersLookup.testPositive32  avgt   15  18.637 ± 0.189  ns/op
>> SecondarySupersLookup.testPositive40  avgt   15  19.137 ± 0.190  ns/op
>> SecondarySupersLookup.testPositive50  avgt   15  18.567 ± 0.002  ns/op
>> SecondarySupersLookup.testPositive60  avgt   15  19.069 ± 0.004  ns/op
>> SecondarySupersLookup.testPositive63  avgt   15  26.024 ± 0.017  ns/op
>> SecondarySupersLookup.tes...
>
> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision:
>
>   Add assertion.

Thanks for the reviews and comments! It's a bit faster and does no harm. At least PPC64 is no longer in the way if we want to remove the C++ implementation.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23602#issuecomment-2671298663

From mdoerr at openjdk.org Thu Feb 20 12:06:02 2025
From: mdoerr at openjdk.org (Martin Doerr)
Date: Thu, 20 Feb 2025 12:06:02 GMT
Subject: Integrated: 8349727: [PPC] C1: Improve Class.isInstance intrinsic
In-Reply-To:
References:
Message-ID:

On Wed, 12 Feb 2025 21:01:59 GMT, Martin Doerr wrote:

> PPC64 implementation of [JDK-8337251](https://bugs.openjdk.org/browse/JDK-8337251).
> The new runtime stub is called like a C function. The initial version therefore used a `FunctionDescriptor` with relocation on PPC64 with ABIv1. I've changed that with the 3rd Commit. `rt_call` jumps directly to the entry point, now.
>
> Performance measured on Power10: `make run-test TEST="micro:SecondarySupersLookup" MICRO="VM_OPTIONS=-XX:TieredStopAtLevel=1"`
>
> Before this patch (C code)
>
> Benchmark                             Mode  Cnt   Score   Error  Units
> SecondarySupersLookup.testNegative00  avgt   15  18.570 ± 0.009  ns/op
> ...
> SecondarySupersLookup.testNegative30  avgt   15  18.566 ± 0.002  ns/op
> SecondarySupersLookup.testNegative32  avgt   15  19.177 ± 1.347  ns/op
> SecondarySupersLookup.testNegative40  avgt   15  18.569 ± 0.006  ns/op
> SecondarySupersLookup.testNegative50  avgt   15  19.207 ± 1.334  ns/op
> SecondarySupersLookup.testNegative55  avgt   15  19.708 ± 1.338  ns/op
> SecondarySupersLookup.testNegative56  avgt   15  19.132 ± 0.137  ns/op
> SecondarySupersLookup.testNegative57  avgt   15  19.133 ± 0.134  ns/op
> SecondarySupersLookup.testNegative58  avgt   15  19.772 ± 1.316  ns/op
> SecondarySupersLookup.testNegative59  avgt   15  19.109 ± 0.014  ns/op
> SecondarySupersLookup.testNegative60  avgt   15  22.381 ± 0.016  ns/op
> SecondarySupersLookup.testNegative61  avgt   15  22.331 ± 0.011  ns/op
> SecondarySupersLookup.testNegative62  avgt   15  22.352 ± 0.029  ns/op
> SecondarySupersLookup.testNegative63  avgt   15  30.371 ± 0.031  ns/op
> SecondarySupersLookup.testNegative64  avgt   15  29.927 ± 0.221  ns/op
> SecondarySupersLookup.testPositive01  avgt   15  18.571 ± 0.006  ns/op
> ...
> SecondarySupersLookup.testPositive09  avgt   15  18.599 ± 0.140  ns/op
> SecondarySupersLookup.testPositive10  avgt   15  19.210 ± 1.332  ns/op
> SecondarySupersLookup.testPositive16  avgt   15  18.603 ± 0.142  ns/op
> SecondarySupersLookup.testPositive20  avgt   15  19.210 ± 1.333  ns/op
> SecondarySupersLookup.testPositive30  avgt   15  18.600 ± 0.140  ns/op
> SecondarySupersLookup.testPositive32  avgt   15  18.637 ± 0.189  ns/op
> SecondarySupersLookup.testPositive40  avgt   15  19.137 ± 0.190  ns/op
> SecondarySupersLookup.testPositive50  avgt   15  18.567 ± 0.002  ns/op
> SecondarySupersLookup.testPositive60  avgt   15  19.069 ± 0.004  ns/op
> SecondarySupersLookup.testPositive63  avgt   15  26.024 ± 0.017  ns/op
> SecondarySupersLookup.testPositive64  avgt   15  29.932 ± 1.002  ns/op
>
>
> After this patch (assemble...

This pull request has now been integrated.

Changeset: 735805d9
Author:    Martin Doerr
URL:       https://git.openjdk.org/jdk/commit/735805d9259037ae594eb4f75e96860d43feea5d
Stats:     105 lines in 4 files changed: 82 ins; 10 del; 13 mod

8349727: [PPC] C1: Improve Class.isInstance intrinsic

Reviewed-by: rrich, varadam

-------------

PR: https://git.openjdk.org/jdk/pull/23602

From chagedorn at openjdk.org Thu Feb 20 12:23:37 2025
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Thu, 20 Feb 2025 12:23:37 GMT
Subject: RFR: 8349032: C2: Parse Predicate refactoring in Loop Unswitching broke fix for JDK-8290850
Message-ID:

In the refactoring for [JDK-8344035](https://bugs.openjdk.org/browse/JDK-8344035), the value passed for the `rewire_uncommon_proj_phi_inputs` parameter in `PhaseIdealLoop::create_new_if_for_predicate()` during Loop Unswitching was accidentally flipped. It should only be set to `true` when calling it for a false-path loop, which is the cloned loop. This is currently not the case and leads to a bad graph due to folding nodes wrongly:
https://github.com/openjdk/jdk/blob/735805d9259037ae594eb4f75e96860d43feea5d/src/hotspot/share/opto/predicates.cpp#L84-L88

I fixed this by just flipping the parameter from `is_true_path_loop` to `is_false_path_loop` to avoid a negation. I added an additional comment to `PhaseIdealLoop::create_new_if_for_predicate()` about `rewire_uncommon_proj_phi_inputs`.

More background about why we need `rewire_uncommon_proj_phi_inputs` in the first place can be found in the corresponding fix for [JDK-8290850](https://bugs.openjdk.org/browse/JDK-8290850): https://github.com/openjdk/jdk/pull/11452, and additionally in https://github.com/openjdk/jdk/pull/5185

Thanks,
Christian

-------------

Commit messages:
 - 8349032: C2: Parse Predicate refactoring in Loop Unswitching broke fix for JDK-8290850

Changes: https://git.openjdk.org/jdk/pull/23712/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23712&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8349032
  Stats: 88 lines in 5 files changed: 70 ins; 0 del; 18 mod
  Patch: https://git.openjdk.org/jdk/pull/23712.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23712/head:pull/23712

PR: https://git.openjdk.org/jdk/pull/23712

From thartmann at openjdk.org Thu Feb 20 12:37:57 2025
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Thu, 20 Feb 2025 12:37:57 GMT
Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v5]
In-Reply-To:
References:
Message-ID:

On Mon, 17 Feb 2025 05:13:07 GMT, Jasmine Karthikeyan wrote:

>> Hi all,
>> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculate the leading zeros based on the exponent part of the float. Unfortunately, floats can only represent every integer exactly up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up.
>>
>> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0.
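
Editorial illustration of the rounding problem described in the quoted text above: a minimal scalar sketch in Java, not the AVX2 code from the patch. The class and method names are hypothetical, and the particular mask (clearing the low 8 bits once the value needs more than 24 bits) is an assumption chosen because it is easy to verify; the patch itself may mask differently.

```java
final class LzcntViaFloat {
    // Leading-zero count read from the float exponent, with no masking.
    // Wrong for inputs whose int-to-float conversion rounds up to the next power of two.
    static int naive(int x) {
        if (x == 0) return 32;
        if (x < 0) return 0; // sign bit set: no leading zeros
        int exponent = (Float.floatToRawIntBits((float) x) >>> 23) - 127;
        return 31 - exponent;
    }

    // Same idea, but large inputs first get their low 8 bits cleared (assumed mask).
    // The remaining value has at most 24 significant bits, so the conversion is exact
    // and the highest set bit, which determines the count, is unchanged.
    static int masked(int x) {
        int safe = (x >>> 24) != 0 ? (x & 0xFFFFFF00) : x;
        return naive(safe);
    }

    public static void main(String[] args) {
        int x = 0x01FFFFFF;
        System.out.println(Integer.numberOfLeadingZeros(x)); // 7
        System.out.println(naive(x));                        // 6, rounded up across the bit boundary
        System.out.println(masked(x));                       // 7
    }
}
```

The sketch only shows why clearing low bits before the conversion avoids the carry into the next exponent; it says nothing about how the vectorized implementation selects or applies its mask.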
>> Removing these bits prevents the accidental rounding behavior. I've added these cases to `TestNumberOfContinuousZeros`, and removed the fixed random seed so that it can produce random inputs to test with.
>>
>> Reviews would be appreciated!
>
> Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision:
>
>  - Merge branch 'master' into fix-8349637
>  - Comments from review, add exhaustive test
>  - Improve explanation of logic
>  - Comments from code review
>  - Fix CountLeadingZerosV miscompile on AVX2

Fix and tests look good to me. Thanks!

-------------

Marked as reviewed by thartmann (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/23579#pullrequestreview-2629717631

From thartmann at openjdk.org Thu Feb 20 12:51:57 2025
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Thu, 20 Feb 2025 12:51:57 GMT
Subject: RFR: 8349032: C2: Parse Predicate refactoring in Loop Unswitching broke fix for JDK-8290850
In-Reply-To:
References:
Message-ID:

On Thu, 20 Feb 2025 12:18:57 GMT, Christian Hagedorn wrote:

> In the refactoring for [JDK-8344035](https://bugs.openjdk.org/browse/JDK-8344035), the value passed for the `rewire_uncommon_proj_phi_inputs` parameter in `PhaseIdealLoop::create_new_if_for_predicate()` during Loop Unswitching was accidentally flipped. It should only be set to `true` when calling it for a false-path loop, which is the cloned loop. This is currently not the case and leads to a bad graph due to folding nodes wrongly:
> https://github.com/openjdk/jdk/blob/735805d9259037ae594eb4f75e96860d43feea5d/src/hotspot/share/opto/predicates.cpp#L84-L88
>
> I fixed this by just flipping the parameter from `is_true_path_loop` to `is_false_path_loop` to avoid a negation.
> I added an additional comment to `PhaseIdealLoop::create_new_if_for_predicate()` about `rewire_uncommon_proj_phi_inputs`.
>
> More background about why we need `rewire_uncommon_proj_phi_inputs` in the first place can be found in the corresponding fix for [JDK-8290850](https://bugs.openjdk.org/browse/JDK-8290850): https://github.com/openjdk/jdk/pull/11452, and additionally in https://github.com/openjdk/jdk/pull/5185
>
> Thanks,
> Christian

Looks good to me.

-------------

Marked as reviewed by thartmann (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/23712#pullrequestreview-2629752387

From rcastanedalo at openjdk.org Thu Feb 20 13:22:17 2025
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Thu, 20 Feb 2025 13:22:17 GMT
Subject: RFR: 8348645: IGV: visualize live ranges [v4]
In-Reply-To:
References:
Message-ID:

> This changeset extends IGV with live range visualization. It introduces live ranges as first-class IGV entities and displays them along with the control-flow graph in the CFG view. Visualizing liveness information should hopefully make C2's register allocator easier to understand, diagnose, debug, and enhance.
>
> Live ranges are visible in C2 phases where liveness information is available, that is, phases `Initial liveness` to `Fix up spills` at IGV print level 4 or greater. For example, running a debug build of the JVM as follows:
For example, running a debug build of the JVM as follows: > > > java -Xbatch -XX:CompileCommand=IGVPrintLevel,java.util.HashMap::newNode,4 > > > produces the following visualization for the `Initial spilling` phase: > > ![initial-spilling](https://github.com/user-attachments/assets/1ecf74f5-92a8-4866-b1ec-2323bb0c428e) > > Live ranges are first-class IGV entities, meaning that the user can: > > - search, select, and extract them; > > ![search-extract](https://github.com/user-attachments/assets/8e0dfa59-457f-49cb-b2b5-1d202301c79d) > > - examine their properties in the `Properties` window or via tooltips; > > ![properties](https://github.com/user-attachments/assets/68d2d23b-b986-4d2e-835c-b661bce0de23) > > - navigate to related IGV entities via a pop-up menu; and > > ![popup](https://github.com/user-attachments/assets/21de2fef-d36a-42d5-b828-2696d87a18ea) > > - program filters that act om them according to their properties. > > ![filters](https://github.com/user-attachments/assets/e993b067-d0b8-452c-a885-c4e601e31e1c) > > Live ranges are connected to nodes by a use-def relation: a node can define zero or one live ranges, and use multiple live ranges; a live range can be defined and used by multiple nodes. Consequently, a live range in IGV is visible if and only if all its related nodes are visible (fully or semi-transparently). Generally, the start and end of a live range are vertically aligned with the nodes that first define and last use the live range. To reflect accurately the semantics of Phi nodes w.r.t. liveness, the visualization treats live ranges related by Phi nodes specially: live ranges used by a Phi node end at the bottom of the corresponding predecessor basic blocks, whereas live ranges defined by a Phi node start at the top of the node's basic block. 
> The following screenshot shows an example of a Phi node (`48 Phi`) joining live ranges `L8` and `L13` into `L15`:
>
> ![phi](https://github.com/user-attachments/assets/0ef8aa1d-523d-4391-982e-6b74c2016a3c)
>
> The changeset extends the IGV graph printing logic in HotSpot t...

Roberto Castañeda Lozano has updated the pull request incrementally with three additional commits since the last revision:

 - Handle single-block CFGs
 - Open and close live ranges joined by phis in their respective blocks
 - Export liveness information when saving a graph from IGV

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/23558/files
  - new: https://git.openjdk.org/jdk/pull/23558/files/08ee449e..51718b90

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=23558&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23558&range=02-03

  Stats: 108 lines in 5 files changed: 82 ins; 11 del; 15 mod
  Patch: https://git.openjdk.org/jdk/pull/23558.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23558/head:pull/23558

PR: https://git.openjdk.org/jdk/pull/23558

From rcastanedalo at openjdk.org Thu Feb 20 13:22:17 2025
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Thu, 20 Feb 2025 13:22:17 GMT
Subject: RFR: 8348645: IGV: visualize live ranges [v3]
In-Reply-To: <2qylkM_X00_j1T2n41H0Cpuu9G05mCS3CnJAbygO-x8=.19332e68-be26-4dbd-9dcc-5a69d4054284@github.com>
References: <2qylkM_X00_j1T2n41H0Cpuu9G05mCS3CnJAbygO-x8=.19332e68-be26-4dbd-9dcc-5a69d4054284@github.com>
Message-ID:

On Wed, 19 Feb 2025 15:15:34 GMT, Roberto Castañeda Lozano wrote:

> > I noticed that the live ranges are not saved when saving the graph into an xml file (`LIVE_RANGES_ELEMENT` and related tags don't seem to be exported in `Printer.java`). Is this perhaps something you did intentionally (maybe to be added in the future)?
>
> Good catch, thanks! No, I just overlooked this use case. Will fix.

Done (commit 87b31e9e).
------------- PR Comment: https://git.openjdk.org/jdk/pull/23558#issuecomment-2671471968 From rcastanedalo at openjdk.org Thu Feb 20 13:44:53 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 20 Feb 2025 13:44:53 GMT Subject: RFR: 8348645: IGV: visualize live ranges In-Reply-To: References: <8ViA6x7l9mMjBEEfKR3LSICyAh4AANl_mq6wP5TEt9Y=.33b9b461-814f-48df-97da-214d0d44e4c3@github.com> Message-ID: On Wed, 19 Feb 2025 15:25:41 GMT, Roberto Casta?eda Lozano wrote: > > I thought that variables that are joined by the Phi node are still live at the Phi node. Is this not the case? > > No, the usual "multiplex-like" liveness semantics for Phi instructions is to consider the joined variables live-out of their corresponding predecessor blocks and the resulting variable live-in in its block (and defined in parallel with other Phi definitions in the block), see e.g. Definition 4 in Ch. 21.2 in [the SSA book draft](https://pfalcon.github.io/ssabook/latest/book-full.pdf). This is also in line with [C2's handling of Phi nodes in liveness analysis](https://github.com/openjdk/jdk/blob/efbad00c4d7931177ccc5e9bce3b30dfbac94010/src/hotspot/share/opto/live.cpp#L128-L147). > > > Irrespective of that, would it be feasible to add a "termination dash" at the bottom of the line (e.g. at the bottom of `L8`)? > > Yes, that is a good idea, will do, thanks! Done (commit 31e4510e). This turned out to be a bit more involved than I thought, please check that the changes meet your expectations. 
Here is an example of how the initial live ranges related to a phi instruction (`106 Phi`) are now visualized: ![initial-liveness](https://github.com/user-attachments/assets/0f32fe6b-a72d-4cb5-b2ca-7fb9a1c6e178) And here is how the live range `L7` resulting from coalescing `L18`, `L34`, and `L36` is visualized after aggressive coalescing (where SSA is deconstructed): ![aggressive-coalescing](https://github.com/user-attachments/assets/357c084a-6282-4a72-88d0-c7fea211eacb) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23558#issuecomment-2671533097 From rcastanedalo at openjdk.org Thu Feb 20 13:44:54 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 20 Feb 2025 13:44:54 GMT Subject: RFR: 8348645: IGV: visualize live ranges In-Reply-To: References: <8ViA6x7l9mMjBEEfKR3LSICyAh4AANl_mq6wP5TEt9Y=.33b9b461-814f-48df-97da-214d0d44e4c3@github.com> Message-ID: On Wed, 19 Feb 2025 12:03:49 GMT, Damon Fenacci wrote: >>> Thanks for the report Damon, will investigate! >> >> Commit 00169223 should fix the issue, thanks again. > > @robcasloz, I was a bit puzzled by live ranges with Phi nodes but then I noticed that in the description you mention that they are treated somewhat in a special way: >> To reflect accurately the semantics of Phi nodes w.r.t. liveness, the visualization treats live ranges related by Phi nodes specially: live ranges used by a Phi node end at the bottom of the corresponding predecessor basic blocks, whereas live ranges defined by a Phi node start at the top of the node's basic block. > > I thought that variables that are joined by the Phi node are still live at the Phi node. Is this not the case? Or possibly you meant that it is better not to consider them live there (e.g. to reduce the number of live ranges in the block with the Phi node)? > > Irrespective of that, would it be feasible to add a "termination dash" at the bottom of the line (e.g. at the bottom of `L8`)? 
While studying the issues brought up by @dafedafe, I also realized that live ranges of single-block CFGs were not displayed. This is now addressed by commit 51718b90. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23558#issuecomment-2671537556 From duke at openjdk.org Thu Feb 20 14:16:57 2025 From: duke at openjdk.org (simon) Date: Thu, 20 Feb 2025 14:16:57 GMT Subject: RFR: 8349180: Remove redundant initialization in ciField constructor In-Reply-To: References: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> Message-ID: <5FQAlGTqRPunlRzvtkuMiqKSLg9Mrj7eAEzmYsk36Is=.ec2ab377-4bf4-4029-afbe-c7b286b4b9a7@github.com> On Tue, 18 Feb 2025 15:19:26 GMT, Christian Hagedorn wrote: >>> Hello @marc-chevalier! I have already open a PR for this matter. PR is #23480. >> >> Hi @gustavosimon, the JBS issue was already assigned to @marc-chevalier. If you intend to work on an issue, please check the following: >> - The issue is already assigned in JBS? >> - Reach out to the assignee and ask if the person is currently working on the issue or has intentions to do so. If not, they can reassign it to you or someone else on your behalf (if you don't have a JBS account). >> - The issue is unassigned in JBS? >> - Assign the issue to yourself. >> - If you don't have a JBS account: Reach out to someone who can assign it to him/herself on your behalf. >> >> >> This avoids "stealing" work that was in progress or planned to do later or even worse doing completely duplicated work which is unfortunate. > >> @chhagedorn Got it. Actually, when I started working on this, the issue was unassigned. > > Oh, I see - looks like an unfortunate timing! > >> I will ask to @RealCLanger to assign it to me next times. > > Sounds good :-) > >> Can you review my OCA verification? > > We pinged @robilad to review it. Hello @chhagedorn! Any luck about this matter?
------------- PR Comment: https://git.openjdk.org/jdk/pull/23637#issuecomment-2671627092 From epeter at openjdk.org Thu Feb 20 14:20:08 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 20 Feb 2025 14:20:08 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v26] In-Reply-To: References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> Message-ID: <_eQUEjpyCStHV8b-Y78EwjjmX0osl1vtD5iBh0puJbM=.a0fe8595-dc4b-42f2-aaa9-fe4e85ecd39f@github.com> On Thu, 20 Feb 2025 09:47:50 GMT, Jatin Bhateja wrote: >> Patch promotes the sharing of commutative vector IR with the same inputs but different input ordering. >> Similar to scalar IR where we perform edge swapping by [sorting inputs](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L122) based on node indices during IR idealization. >> >> Following are the performance stats for JMH micro included with the patch. >> >> >> Granite Rapids (P-core Xeon Server) >> Baseline : >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 8982.549 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 6072.773 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2368.856 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 15215.087 ops/ms >> >> Withopt: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 11963.554 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 7036.088 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2906.731 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 17148.131 ops/ms >> >> Sierra Forest (E-core Xeon 
Server) >> Baseline: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 2444.359 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 1710.256 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 308.766 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 3902.179 ops/ms >> >> Withopt: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.com... > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 20 commits: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342393 > - Safety assertion added > - Review resolutions > - Lowering feature check to IR annotation level > - Adding missed feature check > - Review comments resolutions. > - Modifed scheme not based over fragile node level flags base solution. > - Updating comments for clarity > - Adding a missed check to skip over commoning of predicated vector operations > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342393 > - ... and 10 more: https://git.openjdk.org/jdk/compare/1e87ff01...acb613da Testing passed. Approved. Thanks for the work @jatin-bhateja :) ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22863#pullrequestreview-2630003553 From duke at openjdk.org Thu Feb 20 14:20:56 2025 From: duke at openjdk.org (Marc Chevalier) Date: Thu, 20 Feb 2025 14:20:56 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 12:47:03 GMT, Marc Chevalier wrote: > Remove frem and drem macros nodes when the result is not used. These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. 
This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs. > > Thanks, > Marc I agree that if we had a notion of pure function (and then, without memory output and such), we could make it more general. It surely would be nice, but it feels out of scope. If such a node gets introduced, it would be pretty easy to refactor. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23694#issuecomment-2671638977 From epeter at openjdk.org Thu Feb 20 14:21:58 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 20 Feb 2025 14:21:58 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: <6yD3oEDRMPClNVVkEi64IAbnT4fOiMbgCjx6xWXU3bk=.1cb181b8-619c-47bd-91cf-2a230566442f@github.com> References: <6yD3oEDRMPClNVVkEi64IAbnT4fOiMbgCjx6xWXU3bk=.1cb181b8-619c-47bd-91cf-2a230566442f@github.com> Message-ID: <3LpsGHdtOTwV7VDOvEoQyoFhnuAokf1Q99FpRsj7XAk=.55db146e-0316-4456-b73c-cd3aef268179@github.com> On Mon, 17 Feb 2025 03:41:11 GMT, Jasmine Karthikeyan wrote: >> Here's the pseudo-code for an implementation with 13 vector instructions. >> >> Let `fp` denote `float` or `double`. >> Correspondingly, let `P` = 24, 53 (precision); `L` = 5, 6; `W` = 2^`L` (lane width). >> >> The code below is pseudo Java and describes `W`-bit lane operations. >> Note that each line corresponds to one vector instruction. >> Further, there's no need for `xtmp3`. >> >> >> // Convert src to floating-point. >> // First ensure that the bit to the right of the leading 1, if any, is 0. >> dst = src >>> 1 >> dst = ~dst & src >> // If available, prefer a conversion instruction that interprets dst as unsigned. >> // Otherwise, a correction is needed later (see further down the code). 
>> dst = fpToRawBits((fp) dst) >> >> // Set xtmp1 = -1 (all one-bits) for later use >> xtmp1 = -1 >> >> // Extract the biased exponent >> xtmp2 = xtmp1 >>> P >> dst = dst >>> (P - 1) >> dst = xtmp2 & dst >> >> // Compute the exponent >> // Set xtmp2 = BIAS >> xtmp2 = xtmp1 >>> (P + 1) >> dst = dst - xtmp2 >> >> // Set xtmp2 = W - 1 >> xtmp2 = xtmp1 >>> (W - L) >> >> // Adjust for special cases. >> >> // We have: src == 0 iff dst < 0 >> // When src == 0, we force the exponent to -1 >> dst = dst >= 0 ? dst : xtmp1 // blend >> >> // When src < 0, we force the exponent to W - 1. >> // This is only needed if the conversion to floating-point above interprets its argument as signed. >> dst = src >= 0 ? dst : xtmp2 // blend >> >> // final result >> dst = xtmp2 - dst > > @rgiulietti Shifting by 1 instead of 24 is a really good idea, it makes showing the validity a lot more simple as you mention. I've applied the suggestion in the latest commit. The updated instruction sequence is also very interesting, I'd like to take a look at it in a followup RFE. I was planning on taking a closer look at the long intrinsic after this patch, since it doesn't use the floating point trick that int does and I was very curious to see what the performance would be like with it. > > @TobiHartmann I've pushed an adapted version of your test that checks for `numberOfLeadingZeros`/`numberOfTrailingZeros` correctness for int and long. Let me know what you think! @jaskarth Would it make sense to add this VectorAPI test as well? 
https://github.com/openjdk/jdk/pull/23579#issuecomment-2659586753 ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2671640619 From jbhateja at openjdk.org Thu Feb 20 15:43:57 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 20 Feb 2025 15:43:57 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v5] In-Reply-To: References: Message-ID: <8VPiSBLMEg7I-KqrqFqPUnJX_epaRbReZOWIzzsa4ak=.c0a43600-5b76-4e00-a49f-e1fad6754cd2@github.com> On Mon, 17 Feb 2025 05:13:07 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
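The rounding problem described in this message can be reproduced in plain scalar Java. The following is a minimal sketch, not the AVX2 code from the pull request: the class and method names are hypothetical, `naive` models the exponent-based trick without any correction, and `masked` applies the `~(x >>> 1) & x` pre-masking step quoted earlier in this thread before converting to `float`.

```java
// Illustrative sketch only; names are hypothetical and this is scalar Java,
// not the vector instruction sequence used by HotSpot.
final class LzcntViaFloat {
    // Leading-zero count derived from the float exponent, with no correction.
    // Goes wrong when (float) x rounds up to the next power of two.
    static int naive(int x) {
        if (x <= 0) {
            return x == 0 ? 32 : 0; // zero and negative inputs handled separately
        }
        return 31 - Math.getExponent((float) x);
    }

    // Same idea, but first clear the bit right below every set bit.
    // The leading one bit stays in place and the bit after it becomes 0,
    // so rounding to 24 bits can no longer carry into the next power of two.
    static int masked(int x) {
        if (x <= 0) {
            return x == 0 ? 32 : 0;
        }
        int y = ~(x >>> 1) & x;
        return 31 - Math.getExponent((float) y);
    }

    public static void main(String[] args) {
        int x = 0x01FFFFFF;
        System.out.println((float) x == (float) 0x02000000); // the two ints share one float value
        System.out.println("expected: " + Integer.numberOfLeadingZeros(x));
        System.out.println("naive:    " + naive(x));
        System.out.println("masked:   " + masked(x));
    }
}
```

Comparing `naive` and `masked` against `Integer.numberOfLeadingZeros` for inputs just below a power of two above 2^24 shows the difference; for smaller inputs both agree because the conversion to `float` is exact.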
The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into fix-8349637 > - Comments from review, add exhaustive test > - Improve explanation of logic > - Comments from code review > - Fix CountLeadingZerosV miscompile on AVX2 Code change looks good to me. [BLEND emulation ](https://github.com/openjdk/jdk/pull/23579#discussion_r1952232091)will improve performance on AVX2 only E-core targets. Thanks for fixing this. Best Regards, Jatin ------------- Marked as reviewed by jbhateja (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23579#pullrequestreview-2630306310 From jbhateja at openjdk.org Thu Feb 20 15:50:56 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 20 Feb 2025 15:50:56 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets [v2] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 12:07:16 GMT, Jatin Bhateja wrote: >>> @jatin-bhateja Doing the transformation to `AndF` would be a more general solution and thus better. >>> >>> > Introducing another new IR "AndF" will again need changes in auto-vectorizer. >>> >>> But currently, `CopySign` and `MoveF2I` are not vectorized anyway so we can do the vectorization of `AndF` in a separate patch without much hassle. `AndF` is vectorized into existing `AndV` nicely so it is not a too complicated work. >> >> Yes, I have a follow-up patch to auto-vectorized CopySign. >> >>> > this patch does not break existing IR invariants >>> >>> Also, what invariant can be broken by transforming `AndI(MoveF2I(x), MoveF2I(y)` into `MoveF2I(AndF(x, y))`? >> >> Hi @merykitty , I meant that in the context of CopySign, targets emit efficient instruction sequences for existing IR (CopySignF/D), this patch simply tuned x86 backend implementation to improve performance. > >> > @jatin-bhateja Doing the transformation to `AndF` would be a more general solution and thus better. 
>> > > Introducing another new IR "AndF" will again need changes in auto-vectorizer. >> > >> > >> > But currently, `CopySign` and `MoveF2I` are not vectorized anyway so we can do the vectorization of `AndF` in a separate patch without much hassle. `AndF` is vectorized into existing `AndV` nicely so it is not a too complicated work. >> >> Yes, I have a follow-up patch to auto-vectorized CopySign. >> >> > > this patch does not break existing IR invariants >> > >> > >> > Also, what invariant can be broken by transforming `AndI(MoveF2I(x), MoveF2I(y)` into `MoveF2I(AndF(x, y))`? >> >> Hi @merykitty , I meant that in the context of CopySign, targets emit efficient instruction sequences for existing IR (CopySignF/D), this patch simply tuned x86 backend implementation to improve performance. > > > Also currently, logical And mask is a long value, in case we opt-in for new AndF/D node creation, to preserve the IR semantics we would also need to perform an integral to floating point constant conversion, this will incur additional memory load penalty since floating-point constants are emitted into the constant table before native method body. > > For the time being, taking CopySign intrinsic route looks reasonable. > @jatin-bhateja let me know when this is ready for more testing / review. > > Quick comment: it seems you are not just optimizing Math.copySign as the PR title says, but also adding vector nodes. Maybe you should update the PR title? Have not looked at the code in detail to suggest a better one yet ;) Hi @eme64 , vectorization is a form of optimization, so the title is generic enough to cover both vector and scalar performance. Let me know if you have other comments. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23386#issuecomment-2671899037 From jbhateja at openjdk.org Thu Feb 20 16:02:08 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 20 Feb 2025 16:02:08 GMT Subject: RFR: 8349138: Optimize Math.copySign API for Intel e-core targets [v2] In-Reply-To: <6I-Otx3thFLIcBATF5ggyk5fHlEQyx-NXJ2sNW_pVsE=.c7d8b1ba-1021-4593-93c7-b61636b98a7e@github.com> References: <6I-Otx3thFLIcBATF5ggyk5fHlEQyx-NXJ2sNW_pVsE=.c7d8b1ba-1021-4593-93c7-b61636b98a7e@github.com> Message-ID: On Thu, 13 Feb 2025 11:01:34 GMT, Quan Anh Mai wrote: > > Also currently, logical And mask is a long value, in case we opt-in for new AndF/D node creation, to preserve the IR semantics we would also need to perform an integral to floating point constant conversion, this will incur additional memory load penalty since floating-point constants are emitted into the constant table before native method body. > > That means we can improve the generation of floating-point constants. > > The reason I object this approach is that it is short-sighted. It's not like we cannot generate similar machine code with the more general approach. Furthermore, after we do `AndF` transformations, this patch is redundant and can be removed entirely. Hi @merykitty , the patch intends to absorb domain crossover penalty due to the movement of floating point arguments to GPRs, if we introduce a floating-point constant load penalty then we may degrade the performance. 
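For readers following this exchange, here is a minimal scalar sketch of what `Math.copySign` computes at the bit level. It is an illustration only, not the x86 backend code under review; the class and method names are hypothetical. The `floatToRawIntBits` and `intBitsToFloat` calls roughly play the role of the `MoveF2I`/`MoveI2F` style transfers between floating-point and integer domains that the discussion is about.

```java
// Illustrative sketch only; names are hypothetical.
final class CopySignBits {
    private static final int SIGN_MASK = 0x8000_0000;

    // Returns the magnitude of the first argument with the sign bit of the second.
    static float copySignBits(float magnitude, float sign) {
        int magnitudeBits = Float.floatToRawIntBits(magnitude) & ~SIGN_MASK; // drop old sign
        int signBit = Float.floatToRawIntBits(sign) & SIGN_MASK;             // keep only the sign
        return Float.intBitsToFloat(magnitudeBits | signBit);
    }

    public static void main(String[] args) {
        System.out.println(copySignBits(1.5f, -0.0f));
        System.out.println(copySignBits(-2.0f, 3.0f));
        System.out.println(copySignBits(1.5f, -0.0f) == Math.copySign(1.5f, -0.0f));
    }
}
```

Written this way, both operands cross from the floating-point domain to integers and back, which is the cost the message above refers to as a domain crossover penalty.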
------------- PR Comment: https://git.openjdk.org/jdk/pull/23386#issuecomment-2671907468 From duke at openjdk.org Thu Feb 20 16:03:09 2025 From: duke at openjdk.org (Johannes Graham) Date: Thu, 20 Feb 2025 16:03:09 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v31] In-Reply-To: References: Message-ID: <_YjZ00oLZAIlIgE1-4YswI_fz7oerQGYVMwsEIb3el8=.67954fd8-63ba-40f5-819c-d111621ba645@github.com> On Tue, 18 Feb 2025 02:32:02 GMT, Jasmine Karthikeyan wrote: >> Johannes Graham has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 45 commits: >> >> - Merge branch 'openjdk:master' into xor_const >> - fix variable names in comments >> - update test >> - address review comments >> - formatting, remove commented tests >> - add IR tests for long, simplify tests for int >> - formatting >> - add sanity asserts to tests >> - re-add tests >> - try fewer tests >> - ... and 35 more: https://git.openjdk.org/jdk/compare/ff52859d...16049cdc > > test/hotspot/jtreg/compiler/c2/irTests/XorINodeIdealizationTests.java line 315: > >> 313: >> 314: @Test >> 315: public int testXorConstRange(int x, int y) { > > Should this have an `@IR` test attached, like `testFoldableXor`? This test is really there to do a minimal correctness test of the xor calculation, rather than verifying the IR. The xor is not expected to be eliminated. 
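As background for the bounds discussion in this thread, the sketch below checks one simple upper bound for the xor of two non-negative ranges by brute force over 4-bit values. It is an assumption-laden illustration: the helper `upperBound` and the class name are hypothetical, and the bound shown (all bits up to the highest bit set in either upper limit) is not necessarily the exact formula used in the patch.

```java
// Illustrative sketch only; names and the exact bound are assumptions.
final class XorUpperBound {
    // For 0 <= x <= hiX and 0 <= y <= hiY, x ^ y cannot have a bit set above
    // the highest bit set in (hiX | hiY), so an all-ones value of that width
    // is a safe upper bound. Inputs must be non-negative.
    static int upperBound(int hiX, int hiY) {
        int bits = hiX | hiY;
        if (bits == 0) {
            return 0;
        }
        // If bit 30 is the highest bit, the shift overflows to Integer.MIN_VALUE
        // and the subtraction wraps to Integer.MAX_VALUE, which is the intended result.
        return (Integer.highestOneBit(bits) << 1) - 1;
    }

    // Exhaustively verifies the bound for all range limits in [0, limit].
    static boolean holdsForAll(int limit) {
        for (int hiX = 0; hiX <= limit; hiX++) {
            for (int hiY = 0; hiY <= limit; hiY++) {
                int bound = upperBound(hiX, hiY);
                for (int x = 0; x <= hiX; x++) {
                    for (int y = 0; y <= hiY; y++) {
                        if ((x ^ y) > bound) {
                            return false;
                        }
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("bound holds for 4-bit ranges: " + holdsForAll(15));
    }
}
```

An exhaustive check over a small bit width is cheap and catches off-by-one mistakes in bound calculations that a handful of hand-picked cases can miss.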
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1963846393 From rgiulietti at openjdk.org Thu Feb 20 16:10:54 2025 From: rgiulietti at openjdk.org (Raffaello Giulietti) Date: Thu, 20 Feb 2025 16:10:54 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: <6yD3oEDRMPClNVVkEi64IAbnT4fOiMbgCjx6xWXU3bk=.1cb181b8-619c-47bd-91cf-2a230566442f@github.com> References: <6yD3oEDRMPClNVVkEi64IAbnT4fOiMbgCjx6xWXU3bk=.1cb181b8-619c-47bd-91cf-2a230566442f@github.com> Message-ID: On Mon, 17 Feb 2025 03:41:11 GMT, Jasmine Karthikeyan wrote: >> Here's the pseudo-code for an implementation with 13 vector instructions. >> >> Let `fp` denote `float` or `double`. >> Correspondingly, let `P` = 24, 53 (precision); `L` = 5, 6; `W` = 2^`L` (lane width). >> >> The code below is pseudo Java and describes `W`-bit lane operations. >> Note that each line corresponds to one vector instruction. >> Further, there's no need for `xtmp3`. >> >> >> // Convert src to floating-point. >> // First ensure that the bit to the right of the leading 1, if any, is 0. >> dst = src >>> 1 >> dst = ~dst & src >> // If available, prefer a conversion instruction that interprets dst as unsigned. >> // Otherwise, a correction is needed later (see further down the code). >> dst = fpToRawBits((fp) dst) >> >> // Set xtmp1 = -1 (all one-bits) for later use >> xtmp1 = -1 >> >> // Extract the biased exponent >> xtmp2 = xtmp1 >>> P >> dst = dst >>> (P - 1) >> dst = xtmp2 & dst >> >> // Compute the exponent >> // Set xtmp2 = BIAS >> xtmp2 = xtmp1 >>> (P + 1) >> dst = dst - xtmp2 >> >> // Set xtmp2 = W - 1 >> xtmp2 = xtmp1 >>> (W - L) >> >> // Adjust for special cases. >> >> // We have: src == 0 iff dst < 0 >> // When src == 0, we force the exponent to -1 >> dst = dst >= 0 ? dst : xtmp1 // blend >> >> // When src < 0, we force the exponent to W - 1. 
>> // This is only needed if the conversion to floating-point above interprets its argument as signed. >> dst = src >= 0 ? dst : xtmp2 // blend >> >> // final result >> dst = xtmp2 - dst > > @rgiulietti Shifting by 1 instead of 24 is a really good idea, it makes showing the validity a lot more simple as you mention. I've applied the suggestion in the latest commit. The updated instruction sequence is also very interesting, I'd like to take a look at it in a followup RFE. I was planning on taking a closer look at the long intrinsic after this patch, since it doesn't use the floating point trick that int does and I was very curious to see what the performance would be like with it. > > @TobiHartmann I've pushed an adapted version of your test that checks for `numberOfLeadingZeros`/`numberOfTrailingZeros` correctness for int and long. Let me know what you think! @jaskarth I think the special handling of max_int is useless, but you may want to integrate now and remove this handling in a followup PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2671958988 From duke at openjdk.org Thu Feb 20 16:25:48 2025 From: duke at openjdk.org (Johannes Graham) Date: Thu, 20 Feb 2025 16:25:48 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v33] In-Reply-To: References: Message-ID: > An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it. 
> > In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items: > - Bounds optimization of xor > - A check for `x ^ x = 0` > - Explicit testing of xor over booleans. > > Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization. It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types. > > --------- > ### Progress > - [x] Change must not contain extraneous whitespace > - [x] Commit message must refer to an issue > - [ ] Change must be properly reviewed (2 reviews required, with at least 2 [Reviewers](https://openjdk.org/bylaws#reviewer)) > > > > ### Reviewers > * [Quan Anh Mai](https://openjdk.org/census#qamai) (@merykitty - Committer); re-review required (review applies to [cf779497](https://git.openjdk.org/jdk/pull/23089/files/cf77949776f7a4601268c7291a5743c2eb164186))
Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: update tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/e8fc6dab..40b1f9c4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=32 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=31-32 Stats: 26 lines in 2 files changed: 18 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From duke at openjdk.org Thu Feb 20 16:25:52 2025 From: duke at openjdk.org (Johannes Graham) Date: Thu, 20 Feb 2025 16:25:52 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v31] In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 02:32:02 GMT, Jasmine Karthikeyan wrote: >> Johannes Graham has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 45 commits: >> >> - Merge branch 'openjdk:master' into xor_const >> - fix variable names in comments >> - update test >> - address review comments >> - formatting, remove commented tests >> - add IR tests for long, simplify tests for int >> - formatting >> - add sanity asserts to tests >> - re-add tests >> - try fewer tests >> - ... and 35 more: https://git.openjdk.org/jdk/compare/ff52859d...16049cdc > > test/hotspot/jtreg/compiler/c2/irTests/XorINodeIdealizationTests.java line 315: > >> 313: >> 314: @Test >> 315: public int testXorConstRange(int x, int y) { > > Should this have an `@IR` test attached, like `testFoldableXor`? 
I've updated that test to make more sense, and include IR checks ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1963929551 From kvn at openjdk.org Thu Feb 20 17:55:52 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 20 Feb 2025 17:55:52 GMT Subject: RFR: 8349032: C2: Parse Predicate refactoring in Loop Unswitching broke fix for JDK-8290850 In-Reply-To: References: Message-ID: <02kXOy7t4Xgo14xQVlrBj_HsBuGapMnXVSZ-rLlJ4S4=.6802a90e-8ca8-4f7c-8a88-b0c3b9cd9403@github.com> On Thu, 20 Feb 2025 12:18:57 GMT, Christian Hagedorn wrote: > In the refactoring for [JDK-8344035](https://bugs.openjdk.org/browse/JDK-8344035), the value passed for the `rewire_uncommon_proj_phi_inputs` parameter in `PhaseIdealLoop::create_new_if_for_predicate()` during Loop Unswitching was accidentally flipped. It should only be set to `true` when calling it for a false-path loop, which is the cloned loop. This is currently not the case and leads to a bad graph due to folding nodes wrongly: > https://github.com/openjdk/jdk/blob/735805d9259037ae594eb4f75e96860d43feea5d/src/hotspot/share/opto/predicates.cpp#L84-L88 > > I fixed this by just flipping the parameter from `is_true_path_loop` to `is_false_path_loop` to avoid a negation. I added an additional comment to `PhaseIdealLoop::create_new_if_for_predicate()` about `rewire_uncommon_proj_phi_inputs`. > > More background about why we need `rewire_uncommon_proj_phi_inputs` in the first place can be found in the corresponding fix for [JDK-8290850](https://bugs.openjdk.org/browse/JDK-8290850): https://github.com/openjdk/jdk/pull/11452, and additionally in https://github.com/openjdk/jdk/pull/5185 > > Thanks, > Christian src/hotspot/share/opto/loopPredicate.cpp line 100: > 98: // new_iff is returned which is an IfTrue projection. This code is also used to clone predicates to cloned loops. 
> 99: // 'rewire_uncommon_proj_phi_inputs' should be set to the non-default value 'true' when called for a false-path loop > 100: // during Loop Unswitching. Just nitpick. Can you return back length of comment's lines? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23712#discussion_r1964092615 From kvn at openjdk.org Thu Feb 20 18:02:54 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 20 Feb 2025 18:02:54 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 09:23:25 GMT, Marc Chevalier wrote: > But then again, I'm just getting used to this codebase, so I might be wrong. Or maybe I misunderstood and you meant something else! No, you are right. Returning TOP is bad idea. It seems we can't avoid manual rewiring. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23694#issuecomment-2672265319 From mli at openjdk.org Thu Feb 20 19:21:56 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 20 Feb 2025 19:21:56 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v2] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 01:39:23 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> fix temp registers; move code > > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1474: > >> 1472: >> 1473: // Compare longwords >> 1474: void C2_MacroAssembler::string_compare_long_LU(Register result, Register strL, Register strU, > > And rename this to `C2_MacroAssembler::string_compare_long_different_encoding`. We can pass one extra param (say `const bool isLU`) to distinguish the two different cases. Also I think we need to pass the `str1` and `str2` from the callsite directly as the final difference calculation needs to respect the order.
The current approach doesn't seem correct: it can only distinguish L and U from the two strings, but it doesn't know the order of the two strings at all. > > Java program that hopefully helps demo the effect of the order of the two strings: > > String author = "author"; > String book = "book"; > String duplicateBook = "book"; > > assertThat(author.compareTo(book)) > .isEqualTo(-1); > assertThat(book.compareTo(author)) > .isEqualTo(1); > assertThat(duplicateBook.compareTo(book)) > .isEqualTo(0); I'll fix this, Thanks! I was wondering why the issue is not caught, seems to me there is some gap in test case for U.compareTo(L), so I created https://github.com/openjdk/jdk/pull/23705, do you mind to help to check it too? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1964206741 From mli at openjdk.org Thu Feb 20 19:32:30 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 20 Feb 2025 19:32:30 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v6] In-Reply-To: <0um0rwt97V6Q8Qen419I2t1lFga92zsYm2fOcLKWSTA=.8568e7bb-0c9c-4375-a274-b40ca3328280@github.com> References: <0um0rwt97V6Q8Qen419I2t1lFga92zsYm2fOcLKWSTA=.8568e7bb-0c9c-4375-a274-b40ca3328280@github.com> Message-ID: <77E_SURXBkDkSvgiblC53q8324DS7sl97cehvpbM9K8=.6cf2f18a-b92c-44ae-a28b-52b46ae7339d@github.com> On Thu, 20 Feb 2025 06:11:51 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with two additional commits since the last revision: >> >> - fix unordered >> - fix > > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp line 244: > >> 242: void reduce_mul_integral_v(Register dst, Register src1, VectorRegister src2, >> 243: VectorRegister vtmp1, VectorRegister vtmp2, BasicType bt, >> 244: uint len, VectorMask vm = Assembler::unmasked); > > Do you mind renaming this param `len` to `vector_length`? Then it will be consistent with friends `compare_integral_v`, `compare_fp_v`, etc in param naming. sure, done. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23580#discussion_r1964218022 From mli at openjdk.org Thu Feb 20 19:32:29 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 20 Feb 2025 19:32:29 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v7] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch to implement MulReductionVI/MulReductionVL/MulReductionVF/MulReductionVD? > This optimization is mainly for the vector API. > On riscv, there are no straightforward instructions to do it, but we can do it with a reduction tree, which could reduce the time complexity to lg(N). > > > Thanks > > ## Test > > ### jtreg > test/jdk/jdk/incubator/vector/ > > ### Performance > > run on bananapi > > master vs patch > > Benchmark | (size) | Mode | Cnt | Score - master | Error - master | Score - patch | Error - patch | Units | Improvement > -- | -- | -- | -- | -- | -- | -- | -- | -- | -- > ByteMaxVector.MULLanes | 1024 | avgt | 10 | 11170.052 | 499.676 | 1294.424 | 8.346 | ns/op | 88.40% > ByteMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 12514.587 | 55.06 | 1413.028 | 0.259 | ns/op | 88.70% > DoubleMaxVector.MULLanes | 1024 | avgt | 10 | 57672.51 | 1750.417 | 4775.633 | 1.454 | ns/op | 91.70% > DoubleMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63656.523 | 1063.048 | 5899.259 | 1.692 | ns/op | 90.70% > FloatMaxVector.MULLanes | 1024 | avgt | 10 | 30997.218 | 728.73 | 2473.069 | 5.84 | ns/op | 92.00% > FloatMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 35515.329 | 227.873 | 3284.17 | 0.608 | ns/op | 90.80% > IntMaxVector.MULLanes | 1024 | avgt | 10 | 31130.453 | 878.261 | 3304.118 | 5.96 | ns/op | 89.40% > IntMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 36851.976 | 394.001 | 3969.407 | 0.511 | ns/op | 89.20% > LongMaxVector.MULLanes | 1024 | avgt | 10 | 58795.752 | 1030.985 | 6883.995 | 3.15 | ns/op | 88.30% > LongMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63642.904 | 386.521 | 7892.735 | 9.359 | ns/op | 87.60% > 
ShortMaxVector.MULLanes | 1024 | avgt | 10 | 16857.441 | 762.428 | 2287.141 | 0.186 | ns/op | 86.40% > ShortMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 21171.375 | 74.684 | 2532.913 | 0.274 | ns/op | 88.00% > > Hamlin Li has updated the pull request incrementally with two additional commits since the last revision: - rename - remove fp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23580/files - new: https://git.openjdk.org/jdk/pull/23580/files/b6882221..487ec26e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23580&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23580&range=05-06 Stats: 144 lines in 5 files changed: 0 ins; 130 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/23580.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23580/head:pull/23580 PR: https://git.openjdk.org/jdk/pull/23580 From bulasevich at openjdk.org Thu Feb 20 20:50:57 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 20 Feb 2025 20:50:57 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v10] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> <9sHQ2GZxt0TERM5ghWCA2hArWxsdIErWZIAEJ9e1N3I=.4928b81a-be09-43a8-94c6-75e7bd645ed9@github.com> Message-ID: On Mon, 17 Feb 2025 18:47:13 GMT, Vladimir Kozlov wrote: >>> Looks good. I will submit testing. >> >> Thank you! >> >> The change is not yet ready for final testing. I still need to remove my raw access workaround in nmethod::oop_at and rebase onto #23512 once it has been integrated. > > @bulasevich my an other PR #23533 is ready. It will conflict with your changes. Are you okay if I push it first? @vnkozlov, @dean-long, @theRealAph, @stefank I have to note that my change causes a performance regression in the DaCapo luindex benchmark on AArch64 when using UseShenandoahGC mode. The regression is solely due to the additional adrp+movk instructions required to access oops from the compiled code. 
My intention was to reduce the nmethod size by moving oops out of the code cache. While this does reduce the space occupied by oops data, on AArch64 with Shenandoah, the extra instructions needed to access oops add back approximately 1% of the total nmethod size, effectively canceling the reduction achieved by moving oops. Given this, I now think it would be better to keep oops in nmethod while still moving other data (relocations, metadata, jvmci_data) out. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2672645456 From dlong at openjdk.org Thu Feb 20 21:24:04 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 20 Feb 2025 21:24:04 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v11] In-Reply-To: <4qam3fEKtXq-7w2fYkhuojgDE73_60todL54yQPhkbQ=.fb1b5c06-73f4-44de-8d78-c26281f2761b@github.com> References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> <4qam3fEKtXq-7w2fYkhuojgDE73_60todL54yQPhkbQ=.fb1b5c06-73f4-44de-8d78-c26281f2761b@github.com> Message-ID: On Tue, 18 Feb 2025 19:23:59 GMT, Boris Ulasevich wrote: >> This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache. >> >> The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density. >> >> Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1–2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark. >> >> The numbers. Immutable data constitutes **~30%** of the nmethod. Mutable data constitutes **~8%** of nmethod.
Example (statistics collected on the CodeCacheStress benchmark): >> - nmethod_count:134000, total_compilation_time: 510460ms >> - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms, >> - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB >> >> Functional testing: jtreg on arm/aarch/x86. >> Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks. >> >> Alternative solution (see comments): In the future, relocations can be moved to _immutable_data. > > Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: > > - Address review comments: cleanup, move fields to avoid padding, fix CodeBlob purge to call os::free, fix nmethod::print, update Layout description > - add a separate adrp_movk function to to support targets located more than 4GB away > - Force the use of movk in combination with adrp and ldr instructions to address scenarios > where os::malloc allocates buffers beyond the typical ±4GB range accessible with adrp > - Fixing TestFindInstMemRecursion test fail with XX:+StressReflectiveCode option: > _relocation_size can exceed 64Kb, in this case _metadata_offset do not fit into int16. > Fix: use _oops_size int16 field to calculate metadata offset > - removing dead code > - a bit of cleanup and addressing review suggestions > - rework movoop for not_supports_instruction_patching case: correcting in ldr_constant and relocations fixup > - remove _code_end_offset > - update jvm.hotspot.code.CodeBlob class > - update: mutable data for all CodeBlobs with relocations > - ... and 2 more: https://git.openjdk.org/jdk/compare/e1d0a9c8...6c3370be If we put the "oop pool" nearby, then we could continue to use the shorter instruction sequence. We could put the oops in a "DataBlob" in the codecache. With a segmented codecache, they could even go in their own segment. I suspect that many nmethods are referencing the same oops.
We could consider storing the oops in a deduplicated structure that GC could quickly scan, similar to OopStorage but reference-counted. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2672710391 From vlivanov at openjdk.org Thu Feb 20 21:31:53 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 20 Feb 2025 21:31:53 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 12:47:03 GMT, Marc Chevalier wrote: > Remove frem and drem macros nodes when the result is not used. These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs. > > Thanks, > Marc FTR I came up with a GVN-based solution (introduced a TupleNode ProjNodes can see through) when I faced a similar task (eliminate a MemBar node). ------------- PR Comment: https://git.openjdk.org/jdk/pull/23694#issuecomment-2672725956 From vlivanov at openjdk.org Thu Feb 20 21:39:53 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 20 Feb 2025 21:39:53 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 10:57:24 GMT, Quan Anh Mai wrote: > I think this is getting increasingly ad-hoc for a pretty niche use-case. Can we have a general solution that works for other pure calls (e.g trigonometric functions), too? Related: [JDK-8347901](https://bugs.openjdk.org/browse/JDK-8347901) There are cases when pure calls are Java methods (e.g., primitive boxing). 
As of now, they are detected in an ad hoc manner (see [JDK-8075052](https://bugs.openjdk.org/browse/JDK-8075052)), but there's an RFE filed to introduce a mechanism to mark such methods: [JDK-8218414](https://bugs.openjdk.org/browse/JDK-8218414) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23694#issuecomment-2672742061 From dlong at openjdk.org Thu Feb 20 22:15:02 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 20 Feb 2025 22:15:02 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v11] In-Reply-To: <4qam3fEKtXq-7w2fYkhuojgDE73_60todL54yQPhkbQ=.fb1b5c06-73f4-44de-8d78-c26281f2761b@github.com> References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> <4qam3fEKtXq-7w2fYkhuojgDE73_60todL54yQPhkbQ=.fb1b5c06-73f4-44de-8d78-c26281f2761b@github.com> Message-ID: On Tue, 18 Feb 2025 19:23:59 GMT, Boris Ulasevich wrote: >> This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache. >> >> The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density. >> >> Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1–2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark. >> >> The numbers. Immutable data constitutes **~30%** of the nmethod. Mutable data constitutes **~8%** of nmethod.
Example (statistics collected on the CodeCacheStress benchmark): >> - nmethod_count:134000, total_compilation_time: 510460ms >> - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms, >> - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB >> >> Functional testing: jtreg on arm/aarch/x86. >> Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks. >> >> Alternative solution (see comments): In the future, relocations can be moved to _immutable_data. > > Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: > > - Address review comments: cleanup, move fields to avoid padding, fix CodeBlob purge to call os::free, fix nmethod::print, update Layout description > - add a separate adrp_movk function to to support targets located more than 4GB away > - Force the use of movk in combination with adrp and ldr instructions to address scenarios > where os::malloc allocates buffers beyond the typical ±4GB range accessible with adrp > - Fixing TestFindInstMemRecursion test fail with XX:+StressReflectiveCode option: > _relocation_size can exceed 64Kb, in this case _metadata_offset do not fit into int16. > Fix: use _oops_size int16 field to calculate metadata offset > - removing dead code > - a bit of cleanup and addressing review suggestions > - rework movoop for not_supports_instruction_patching case: correcting in ldr_constant and relocations fixup > - remove _code_end_offset > - update jvm.hotspot.code.CodeBlob class > - update: mutable data for all CodeBlobs with relocations > - ... and 2 more: https://git.openjdk.org/jdk/compare/e1d0a9c8...6c3370be Also, it seems like there are two kinds of code density we should be concerned about: 1. not polluting icache lines with data 2.
maximizing near calls in the codecache For 1), aligning embedded data on cache line boundaries would help, but for 2) we probably would want to put any nearby DataBlobs in their own codecache segment. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2672805552 From kvn at openjdk.org Thu Feb 20 22:15:03 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 20 Feb 2025 22:15:03 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v11] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> <4qam3fEKtXq-7w2fYkhuojgDE73_60todL54yQPhkbQ=.fb1b5c06-73f4-44de-8d78-c26281f2761b@github.com> Message-ID: On Thu, 20 Feb 2025 22:11:21 GMT, Dean Long wrote: >> Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: >> >> - Address review comments: cleanup, move fields to avoid padding, fix CodeBlob purge to call os::free, fix nmethod::print, update Layout description >> - add a separate adrp_movk function to to support targets located more than 4GB away >> - Force the use of movk in combination with adrp and ldr instructions to address scenarios >> where os::malloc allocates buffers beyond the typical ±4GB range accessible with adrp >> - Fixing TestFindInstMemRecursion test fail with XX:+StressReflectiveCode option: >> _relocation_size can exceed 64Kb, in this case _metadata_offset do not fit into int16. >> Fix: use _oops_size int16 field to calculate metadata offset >> - removing dead code >> - a bit of cleanup and addressing review suggestions >> - rework movoop for not_supports_instruction_patching case: correcting in ldr_constant and relocations fixup >> - remove _code_end_offset >> - update jvm.hotspot.code.CodeBlob class >> - update: mutable data for all CodeBlobs with relocations >> - ... 
and 2 more: https://git.openjdk.org/jdk/compare/e1d0a9c8...6c3370be > > Also, it seems like there are two kinds of code density we should be concerned about: > > 1. not polluting icache lines with data > 2. maximizing near calls in the codecache > > For 1), aligning embedded data on cache line boundaries would help, but for 2) we probably would want to put any nearby DataBlobs in their own codecache segment. @dean-long this is a good idea, but for a separate RFE. @bulasevich please file one. For now I agree with moving oops data back to the nmethod blob. Metaspace (klass*, method*) and relocation data are more stable: mostly updated during nmethod publishing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2672806878 From kvn at openjdk.org Thu Feb 20 22:18:52 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 20 Feb 2025 22:18:52 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 12:47:03 GMT, Marc Chevalier wrote: > Remove frem and drem macros nodes when the result is not used. These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs. > > Thanks, > Marc Can we reuse `replace_with_con()` here? ------------- PR Review: https://git.openjdk.org/jdk/pull/23694#pullrequestreview-2631282470 From kvn at openjdk.org Thu Feb 20 22:43:10 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 20 Feb 2025 22:43:10 GMT Subject: RFR: 8350258: AArch64: Client build fails after JDK-8347917 In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 05:53:36 GMT, Dmitry Chuyko wrote: > > Is it from here?: https://github.com/openjdk/jdk/blob/master/src/hotspot/share/prims/jvm.cpp#L379 > > Yes, I mean this check.
Configure also prevents building the VM with JVMCI but without C1 or C2: [jvm-features.m4#L517](https://github.com/openjdk/jdk/blob/master/make/autoconf/jvm-features.m4#L517) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23682#issuecomment-2672855957 From kvn at openjdk.org Thu Feb 20 22:48:52 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 20 Feb 2025 22:48:52 GMT Subject: RFR: 8350258: AArch64: Client build fails after JDK-8347917 In-Reply-To: References: Message-ID: <9iSpmytBN3FVIZctZePtb9-Lw0Dr2ZtzY15KlzP-kVo=.7b17abee-0e52-405f-ae71-af3b719d3788@github.com> On Wed, 19 Feb 2025 23:57:34 GMT, Dean Long wrote: >> The location for rfp should be set in in the register map. In particular, it wasn't set in frame::sender_for_interpreter_frame() if neither C2 nor JVMCI were included. >> >> COMPILER1_OR_COMPILER2 condition is used instead of COMPILER2_OR_JVMCI, which also covers INCLUDE_JVMCI case. > > I think @vnkozlov is right. I don't see where COMPILER1_OR_COMPILER2 is true for JVMCI. Should we use COMPILER1 || COMPILER2_OR_JVMCI, or remove the #if and instead guard with !PreserveFramePointer? 
I was about to suggest adding a comment to avoid confusion, but then I thought what @dean-long suggested is better and doesn't need a comment: `#if defined(COMPILER1) || COMPILER2_OR_JVMCI` We already use such a condition: [threads.cpp#L727](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/threads.cpp#L727) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23682#issuecomment-2672873577 From bulasevich at openjdk.org Thu Feb 20 23:10:06 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 20 Feb 2025 23:10:06 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v11] In-Reply-To: References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> <4qam3fEKtXq-7w2fYkhuojgDE73_60todL54yQPhkbQ=.fb1b5c06-73f4-44de-8d78-c26281f2761b@github.com> Message-ID: On Thu, 20 Feb 2025 21:20:58 GMT, Dean Long wrote: > If we put the "oop pool" nearby, then we could continue to use the shorter instruction sequence. We could put the oops in a "DataBlob" in the codecache. That's an interesting approach! By putting oops together, we can help the GC. However, the maximum offset for PC-relative LDR instructions in AArch64 is ±1MB, which is quite short for accessing a common DataBlob.
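
For reference, both ranges mentioned in this thread follow from the AArch64 instruction encodings: `ldr` (literal) carries a 19-bit signed offset counted in 4-byte words, and `adrp` a 21-bit signed offset counted in 4 KiB pages. A minimal Java sketch of that arithmetic follows; the class and method names are made up for illustration and this is not HotSpot code.

```java
// Illustration only: reach of a PC-relative instruction with a signed,
// scaled immediate. Names here are hypothetical, not HotSpot APIs.
final class Aarch64Reach {
    // Most negative byte offset encodable with `bits` signed bits scaled by `scale`.
    static long minReach(int bits, long scale) {
        return -(1L << (bits - 1)) * scale;
    }

    // Most positive byte offset encodable with `bits` signed bits scaled by `scale`.
    static long maxReach(int bits, long scale) {
        return ((1L << (bits - 1)) - 1) * scale;
    }

    public static void main(String[] args) {
        // ldr (literal): imm19, scaled by 4 bytes -> roughly +/-1 MiB
        System.out.println("ldr literal: " + minReach(19, 4) + " .. " + maxReach(19, 4));
        // adrp: imm21, scaled by 4096-byte pages -> roughly +/-4 GiB
        System.out.println("adrp:        " + minReach(21, 4096) + " .. " + maxReach(21, 4096));
    }
}
```

The first line prints `-1048576 .. 1048572` and the second `-4294967296 .. 4294963200`, which is why a malloc'ed buffer can fall outside what `adrp` alone can address and why a shared DataBlob would have to sit within about 1 MiB of every `ldr` that uses it.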
------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2672908511 From dlong at openjdk.org Thu Feb 20 23:53:02 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 20 Feb 2025 23:53:02 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v11] In-Reply-To: <4qam3fEKtXq-7w2fYkhuojgDE73_60todL54yQPhkbQ=.fb1b5c06-73f4-44de-8d78-c26281f2761b@github.com> References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> <4qam3fEKtXq-7w2fYkhuojgDE73_60todL54yQPhkbQ=.fb1b5c06-73f4-44de-8d78-c26281f2761b@github.com> Message-ID: On Tue, 18 Feb 2025 19:23:59 GMT, Boris Ulasevich wrote: >> This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache. >> >> The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density. >> >> Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1–2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark. >> >> The numbers. Immutable data constitutes **~30%** of the nmethod. Mutable data constitutes **~8%** of nmethod. Example (statistics collected on the CodeCacheStress benchmark): >> - nmethod_count:134000, total_compilation_time: 510460ms >> - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms, >> - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB >> >> Functional testing: jtreg on arm/aarch/x86. >> Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks. >> >> Alternative solution (see comments): In the future, relocations can be moved to _immutable_data.
> > Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: > > - Address review comments: cleanup, move fields to avoid padding, fix CodeBlob purge to call os::free, fix nmethod::print, update Layout description > - add a separate adrp_movk function to to support targets located more than 4GB away > - Force the use of movk in combination with adrp and ldr instructions to address scenarios > where os::malloc allocates buffers beyond the typical ±4GB range accessible with adrp > - Fixing TestFindInstMemRecursion test fail with XX:+StressReflectiveCode option: > _relocation_size can exceed 64Kb, in this case _metadata_offset do not fit into int16. > Fix: use _oops_size int16 field to calculate metadata offset > - removing dead code > - a bit of cleanup and addressing review suggestions > - rework movoop for not_supports_instruction_patching case: correcting in ldr_constant and relocations fixup > - remove _code_end_offset > - update jvm.hotspot.code.CodeBlob class > - update: mutable data for all CodeBlobs with relocations > - ... and 2 more: https://git.openjdk.org/jdk/compare/e1d0a9c8...6c3370be The ±1MB range is unfortunately small. We would have to place DataBlob "islands" 1MB apart, which would fragment the codecache if they were shared and had a longer lifetime than the nmethod. We could consider having a dedicated register that points to an external oop pool, but there's no guarantee that all the oops would fit in the ldr reg+off range. It's an interesting problem.
------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2672967089 From fyang at openjdk.org Fri Feb 21 00:14:58 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 21 Feb 2025 00:14:58 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v7] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 19:32:29 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch to implement MulReductionVI/MulReductionVL/MulReductionVF/MulReductionVD? >> This optimization is mainly for the vector API. >> On riscv, there are no straightforward instructions to do it, but we can do it with a reduction tree, which could reduce the time complexity to lg(N). >> >> >> Thanks >> >> ## Test >> >> ### jtreg >> test/jdk/jdk/incubator/vector/ >> >> ### Performance >> >> run on bananapi >> >> master vs patch >> >> Benchmark | (size) | Mode | Cnt | Score - master | Error - master | Score - patch | Error - patch | Units | Improvement >> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- >> ByteMaxVector.MULLanes | 1024 | avgt | 10 | 11170.052 | 499.676 | 1294.424 | 8.346 | ns/op | 88.40% >> ByteMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 12514.587 | 55.06 | 1413.028 | 0.259 | ns/op | 88.70% >> DoubleMaxVector.MULLanes | 1024 | avgt | 10 | 57672.51 | 1750.417 | 4775.633 | 1.454 | ns/op | 91.70% >> DoubleMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63656.523 | 1063.048 | 5899.259 | 1.692 | ns/op | 90.70% >> FloatMaxVector.MULLanes | 1024 | avgt | 10 | 30997.218 | 728.73 | 2473.069 | 5.84 | ns/op | 92.00% >> FloatMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 35515.329 | 227.873 | 3284.17 | 0.608 | ns/op | 90.80% >> IntMaxVector.MULLanes | 1024 | avgt | 10 | 31130.453 | 878.261 | 3304.118 | 5.96 | ns/op | 89.40% >> IntMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 36851.976 | 394.001 | 3969.407 | 0.511 | ns/op | 89.20% >> LongMaxVector.MULLanes | 1024 | avgt | 10 | 58795.752 | 1030.985 | 6883.995 | 3.15 | ns/op | 88.30% >> LongMaxVector.MULMaskedLanes | 1024 | avgt | 
10 | 63642.904 | 386.521 | 7892.735 | 9.359 | ns/op | 87.60% >> ShortMaxVector.MULLanes | 1024 | avgt | 10 | 16857.441 | 762.428 | 2287.141 | 0.186 | ns/op | 86.40% >> ShortMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 21171.375 | 74.684 | 2532.913 | 0.274 | ns/op | 88.00% >> >> > > Hamlin Li has updated the pull request incrementally with two additional commits since the last revision: > > - rename > - remove fp Updated change LGTM. Thanks. ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23580#pullrequestreview-2631487633 From sviswanathan at openjdk.org Fri Feb 21 00:41:02 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 21 Feb 2025 00:41:02 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v26] In-Reply-To: References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> Message-ID: On Thu, 20 Feb 2025 09:47:50 GMT, Jatin Bhateja wrote: >> Patch promotes the sharing of commutative vector IR with the same inputs but different input ordering. >> Similar to scalar IR where we perform edge swapping by [sorting inputs](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L122) based on node indices during IR idealization. >> >> Following are the performance stats for JMH micro included with the patch. 
>> >> >> Granite Rapids (P-core Xeon Server) >> Baseline : >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 8982.549 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 6072.773 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2368.856 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 15215.087 ops/ms >> >> Withopt: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 11963.554 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 7036.088 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2906.731 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 17148.131 ops/ms >> >> Sierra Forest (E-core Xeon Server) >> Baseline: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 2444.359 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 1710.256 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 308.766 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 3902.179 ops/ms >> >> Withopt: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.com... > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 20 commits: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342393 > - Safety assertion added > - Review resolutions > - Lowering feature check to IR annotation level > - Adding missed feature check > - Review comments resolutions. 
> - Modifed scheme not based over fragile node level flags base solution. > - Updating comments for clarity > - Adding a missed check to skip over commoning of predicated vector operations > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342393 > - ... and 10 more: https://git.openjdk.org/jdk/compare/1e87ff01...acb613da Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22863#pullrequestreview-2631525723 From xgong at openjdk.org Fri Feb 21 01:20:58 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 21 Feb 2025 01:20:58 GMT Subject: RFR: 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations In-Reply-To: References: Message-ID: <3Fageo_yC9vhkRDaAkYd9-rV6L_FKljH-wQyKY0miZU=.ce84d9dc-5089-465b-8b44-bc5dd15eb970@github.com> On Thu, 13 Feb 2025 01:47:10 GMT, Xiaohong Gong wrote: > Since PR [1] has added several new vector operations in VectorAPI and the X86 backend implementation for them, this patch adds the AArch64 backend part for NEON/SVE architectures. > > The performance of Vector API relative JMH micro benchmarks can improve about 70x ~ 95x on a NVIDIA Grace CPU, which is a 128-bit vector length sve2 architecture, with different UseSVE options. 
Here is the gain details: > > > Benchmark (size) Mode Cnt -XX:UseSVE=0 -XX:UseSVE=1 -XX:UseSVE=2 > ByteMaxVector.SADD 1024 thrpt 30 80.69x 79.70x 80.534x > ByteMaxVector.SADDMasked 1024 thrpt 30 84.08x 85.72x 85.901x > ByteMaxVector.SSUB 1024 thrpt 30 80.46x 80.27x 81.063x > ByteMaxVector.SSUBMasked 1024 thrpt 30 83.96x 85.26x 85.887x > ByteMaxVector.SUADD 1024 thrpt 30 80.43x 80.36x 81.761x > ByteMaxVector.SUADDMasked 1024 thrpt 30 83.40x 84.62x 85.199x > ByteMaxVector.SUSUB 1024 thrpt 30 79.93x 79.22x 79.714x > ByteMaxVector.SUSUBMasked 1024 thrpt 30 82.93x 85.02x 84.726x > ByteMaxVector.UMAX 1024 thrpt 30 78.73x 77.39x 78.220x > ByteMaxVector.UMAXMasked 1024 thrpt 30 82.62x 84.77x 85.531x > ByteMaxVector.UMIN 1024 thrpt 30 79.04x 77.80x 78.471x > ByteMaxVector.UMINMasked 1024 thrpt 30 83.11x 84.86x 86.126x > IntMaxVector.SADD 1024 thrpt 30 83.11x 83.07x 83.183x > IntMaxVector.SADDMasked 1024 thrpt 30 90.67x 91.80x 93.162x > IntMaxVector.SSUB 1024 thrpt 30 83.37x 82.82x 83.317x > IntMaxVector.SSUBMasked 1024 thrpt 30 90.85x 92.87x 94.201x > IntMaxVector.SUADD 1024 thrpt 30 82.76x 81.78x 82.679x > IntMaxVector.SUADDMasked 1024 thrpt 30 90.49x 91.93x 93.155x > IntMaxVector.SUSUB 1024 thrpt 30 82.92x 82.34x 82.525x > IntMaxVector.SUSUBMasked 1024 thrpt 30 90.60x 92.12x 92.951x > IntMaxVector.UMAX 1024 thrpt 30 82.40x 81.85x 82.242x > IntMaxVector.UMAXMasked 1024 thrpt 30 90.30x 92.10x 92.587x > IntMaxVector.UMIN 1024 thrpt 30 82.84x 81.43x 82.801x > IntMaxVector.UMINMasked 1024 thrpt 30 90.43x 91.49x 92.678x > LongMaxVector.SADD 1024 thrpt 30 82.01x 81.74x 82.153x > LongMaxVector... Hi, could anyone please help to take a look at this PR? Thanks a lot in advance! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23608#issuecomment-2673117004 From dlong at openjdk.org Fri Feb 21 01:42:00 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 21 Feb 2025 01:42:00 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v11] In-Reply-To: <4qam3fEKtXq-7w2fYkhuojgDE73_60todL54yQPhkbQ=.fb1b5c06-73f4-44de-8d78-c26281f2761b@github.com> References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> <4qam3fEKtXq-7w2fYkhuojgDE73_60todL54yQPhkbQ=.fb1b5c06-73f4-44de-8d78-c26281f2761b@github.com> Message-ID: <1wpYPmDFmxBZ5rz947YDVXsYqPCcsQ1lC5GXd7O6SIA=.b0c00bc8-905c-4484-bd1c-b1f6b194fdbd@github.com> On Tue, 18 Feb 2025 19:23:59 GMT, Boris Ulasevich wrote: >> This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache. >> >> The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density. >> >> Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1–2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark. >> >> The numbers. Immutable data constitutes **~30%** of the nmethod. Mutable data constitutes **~8%** of nmethod. Example (statistics collected on the CodeCacheStress benchmark): >> - nmethod_count:134000, total_compilation_time: 510460ms >> - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms, >> - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB >> >> Functional testing: jtreg on arm/aarch/x86. >> Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks.
>> >> Alternative solution (see comments): In the future, relocations can be moved to _immutable_data. > > Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: > > - Address review comments: cleanup, move fields to avoid padding, fix CodeBlob purge to call os::free, fix nmethod::print, update Layout description > - add a separate adrp_movk function to support targets located more than 4GB away > - Force the use of movk in combination with adrp and ldr instructions to address scenarios > where os::malloc allocates buffers beyond the typical ±4GB range accessible with adrp > - Fixing TestFindInstMemRecursion test fail with XX:+StressReflectiveCode option: > _relocation_size can exceed 64Kb, in this case _metadata_offset do not fit into int16. > Fix: use _oops_size int16 field to calculate metadata offset > - removing dead code > - a bit of cleanup and addressing review suggestions > - rework movoop for not_supports_instruction_patching case: correcting in ldr_constant and relocations fixup > - remove _code_end_offset > - update jvm.hotspot.code.CodeBlob class > - update: mutable data for all CodeBlobs with relocations > - ... and 2 more: https://git.openjdk.org/jdk/compare/e1d0a9c8...6c3370be Wouldn't most adrp+movk instructions for oops be computing the same or nearby base addresses? We could set up a dedicated base pointer to the external oop table at the beginning of the code, then use ldr $oop_table + offset for each oop reference. Or instead of reserving a dedicated register that can't be used for anything else, we could allocate a regular spillable register, at the cost of worse performance if it needed to be spilled.
------------- PR Comment: https://git.openjdk.org/jdk/pull/21276#issuecomment-2673142727 From fyang at openjdk.org Fri Feb 21 02:03:53 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 21 Feb 2025 02:03:53 GMT Subject: RFR: 8350383: Test: add more test case for string compare (UL case) In-Reply-To: <8njXwM5PksWMNTNd2N_cDZ-kvTB6wAlzGRU3MxNIAqM=.adc0ee3d-9aa7-45cf-a017-5ddf70ad75bc@github.com> References: <8njXwM5PksWMNTNd2N_cDZ-kvTB6wAlzGRU3MxNIAqM=.adc0ee3d-9aa7-45cf-a017-5ddf70ad75bc@github.com> Message-ID: <3fKkxmbCkRAkcHcpBeth-RAG91wjyStQMUfw7ypsl6E=.55de48f2-2817-42ad-b581-f754391dd06a@github.com> On Wed, 19 Feb 2025 20:04:01 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple test case improvement? > > Compared to LL/UU/LU string compare, UL case seems not enough to cover all the code path in intrinsics. This patch is to add these test case for UL string compare. > > NOTE: > * L means string in Latin, > * U means string in utf, > * LL string compare means L.compareTo(L), > * LU string compare means L.compareTo(U), > * and so on. > > Thanks! Looks good to me. Will this cover the case we discussed? https://github.com/openjdk/jdk/pull/23633#discussion_r1960826250 ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23705#pullrequestreview-2631643599 From dean.long at oracle.com Fri Feb 21 02:07:54 2025 From: dean.long at oracle.com (dean.long at oracle.com) Date: Thu, 20 Feb 2025 18:07:54 -0800 Subject: When does C1 use the same register for inputs and temps? In-Reply-To: <09ea4591-c9c4-422e-9e95-da95ba9f6a6a@littlepinkcloud.com> References: <09ea4591-c9c4-422e-9e95-da95ba9f6a6a@littlepinkcloud.com> Message-ID: <3103c8e3-3b42-4bed-901e-77840e98b1f8@oracle.com> On 2/20/25 3:38 AM, Andrew Haley wrote: > Was it simply that way back when, people were trying to use the absolute > minimum of registers? That would be my guess, as it's the only reason I can think of to do it.
It might have been nice to have the decision made in platform-specific code rather than in shared code where it is now. dl From jkarthikeyan at openjdk.org Fri Feb 21 02:25:34 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 21 Feb 2025 02:25:34 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v6] In-Reply-To: References: Message-ID: > Hi all, > This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculate the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. > > This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to `TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. > > Reviews would be appreciated!
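The rounding problem described above can be reproduced in plain scalar Java. The following is a minimal sketch, not the AVX2 code from the patch: the class `LzcntFloatDemo` and both method names are hypothetical, and the mask `x & ~(x >>> 1)` is an assumption chosen for illustration (it clears the bit just below the highest set bit, so rounding to a 24-bit mantissa can no longer carry into the next power of two). It may differ from the mask the patch actually emits.

```java
public class LzcntFloatDemo {
    // Naive approach: convert to float and derive the leading-zero count
    // from the unbiased exponent. Wrong whenever the conversion rounds up
    // across a power-of-two boundary.
    static int lzcntViaFloat(int x) {
        if (x <= 0) {
            return x == 0 ? 32 : 0; // zero and negative inputs handled directly
        }
        int bits = Float.floatToRawIntBits((float) x);
        int exponent = ((bits >>> 23) & 0xFF) - 127; // index of the highest set bit, if no carry
        return 31 - exponent;
    }

    // Illustrative fix (assumed mask, see lead-in): clear every bit that has
    // a set bit directly above it. The highest set bit survives and the bit
    // below it is zero, so rounding cannot reach the next power of two.
    static int lzcntViaFloatMasked(int x) {
        if (x <= 0) {
            return x == 0 ? 32 : 0;
        }
        int masked = x & ~(x >>> 1);
        return lzcntViaFloat(masked);
    }

    public static void main(String[] args) {
        int[] samples = {1, 0x00FFFFFF, 0x01FFFFFF, 0x02000000, 0x7FFFFFFF};
        for (int x : samples) {
            System.out.printf("x=0x%08X expected=%d naive=%d masked=%d%n",
                    x, Integer.numberOfLeadingZeros(x),
                    lzcntViaFloat(x), lzcntViaFloatMasked(x));
        }
    }
}
```

For `0x01FFFFFF` the naive variant reports 6 instead of 7, matching the example in the message, while the masked variant agrees with `Integer.numberOfLeadingZeros`.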
Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Add Vector API Test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23579/files - new: https://git.openjdk.org/jdk/pull/23579/files/e8820bcb..5434085c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23579&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23579&range=04-05 Stats: 41 lines in 1 file changed: 39 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23579.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23579/head:pull/23579 PR: https://git.openjdk.org/jdk/pull/23579 From jkarthikeyan at openjdk.org Fri Feb 21 02:45:55 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 21 Feb 2025 02:45:55 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: <3LpsGHdtOTwV7VDOvEoQyoFhnuAokf1Q99FpRsj7XAk=.55db146e-0316-4456-b73c-cd3aef268179@github.com> References: <6yD3oEDRMPClNVVkEi64IAbnT4fOiMbgCjx6xWXU3bk=.1cb181b8-619c-47bd-91cf-2a230566442f@github.com> <3LpsGHdtOTwV7VDOvEoQyoFhnuAokf1Q99FpRsj7XAk=.55db146e-0316-4456-b73c-cd3aef268179@github.com> Message-ID: On Thu, 20 Feb 2025 14:19:15 GMT, Emanuel Peter wrote: >> @rgiulietti Shifting by 1 instead of 24 is a really good idea, it makes showing the validity a lot more simple as you mention. I've applied the suggestion in the latest commit. The updated instruction sequence is also very interesting, I'd like to take a look at it in a followup RFE. I was planning on taking a closer look at the long intrinsic after this patch, since it doesn't use the floating point trick that int does and I was very curious to see what the performance would be like with it. >> >> @TobiHartmann I've pushed an adapted version of your test that checks for `numberOfLeadingZeros`/`numberOfTrailingZeros` correctness for int and long. Let me know what you think! 
> > @jaskarth Would it make sense to add this VectorAPI test as well? > https://github.com/openjdk/jdk/pull/23579#issuecomment-2659586753 @eme64 I think adding the Vector API test is a good idea. I added a new IR test to the file that exercises the logic with the vector api, and checked that it fails without the patch and passes with it. Let me know what you think! @rgiulietti I agree, I think it would be better to keep this as a bug fix and clean up the logic in a followup patch. I filed an RFE for it: https://bugs.openjdk.org/browse/JDK-8350468 ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2673213072 From jkarthikeyan at openjdk.org Fri Feb 21 02:52:02 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 21 Feb 2025 02:52:02 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v33] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 16:25:48 GMT, Johannes Graham wrote: >> An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. >> >> This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it. >> >> In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items: >> - Bounds optimization of xor >> - A check for `x ^ x = 0` >> - Explicit testing of xor over booleans. >> >> Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization. 
It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types. >> >> --------- >> ### Progress >> - [x] Change must not contain extraneous whitespace >> - [x] Commit message must refer to an issue >> - [ ] Change must be properly reviewed (2 reviews required, with at least 2 [Reviewers](https://openjdk.org/bylaws#reviewer)) >> >> >> >> ### Reviewers >> * [Quan Anh Mai](https://openjdk.org/census#qamai) (@merykitty - Committer) ? Re-review required (review applies to [cf779497](https://git.openjdk.org/jdk/pull/23089/files/cf77949776f7a4601268c7291a5743c2eb164186)) >> >> ### Reviewing >>
Using git >> >> Checkout this PR locally: \ >> `$ git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089` \ >> `$ git checkout pull/23089` >> >> Update a local copy of the PR: \ >> `$ git checkout pull/23089` \ >> `$ git pull https://git.openjdk.org/jdk.git pull/23089/head` >> >>
>>
Using Skara CLI tools >> >> Checkout this PR locally: \ >> `$ git pr checkout 23089` >> >> View PR using the GUI difftool: \ >> `$ git pr show -t 23089` >> >>
>>
Using diff file >> >> Download this PR as a diff file: \ >> https://git.openjdk.org/jdk/pull/23089.diff >> >>
>>
Using Webrev >> >> [Link to Webrev Comment](https://git.openjdk.org/jdk/pull/23089#issuecomment-25939... > > Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: > > update tests Thanks for the update! It looks good to me. ------------- Marked as reviewed by jkarthikeyan (Committer). PR Review: https://git.openjdk.org/jdk/pull/23089#pullrequestreview-2631705601 From jbhateja at openjdk.org Fri Feb 21 05:33:57 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 21 Feb 2025 05:33:57 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v2] In-Reply-To: References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> <9Mx8rZTEbpj0Cxbxl6qYDSipvRA0wWYhjcw2BI7gYzo=.2217a5a9-f031-43db-b0ed-3220a38de382@github.com> Message-ID: <7SktQDifEFWpvzYOaKuurdp2nrj1I7bWsuRyx8-BKjE=.383dcd2c-dd9b-4a01-b17e-e84686722a59@github.com> On Tue, 7 Jan 2025 20:59:26 GMT, Vladimir Ivanov wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> removing spaces > > Nice improvement, Jatin. Hi @iwanowww , Can you kindly review and approve, we need another approval here. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22863#issuecomment-2673505058 From sspitsyn at openjdk.org Fri Feb 21 06:17:01 2025 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Fri, 21 Feb 2025 06:17:01 GMT Subject: RFR: 8350287: Cleanup SA's support for CodeBlob subclasses [v2] In-Reply-To: References: Message-ID: <2CJil-5PMv-VPzBH_Otga48W7AGMybgy0Kr8_XIztJY=.36640c32-77af-4166-96be-488784190ff3@github.com> On Wed, 19 Feb 2025 05:49:56 GMT, Chris Plummer wrote: >> There is a lot of subclassing of CodeBlob types done in SA to mimic hotspot, but most of it is unnecessary. The generic CodeBlob class can handle all support needed by most of the subclasses. 
The only subclasses we need to keep around are NMethod, RuntimeStub, and UpcallStub, since they all have special support in SA. I also kept around RuntimeBlob so RuntimeStub can continue to inherit from it and be consistent with hotspot, but it's not actually necessary, and I'm more than happy to remove it also. >> >> I also cleaned up the PStack support for CodeBlobs. It can just use CodeBlob.getName() rather than trying to figure out the type of the CodeBlob instance to print out the type name. This allows us to get rid of most isXXX() APIs. It also provides more useful output in some cases. >> >> There is some minor loss of functionality in some of the CodeBlob subtypes I removed. For example this is what AdapterBlob.getName() looked like (it is now gone): >> >> >> public String getName() { >> return "AdapterBlob: " + super.getName(); >> } >> >> >> So now we just use the default CodeBlob.getName(), which is what super.getName() would end up executing. I think for AdapterBlob this always returns "I2C/C2I adapters", so now you only get this rather than "AdapterBlob: I2C/C2I adapters". We have a similar loss of getName() detail with MethodHandlesAdapterBlob (now returns "MethodHandles adapters") and VtableBlob (now returns "vtable chunks"). Basically for these 3 CodeBlob types getName() will no longer include the CodeBlob type. I could special case them in CodeBlob.getName() by fetching the kind to determine what the proper name should be. Let me know if you think it is worth it. > > Chris Plummer has updated the pull request incrementally with one additional commit since the last revision: > > Minor improvements. Looks good. ------------- Marked as reviewed by sspitsyn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23684#pullrequestreview-2632089455 From epeter at openjdk.org Fri Feb 21 06:19:00 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 21 Feb 2025 06:19:00 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v3] In-Reply-To: <3LpsGHdtOTwV7VDOvEoQyoFhnuAokf1Q99FpRsj7XAk=.55db146e-0316-4456-b73c-cd3aef268179@github.com> References: <6yD3oEDRMPClNVVkEi64IAbnT4fOiMbgCjx6xWXU3bk=.1cb181b8-619c-47bd-91cf-2a230566442f@github.com> <3LpsGHdtOTwV7VDOvEoQyoFhnuAokf1Q99FpRsj7XAk=.55db146e-0316-4456-b73c-cd3aef268179@github.com> Message-ID: <0YifIyRDrSqN_TqvHD42GY2X_k_tR_smtjOj6qJd0-I=.2e2a4931-eac6-4be5-bf62-1ed0c6c39e23@github.com> On Thu, 20 Feb 2025 14:19:15 GMT, Emanuel Peter wrote: >> @rgiulietti Shifting by 1 instead of 24 is a really good idea, it makes showing the validity a lot more simple as you mention. I've applied the suggestion in the latest commit. The updated instruction sequence is also very interesting, I'd like to take a look at it in a followup RFE. I was planning on taking a closer look at the long intrinsic after this patch, since it doesn't use the floating point trick that int does and I was very curious to see what the performance would be like with it. >> >> @TobiHartmann I've pushed an adapted version of your test that checks for `numberOfLeadingZeros`/`numberOfTrailingZeros` correctness for int and long. Let me know what you think! > > @jaskarth Would it make sense to add this VectorAPI test as well? > https://github.com/openjdk/jdk/pull/23579#issuecomment-2659586753 > @eme64 I think adding the Vector API test is a good idea. I added a new IR test to the file that exercises the logic with the vector api, and checked that it fails without the patch and passes with it. Let me know what you think! @jaskarth That sounds good, thanks for adding it! 
I'm not going to review this, as there are already 3 doing that ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2673570178 From fyang at openjdk.org Fri Feb 21 06:43:23 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 21 Feb 2025 06:43:23 GMT Subject: RFR: 8350480: RISC-V: Relax assertion about registers in C2_MacroAssembler::minmax_fp Message-ID: Hi, please review this trivial change. The current assertion about input registers is more than needed. It requires that `dst`, `src1` and `src2` must be different from each other. But the code only requires that `dst` must be different from `src1` and `src2`. The patch simply relaxes the assertion by removing the unneeded constraint. ------------- Commit messages: - 8350480: RISC-V: Relax assertion about registers in C2_MacroAssembler::minmax_fp Changes: https://git.openjdk.org/jdk/pull/23723/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23723&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350480 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23723.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23723/head:pull/23723 PR: https://git.openjdk.org/jdk/pull/23723 From epeter at openjdk.org Fri Feb 21 06:44:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 21 Feb 2025 06:44:57 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v6] In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 05:14:00 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is.
Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine: >> >> >> Baseline Patch >> Benchmark (SIZE) Mode Cnt Score Error Units Score Error Units Improvement >> VectorSubword.intToByte 1024 avgt 12 200.049 ± 19.787 ns/op 56.228 ± 3.535 ns/op (3.56x) >> VectorSubword.intToShort 1024 avgt 12 179.826 ± 1.539 ns/op 43.332 ± 1.166 ns/op (4.15x) >> VectorSubword.shortToByte 1024 avgt 12 245.580 ± 6.150 ns/op 29.757 ± 1.055 ns/op (8.25x) >> >> >> I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Merge branch 'master' into vectorize-subword > - Address comments from review, refactor test > - Add new conversions to benchmark > - Fix some tests that now vectorize > - Implement widening and address comments from review > - Subword vectorization I had another quick look, and have some additional questions :) src/hotspot/cpu/aarch64/matcher_aarch64.hpp line 203: > 201: static bool is_vector_cast_supported(BasicType from_bt, BasicType to_bt) { > 202: return false; > 203: } Do these simply not exist, or do you just want to leave that to a future RFE so someone else can take care of this?
src/hotspot/cpu/x86/matcher_x86.hpp line 264: > 262: } > 263: > 264: static bool is_vector_cast_supported(BasicType from_bt, BasicType to_bt) { Why does this not live next to `Matcher::match_rule_supported_vector`, would that not be a better fit? src/hotspot/share/opto/superwordVTransformBuilder.cpp line 194: > 192: > 193: // If the use and def types are different, emit a cast node > 194: if (use_bt != def_bt && !p0->is_Convert() && Matcher::is_vector_cast_supported(def_bt, use_bt)) { The usual way we check if a vector instruction is implemented is to use `VectorNode::implemented`. Ah, actually there is a `VectorCastNode::implemented`. Why are you not using that one? ------------- PR Review: https://git.openjdk.org/jdk/pull/23413#pullrequestreview-2632120544 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1964928232 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1964929822 PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r1964934017 From epeter at openjdk.org Fri Feb 21 07:04:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 21 Feb 2025 07:04:57 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 16:14:09 GMT, Vladimir Kozlov wrote: >> @vnkozlov I suggest that I change the probability to something quite low now, just to make sure that the fast-loop is placed nicely. When I do the experiments for aliasing-analysis runtime-checks, then I will be able to benchmark much better for both cases, since it is much easier to create many different cases. At that point, I could still adapt the probabilities to a different constant. Or maybe I can somehow adjust the probabilities in the chain such that they are balanced. 
Like if there is 1 condition, give it `0.5`, if there are 2 give them each `sqrt(0.5)`, if there are `n` then `pow(0.5, 1/n)`, so that once you multiply them you get `pow(pow(0.5, 1/n),n) = 0.5`. We could also set another "target" probability than `0.5`. The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesti ng benchmark results there. >> >> Does that sound ok? >> >>> Can we profile alignment in Interpreter (and C1)? >> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? > >> > Can we profile alignment in Interpreter (and C1)? >> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? > > You should not worry about `-Xcomp` it is testing flag - we can use some default there. > I am fine if you think profiling will not bring us much benefits. Note, I am not asking create counters - just a bit to indicate if we had unaligned access to native memory in a method. In such case we may skip predicate and generate multi versions loop during compilation. 
On other hand, we may have unaligned access only during startup and not later when we compile method. Anyway, it does not affect these changes. > > I will look on changes more later. @vnkozlov I made the change with the probability `PROB_FAIR` -> `PROB_LIKELY_MAG(3)` and ran testing again. @rwestrel Do you want me to find examples for the pre-loop disappearing, I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2673745463 From dchuyko at openjdk.org Fri Feb 21 08:43:30 2025 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Fri, 21 Feb 2025 08:43:30 GMT Subject: RFR: 8350258: AArch64: Client build fails after JDK-8347917 [v2] In-Reply-To: References: Message-ID: > The location for rfp should be set in in the register map. In particular, it wasn't set in frame::sender_for_interpreter_frame() if neither C2 nor JVMCI were included. > > COMPILER1_OR_COMPILER2 condition is used instead of COMPILER2_OR_JVMCI, which also covers INCLUDE_JVMCI case. 
Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: Full #if condition ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23682/files - new: https://git.openjdk.org/jdk/pull/23682/files/8c273575..d157893c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23682&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23682&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23682/head:pull/23682 PR: https://git.openjdk.org/jdk/pull/23682 From dchuyko at openjdk.org Fri Feb 21 08:50:54 2025 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Fri, 21 Feb 2025 08:50:54 GMT Subject: RFR: 8350258: AArch64: Client build fails after JDK-8347917 [v2] In-Reply-To: References: Message-ID: <-3LUunUZxEf0ybgUGLnprYt4T3QZpC7afFsyrRwSwKQ=.fbfa33a9-e16c-485a-b8d1-5a19d2bde57c@github.com> On Fri, 21 Feb 2025 08:43:30 GMT, Dmitry Chuyko wrote: >> The location for rfp should be set in in the register map. In particular, it wasn't set in frame::sender_for_interpreter_frame() if neither C2 nor JVMCI were included. >> >> COMPILER1_OR_COMPILER2 condition is used instead of COMPILER2_OR_JVMCI, which also covers INCLUDE_JVMCI case. > > Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: > > Full #if condition OK, changed to a full condition. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23682#issuecomment-2673946926 From mli at openjdk.org Fri Feb 21 09:51:58 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 21 Feb 2025 09:51:58 GMT Subject: RFR: 8350383: Test: add more test case for string compare (UL case) In-Reply-To: <8njXwM5PksWMNTNd2N_cDZ-kvTB6wAlzGRU3MxNIAqM=.adc0ee3d-9aa7-45cf-a017-5ddf70ad75bc@github.com> References: <8njXwM5PksWMNTNd2N_cDZ-kvTB6wAlzGRU3MxNIAqM=.adc0ee3d-9aa7-45cf-a017-5ddf70ad75bc@github.com> Message-ID: On Wed, 19 Feb 2025 20:04:01 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple test case improvement? > > Compared to LL/UU/LU string compare, UL case seems not enough to cover all the code path in intrinsics. This patch is to add these test case for UL string compare. > > NOTE: > * L means string in Latin, > * U means string in utf, > * LL string compare means L.compareTo(L), > * LU string compare means L.compareTo(U), > * and so on. > > Thanks! > Looks good to me. Will this cover the case we discussed? [#23633 (comment)](https://github.com/openjdk/jdk/pull/23633#discussion_r1960826250) Thank you! It will cover the code path of UL. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23705#issuecomment-2674076533 From mli at openjdk.org Fri Feb 21 09:51:59 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 21 Feb 2025 09:51:59 GMT Subject: Integrated: 8350383: Test: add more test case for string compare (UL case) In-Reply-To: <8njXwM5PksWMNTNd2N_cDZ-kvTB6wAlzGRU3MxNIAqM=.adc0ee3d-9aa7-45cf-a017-5ddf70ad75bc@github.com> References: <8njXwM5PksWMNTNd2N_cDZ-kvTB6wAlzGRU3MxNIAqM=.adc0ee3d-9aa7-45cf-a017-5ddf70ad75bc@github.com> Message-ID: On Wed, 19 Feb 2025 20:04:01 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple test case improvement? > > Compared to LL/UU/LU string compare, UL case seems not enough to cover all the code path in intrinsics. This patch is to add these test case for UL string compare. 
> > NOTE: > * L means string in Latin, > * U means string in utf, > * LL string compare means L.compareTo(L), > * LU string compare means L.compareTo(U), > * and so on. > > Thanks! This pull request has now been integrated. Changeset: c73fead5 Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/c73fead5caea8008586b31a5009c64011637b8cc Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8350383: Test: add more test case for string compare (UL case) Reviewed-by: fyang ------------- PR: https://git.openjdk.org/jdk/pull/23705 From mli at openjdk.org Fri Feb 21 10:00:55 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 21 Feb 2025 10:00:55 GMT Subject: RFR: 8321003: RISC-V: C2 MulReductionVI [v7] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 00:12:12 GMT, Fei Yang wrote: > Updated change LGTM. Thanks. Thank you! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23580#issuecomment-2674098993 From mli at openjdk.org Fri Feb 21 10:29:01 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 21 Feb 2025 10:29:01 GMT Subject: Integrated: 8321003: RISC-V: C2 MulReductionVI In-Reply-To: References: Message-ID: <_rIulS-3ozIEz-M_ils-hdx3HDq8Yzp4jyi8kXt_TNg=.1558c090-b1f6-4f5f-9f77-14a4343a3be8@github.com> On Wed, 12 Feb 2025 09:52:09 GMT, Hamlin Li wrote: > Hi, > Can you help to review this patch to implement MulReductionVI/MulReductionVL/MulReductionVF/MulReductionVD? > This optimization is mainly for the vector API. > On riscv, there is no straightforward instructions to do it, but we can do it with a reduction tree, which could reduce the time complexity to lg(N). 
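The reduction-tree idea mentioned above can be sketched in scalar Java. This is only an illustration of the halving scheme, not the RISC-V code from the patch: the class `MulReductionTreeDemo` and its methods are hypothetical, the lane count is assumed to be a power of two, and each pass of the inner loop stands in for what would be a single vector multiply, so the number of vector steps is lg(N) instead of N - 1.

```java
public class MulReductionTreeDemo {
    // Multiply all lanes together by repeatedly multiplying the lower half
    // with the upper half. Each iteration of the outer loop corresponds to
    // one (conceptual) vector multiply, giving lg(N) steps for N lanes.
    static int mulReduceTree(int[] lanes) {
        int n = lanes.length;
        if (n == 0 || Integer.bitCount(n) != 1) {
            throw new IllegalArgumentException("lane count must be a power of two");
        }
        int[] v = lanes.clone(); // keep the caller's array untouched
        for (int width = n / 2; width >= 1; width /= 2) {
            for (int i = 0; i < width; i++) {
                v[i] *= v[i + width];
            }
        }
        return v[0];
    }

    // Straightforward lane-by-lane reduction, used as a reference.
    static int mulReduceLinear(int[] lanes) {
        int result = 1;
        for (int lane : lanes) {
            result *= lane;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] lanes = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println("tree   = " + mulReduceTree(lanes));
        System.out.println("linear = " + mulReduceLinear(lanes));
    }
}
```

Because int multiplication is associative and commutative even under overflow, both orders give the same result; for the floating-point variants the tree order can round differently from a strictly sequential reduction.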
> > > Thanks > > ## Test > > ### jtreg > test/jdk/jdk/incubator/vector/ > > ### Performance > > run on bananapi > > master vs patch > > Benchmark | (size) | Mode | Cnt | Score - master | Error - master | Score - patch | Error - patch | Units | Improvement > -- | -- | -- | -- | -- | -- | -- | -- | -- | -- > ByteMaxVector.MULLanes | 1024 | avgt | 10 | 11170.052 | 499.676 | 1294.424 | 8.346 | ns/op | 88.40% > ByteMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 12514.587 | 55.06 | 1413.028 | 0.259 | ns/op | 88.70% > DoubleMaxVector.MULLanes | 1024 | avgt | 10 | 57672.51 | 1750.417 | 4775.633 | 1.454 | ns/op | 91.70% > DoubleMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63656.523 | 1063.048 | 5899.259 | 1.692 | ns/op | 90.70% > FloatMaxVector.MULLanes | 1024 | avgt | 10 | 30997.218 | 728.73 | 2473.069 | 5.84 | ns/op | 92.00% > FloatMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 35515.329 | 227.873 | 3284.17 | 0.608 | ns/op | 90.80% > IntMaxVector.MULLanes | 1024 | avgt | 10 | 31130.453 | 878.261 | 3304.118 | 5.96 | ns/op | 89.40% > IntMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 36851.976 | 394.001 | 3969.407 | 0.511 | ns/op | 89.20% > LongMaxVector.MULLanes | 1024 | avgt | 10 | 58795.752 | 1030.985 | 6883.995 | 3.15 | ns/op | 88.30% > LongMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 63642.904 | 386.521 | 7892.735 | 9.359 | ns/op | 87.60% > ShortMaxVector.MULLanes | 1024 | avgt | 10 | 16857.441 | 762.428 | 2287.141 | 0.186 | ns/op | 86.40% > ShortMaxVector.MULMaskedLanes | 1024 | avgt | 10 | 21171.375 | 74.684 | 2532.913 | 0.274 | ns/op | 88.00% > > This pull request has now been integrated.
Changeset: 1b6281d9 Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/1b6281d98cf0e7c5435c563bfedd6f07b79bfa62 Stats: 124 lines in 6 files changed: 123 ins; 0 del; 1 mod 8321003: RISC-V: C2 MulReductionVI 8321004: RISC-V: C2 MulReductionVL Reviewed-by: fyang, rehn ------------- PR: https://git.openjdk.org/jdk/pull/23580 From mli at openjdk.org Fri Feb 21 10:35:51 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 21 Feb 2025 10:35:51 GMT Subject: RFR: 8350480: RISC-V: Relax assertion about registers in C2_MacroAssembler::minmax_fp In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 06:36:52 GMT, Fei Yang wrote: > Hi, please review this trivial change. > The current assertion about the registers is stricter than needed. > It requires that `dst`, `src1` and `src2` must be different from each other. > But the code only requires that `dst` must be different from `src1` and `src2`. > Patch simply relaxes the assertion by removing the unneeded constraint. > fastdebug builds OK with change. Looks good. ------------- Marked as reviewed by mli (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23723#pullrequestreview-2632648676 From aph-open at littlepinkcloud.com Fri Feb 21 10:47:26 2025 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Fri, 21 Feb 2025 10:47:26 +0000 Subject: When does C1 use the same register for inputs and temps? In-Reply-To: <3103c8e3-3b42-4bed-901e-77840e98b1f8@oracle.com> References: <09ea4591-c9c4-422e-9e95-da95ba9f6a6a@littlepinkcloud.com> <3103c8e3-3b42-4bed-901e-77840e98b1f8@oracle.com> Message-ID: On 2/21/25 02:07, dean.long at oracle.com wrote: > On 2/20/25 3:38 AM, Andrew Haley wrote: > >> Was it simply that way back when, people were trying to use the absolute >> minimum of registers? > > That would be my guess, as it's the only reason I can think of to do > it. It might have been nice to have the decision made in > platform-specific code rather than in shared code where it is now. That makes sense, thanks.
I guess they were trying to compile for x32, a machine with only 8 registers. In that case, an operation with, say, 3 inputs and 3 temps would be quite a challenge. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From fjiang at openjdk.org Fri Feb 21 13:34:52 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Fri, 21 Feb 2025 13:34:52 GMT Subject: RFR: 8350480: RISC-V: Relax assertion about registers in C2_MacroAssembler::minmax_fp In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 06:36:52 GMT, Fei Yang wrote: > Hi, please review this trivial change. > The current assertion about the registers is stricter than needed. > It requires that `dst`, `src1` and `src2` must be different from each other. > But the code only requires that `dst` must be different from `src1` and `src2`. > Patch simply relaxes the assertion by removing the unneeded constraint. > fastdebug builds OK with change. Thanks! ------------- Marked as reviewed by fjiang (Committer). PR Review: https://git.openjdk.org/jdk/pull/23723#pullrequestreview-2633053466 From qamai at openjdk.org Fri Feb 21 14:07:54 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 21 Feb 2025 14:07:54 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 14:18:40 GMT, Marc Chevalier wrote: > It surely would be nice, but it feels out of scope. I strongly disagree with this sentiment. This patch fixes one particular issue (dead code elimination) with 1 particular kind of operation (floating-point remainder, which is IMO a really niche operation). This patch does not fix other issues (GVN, scheduling, floating-point remainder alters memory unnecessarily, etc) or does not fix the issue with other similar operations. I don't think the benefits justify adding this band-aid.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23694#issuecomment-2674639653 From thartmann at openjdk.org Fri Feb 21 14:30:52 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 21 Feb 2025 14:30:52 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 14:05:10 GMT, Quan Anh Mai wrote: >> I agree that if we had a notion of pure function (and then, without memory output and such), we could make it more general. It surely would be nice, but it feels out of scope. If such a node gets introduced, it would be pretty easy to refactor. > >> It surely would be nice, but it feels out of scope. > > I strongly disagree with this sentiment. This patch fixes one particular issue (dead code elimination) with 1 particular kind of operation (floating-point remainder, which is IMO a really niche operation). This patch does not fix other issues (GVN, scheduling, floating-point remainder alters memory unnecessarily, etc) or does not fix the issue with other similar operations. I don't think the benefits justify adding this band-aid. @merykitty For context, while this is certainly an edge case, the issue was originally reported by a customer via [JDK-8349364](https://bugs.openjdk.org/browse/JDK-8349364). This patch is essentially a follow-up to [JDK-8345766](https://bugs.openjdk.org/browse/JDK-8345766), which was incomplete in that it did not remove unused operations. I think we all agree that a more general solution would be ideal, but given its complexity, addressing this specific issue first seems like a reasonable approach. Also, @marc-chevalier is new to the team - let's make sure we don't overwhelm him!
:) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23694#issuecomment-2674697794 From dfenacci at openjdk.org Fri Feb 21 14:35:54 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Fri, 21 Feb 2025 14:35:54 GMT Subject: RFR: 8348645: IGV: visualize live ranges [v4] In-Reply-To: References: Message-ID: <4l_qJPy1rXrf2nD_KhAwTL7egnjWVJbL1wxGUQG-Aok=.c5316298-67c2-4ea8-8927-48e797f20641@github.com> On Thu, 20 Feb 2025 13:22:17 GMT, Roberto Castañeda Lozano wrote: >> This changeset extends IGV with live range visualization. It introduces live ranges as first-class IGV entities and displays them along with the control-flow graph in the CFG view. Visualizing liveness information should hopefully make C2's register allocator easier to understand, diagnose, debug, and enhance. >> >> Live ranges are visible in C2 phases where liveness information is available, that is, phases `Initial liveness` to `Fix up spills` at IGV print level 4 or greater. For example, running a debug build of the JVM as follows: >> >> >> java -Xbatch -XX:CompileCommand=IGVPrintLevel,java.util.HashMap::newNode,4 >> >> >> produces the following visualization for the `Initial spilling` phase: >> >> ![initial-spilling](https://github.com/user-attachments/assets/1ecf74f5-92a8-4866-b1ec-2323bb0c428e) >> >> Live ranges are first-class IGV entities, meaning that the user can: >> >> - search, select, and extract them; >> >> ![search-extract](https://github.com/user-attachments/assets/8e0dfa59-457f-49cb-b2b5-1d202301c79d) >> >> - examine their properties in the `Properties` window or via tooltips; >> >> ![properties](https://github.com/user-attachments/assets/68d2d23b-b986-4d2e-835c-b661bce0de23) >> >> - navigate to related IGV entities via a pop-up menu; and >> >> ![popup](https://github.com/user-attachments/assets/21de2fef-d36a-42d5-b828-2696d87a18ea) >> >> - program filters that act on them according to their properties.
>> >> ![filters](https://github.com/user-attachments/assets/e993b067-d0b8-452c-a885-c4e601e31e1c) >> >> Live ranges are connected to nodes by a use-def relation: a node can define zero or one live ranges, and use multiple live ranges; a live range can be defined and used by multiple nodes. Consequently, a live range in IGV is visible if and only if all its related nodes are visible (fully or semi-transparently). Generally, the start and end of a live range are vertically aligned with the nodes that first define and last use the live range. To reflect accurately the semantics of Phi nodes w.r.t. liveness, the visualization treats live ranges related by Phi nodes specially: live ranges used by a Phi node end at the bottom of the corresponding predecessor basic blocks, whereas live ranges defined by a Phi node start at the top of the node's basic block. The following screenshot shows an example of a Phi node (`48 Phi`) joining live ranges `L8` and `L13` into `L15`: >> >> ![phi](https://github.com/user-attachments/assets/0ef8aa1d-523d-4391-982e-6b74c2016a3c... > > Roberto Castañeda Lozano has updated the pull request incrementally with three additional commits since the last revision: > > - Handle single-block CFGs > - Open and close live ranges joined by phis in their respective blocks > - Export liveness information when saving a graph from IGV > Done (commit [31e4510](https://github.com/openjdk/jdk/commit/31e4510e3f315b01e54dcde29ad56d1ac807449c)). This turned out to be a bit more involved than I thought, please check that the changes meet your expectations. Nice! I like that the live range end corresponds to the bottom of the block in the `Phi` case. Thanks @robcasloz!
------------- PR Comment: https://git.openjdk.org/jdk/pull/23558#issuecomment-2674711431 From mli at openjdk.org Fri Feb 21 14:48:12 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 21 Feb 2025 14:48:12 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v3] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch? > > Currently, `string_compare` code is a bit complicated, main reasons include: > 1. it has 2 pieces of code, one each for the LU and UL cases; this is not necessary, as LU and UL behave quite similarly. > 2. it mixes the LL/UU and LU/UL cases together; better to separate them, as they are quite different from each other. > > This is not good for code reading and maintaining. > > > So, this patch does the following refactoring: > 1. merge LU and UL code into one, i.e. remove UL code. > 2. separate the code into 2 methods: LL/UU and LU/UL. > 3. some other misc improvements. > > I could do the following refactoring in a follow-up pr, as in this pr I'm just moving code and removing code, it's easier to do it and review it. In particular the first one, as it needs to rewrite the existing code for the UL/LU case. > 1. move alignment code of `generate_compare_long_string_different_encoding` upwards to `string_compare_long_LU`. > 2. make `SHORT_STRING` case simpler. > > > > Thanks Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 11 additional commits since the last revision: - fix UL and test - Merge branch 'master' into refactor-string-compare - minor - fix temp registers; move code - blank lines - simplify - clean - merge UL and LU - move to functions - move alignment code of LL&UU down from common code path - ...
and 1 more: https://git.openjdk.org/jdk/compare/b6520b23...4f5ae272 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23633/files - new: https://git.openjdk.org/jdk/pull/23633/files/543c8635..4f5ae272 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23633&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23633&range=01-02 Stats: 19487 lines in 923 files changed: 13116 ins; 3380 del; 2991 mod Patch: https://git.openjdk.org/jdk/pull/23633.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23633/head:pull/23633 PR: https://git.openjdk.org/jdk/pull/23633 From mli at openjdk.org Fri Feb 21 15:02:52 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 21 Feb 2025 15:02:52 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v2] In-Reply-To: References: Message-ID: <943dSbjvCXth4cdCZbNbhdrUZuxHW3GCZd5NgoAb5PM=.cc22aa30-3642-43ca-9e05-e06c63ff29b3@github.com> On Thu, 20 Feb 2025 19:19:44 GMT, Hamlin Li wrote: >> src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1474: >> >>> 1472: >>> 1473: // Compare longwords >>> 1474: void C2_MacroAssembler::string_compare_long_LU(Register result, Register strL, Register strU, >> >> And rename this to `C2_MacroAssembler::string_compare_long_different_encoding`. We can pass one extra param (say `const bool isLU`) to distinguish the two different cases. Also I think we need to pass the `str1` and `str2` from the callsite directly as the final difference calculation needs to respect the order. The current approach doesn't seem correct: it can only distinguish L and U from the two strings, but it doesn't know the order of the two strings at all.
>>
>> Java program that hopefully helps demo the effect of the order of the two strings:
>>
>>     String author = "author";
>>     String book = "book";
>>     String duplicateBook = "book";
>>
>>     assertThat(author.compareTo(book))
>>       .isEqualTo(-1);
>>     assertThat(book.compareTo(author))
>>       .isEqualTo(1);
>>     assertThat(duplicateBook.compareTo(book))
>>       .isEqualTo(0);
>
> I'll fix this, Thanks!
>
> I was wondering why the issue is not caught, seems to me there is some gap in test case for U.compareTo(L), so I created https://github.com/openjdk/jdk/pull/23705, do you mind to help to check it too?

I fixed the issue. And also add another simple test case to help catch the potential issue. We could add more tests, but maybe later in another pr. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1965630989 From duke at openjdk.org Fri Feb 21 15:41:34 2025 From: duke at openjdk.org (Marc Chevalier) Date: Fri, 21 Feb 2025 15:41:34 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v2] In-Reply-To: References: Message-ID: > Remove frem and drem macros nodes when the result is not used. These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs.
> > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Fix and use replace_with_con ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23694/files - new: https://git.openjdk.org/jdk/pull/23694/files/a9d58c85..40af6f13 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23694&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23694&range=00-01 Stats: 47 lines in 2 files changed: 2 ins; 24 del; 21 mod Patch: https://git.openjdk.org/jdk/pull/23694.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23694/head:pull/23694 PR: https://git.openjdk.org/jdk/pull/23694 From duke at openjdk.org Fri Feb 21 15:41:34 2025 From: duke at openjdk.org (Marc Chevalier) Date: Fri, 21 Feb 2025 15:41:34 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v2] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 22:16:38 GMT, Vladimir Kozlov wrote: > Can we reuse `replace_with_con()` here? That is a good point. Actually, I had to fix it a bit: it uses `Compile::gvn_replace_by` but is actually only called by `Mod{D,F}Node::Ideal` which both starts with https://github.com/openjdk/jdk/blob/dfcd0df60c60cf89dc01682264a573ad39e61a17/src/hotspot/share/opto/divnode.cpp#L1560-L1562 so actually run only in IGVN. It's benign, but useless work: `PhaseIterGVN::replace_node` is better fit: it does the rewiring and add to the worklist as well, but doesn't do the hashing magic. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23694#issuecomment-2674877423 From duke at openjdk.org Fri Feb 21 16:02:26 2025 From: duke at openjdk.org (Marc Chevalier) Date: Fri, 21 Feb 2025 16:02:26 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants Message-ID: This collapses double shift lefts by constants in a single constant: (x << con1) << con2 => x << (con1 + con2). 
Care must be taken in the case con1 + con2 is bigger than the number of bits in the integer type. In this case, we must simplify to 0. Moreover, the simplification logic of the sign extension trick had to be improved. For instance, we use `(x << 16) >> 16` to convert a 32 bits into a 16 bits integer, with sign extension. When storing this into a 16-bit field, this can be simplified into simple `x`. But in the case where `x` is itself a left-shift expression, say `y << 3`, this PR makes the IR looks like `(y << 19) >> 16` instead of the old `((y << 3) << 16) >> 16`. The former logic didn't handle the case where the left and the right shift have different magnitude. In this PR, I generalize this simplification to cases where the left shift has a larger magnitude than the right shift. This improvement was needed not to miss vectorization opportunities: without the simplification, we have a left shift and a right shift instead of a single left shift, which confuses the type inference. This also works for multiplications by powers of 2 since they are already translated into shifts. 
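The overflow caveat can be checked in plain Java. The sketch below is an illustration of the arithmetic only, not the C2 transformation itself; the class and method names are hypothetical, and it assumes both shift amounts are already in the range 0..31. Java masks an `int` shift amount to its low 5 bits, so `x << 40` behaves like `x << 8`, which is why the collapsed form has to produce 0 explicitly once `con1 + con2` reaches 32.

```java
// Illustrative sketch only; names are hypothetical and this is not the C2 code.
final class ShiftCollapseDemo {
    /** Reference semantics: the two left shifts exactly as written in the source. */
    static int shiftTwice(int x, int con1, int con2) {
        return (x << con1) << con2;
    }

    /** Collapsed form; assumes 0 <= con1 < 32 and 0 <= con2 < 32. */
    static int collapsed(int x, int con1, int con2) {
        int sum = con1 + con2;
        // A single shift by sum would be masked to (sum & 31), so handle sum >= 32 here.
        return sum >= Integer.SIZE ? 0 : x << sum;
    }

    public static void main(String[] args) {
        int x = 0x12345678;
        // In-range case: (x << 3) << 16 equals x << 19.
        System.out.println(shiftTwice(x, 3, 16) == collapsed(x, 3, 16));
        // Overflowing case: (x << 20) << 20 is 0, and so is the collapsed form.
        System.out.println(shiftTwice(x, 20, 20) == collapsed(x, 20, 20));
        // A naive x << 40 is masked to x << 8, which is not 0 for this x.
        System.out.println((x << 40) == (x << 8));
    }
}
```

All three comparisons print `true`. The same reasoning applies to `long` with a 6-bit mask and a threshold of 64.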
Thanks, Marc ------------- Commit messages: - improve simplification of double shifts in stores - actually return a new node - format - fix type bug - clang-format - register for igvn - more tests - collapse lshift with constants Changes: https://git.openjdk.org/jdk/pull/23728/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23728&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8347459 Stats: 272 lines in 5 files changed: 262 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/23728.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23728/head:pull/23728 PR: https://git.openjdk.org/jdk/pull/23728 From kvn at openjdk.org Fri Feb 21 18:37:55 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 21 Feb 2025 18:37:55 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v2] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 15:41:34 GMT, Marc Chevalier wrote: >> Remove frem and drem macros nodes when the result is not used. These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Fix and use replace_with_con Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23694#pullrequestreview-2633926597 From kvn at openjdk.org Fri Feb 21 19:08:01 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 21 Feb 2025 19:08:01 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 07:21:45 GMT, Emanuel Peter wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. 
>> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! >> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. 
>> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > adjust selector if probability How profitable (performance wise) to optimize slow path loop? Can we skip any optimizations for it - treat it as not-Counted? src/hotspot/share/opto/loopTransform.cpp line 3363: > 3361: if (cl->is_pre_loop() || cl->is_post_loop()) return true; > 3362: > 3363: // If we are stalled, check if we can get unstalled. Can you expand comment explaining cases when we "stall" and what it means? src/hotspot/share/opto/loopopts.cpp line 4514: > 4512: // and then rejecting the slow_loop by constant folding the multiversion_if. > 4513: // > 4514: // Therefore, we "stall" the optimization of the slow_loop until we add We don't use "stall" term. We use "delay" - this is what happens here if I understand it correctly. src/hotspot/share/opto/loopopts.cpp line 4520: > 4518: // multiversion_if folds away the "stalled" slow_loop. If we add any > 4519: // speculative assumption, then we mark the OpaqueMultiversioningNode > 4520: // with "unstall_slow_loop", so that the slow_loop can be optimized. 
"unstall_slow_loop" - > "optimize_slow_loop" ------------- PR Review: https://git.openjdk.org/jdk/pull/22016#pullrequestreview-2633960596 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1966019182 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1966028103 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1966032230 From cjplummer at openjdk.org Fri Feb 21 19:12:00 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Fri, 21 Feb 2025 19:12:00 GMT Subject: RFR: 8350287: Cleanup SA's support for CodeBlob subclasses [v2] In-Reply-To: References: Message-ID: <3R9wsUs6rIJPcbg5Rwd2zBfc-PcAur1HquvuL52jh9I=.4896d421-78f9-4dbd-91b4-52c39075cf9d@github.com> On Wed, 19 Feb 2025 05:49:56 GMT, Chris Plummer wrote: >> There is a lot of subclassing of CodeBlob types done in SA to mimic hotspot, but most of it is unnecessary. The generic CodeBlob class can handle all support needed by most of the subclasses. The only subclasses we need to keep around are NMethod, RuntimeStub, and UpcallStub, since they all have special support in SA. I also kept around RuntimeBlob so RuntimeStub can continue to inherit from it and be consistent with hotspot, but it's not actually necessary, and I'm more than happy to remove it also. >> >> I also cleaned up the PStack support for CodeBlobs. It can just use CodeBlob.getName() rather than trying to figure out the type of the CodeBlob instance to print out type name. This allows us to get rid of most isXXX() APIs. It also provides more useful output in some cases. >> >> There is some minor loss of functionality in some of the CodeBlob subtypes I removed. For example this is what AdapterBlob.getName() looked like (it is now gone): >> >> >> public String getName() { >> return "AdapterBlob: " + super.getName(); >> } >> >> >> So now we just use the default CodeBlob.getName(), which is what super.getName() would end up executing.
I think for AdapterBlob this always returns "I2C/C2I adapters", so now you only get this rather than "AdapterBlob: I2C/C2I adapters". We have a similar loss of getName() detail with MethodHandlesAdapterBlob (now returns "MethodHandles adapters") and VtableBlob (now returns "vtable chunks"). Basically for these 3 CodeBlob types getName() will no longer include the CodeBlob type. I could special case them in CodeBlob.getName() by fetching the kind to determine what the proper name should be. Let me know if you think it is worth it. > > Chris Plummer has updated the pull request incrementally with one additional commit since the last revision: > > Minor improvements. Thanks for the reviews Vladimir and Serguei! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23684#issuecomment-2675336648 From cjplummer at openjdk.org Fri Feb 21 19:12:00 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Fri, 21 Feb 2025 19:12:00 GMT Subject: Integrated: 8350287: Cleanup SA's support for CodeBlob subclasses In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 02:08:28 GMT, Chris Plummer wrote: > There is a lot of subclassing of CodeBlob types done in SA to mimic hotspot, but most of it is unnecessary. The generic CodeBlob class can handle all support needed by most of the subclasses. The only subclasses we need to keep around are NMethod, RuntimeStub, and UpcallStub, since they all have special support in SA. I also kept around RuntimeBlob so RuntimeStub can continue to inherit from it and be consistent with hotspot, but it's not actually necessary, and I'm more than happy to remove it also. > > I also cleaned up the PStack support for CodeBlobs. It can just use CodeBlob.getName() rather than trying to figure out the type of the CodeBlob instance to print out type name. This allows us to get rid of most isXXX() APIs. It also provides more useful output in some cases. > > There is some minor loss of functionality in some of the CodeBlob subtypes I removed.
For example this is what AdapterBlob.getName() looked like (it is now gone): > > > public String getName() { > return "AdapterBlob: " + super.getName(); > } > > > So now we just use the default CodeBlob.getName(), which is what super.getName() would end up executing. I think for AdapterBlob this always returns "I2C/C2I adapters", so now you only get this rather than "AdapterBlob: I2C/C2I adapters". We have a similar loss of getName() detail with MethodHandlesAdapterBlob (now returns "MethodHandles adapters") and VtableBlob (now returns "vtable chunks"). Basically for these 3 CodeBlob types getName() will no longer include the CodeBlob type. I could special case them in CodeBlob.getName() by fetching the kind to determine what the proper name should be. Let me know if you think it is worth it. This pull request has now been integrated. Changeset: b45c32cd Author: Chris Plummer URL: https://git.openjdk.org/jdk/commit/b45c32cd4fb55fac4fc5161b9cd76415c69b203b Stats: 698 lines in 15 files changed: 13 ins; 680 del; 5 mod 8350287: Cleanup SA's support for CodeBlob subclasses Reviewed-by: kvn, sspitsyn ------------- PR: https://git.openjdk.org/jdk/pull/23684 From jkarthikeyan at openjdk.org Fri Feb 21 19:13:55 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 21 Feb 2025 19:13:55 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 15:57:30 GMT, Marc Chevalier wrote: > This collapses double shift lefts by constants in a single constant: (x << con1) << con2 => x << (con1 + con2). Care must be taken in the case con1 + con2 is bigger than the number of bits in the integer type. In this case, we must simplify to 0. > > Moreover, the simplification logic of the sign extension trick had to be improved. For instance, we use `(x << 16) >> 16` to convert a 32 bits into a 16 bits integer, with sign extension.
When storing this into a 16-bit field, this can be simplified into simple `x`. But in the case where `x` is itself a left-shift expression, say `y << 3`, this PR makes the IR looks like `(y << 19) >> 16` instead of the old `((y << 3) << 16) >> 16`. The former logic didn't handle the case where the left and the right shift have different magnitude. In this PR, I generalize this simplification to cases where the left shift has a larger magnitude than the right shift. This improvement was needed not to miss vectorization opportunities: without the simplification, we have a left shift and a right shift instead of a single left shift, which confuses the type inference. > > This also works for multiplications by powers of 2 since they are already translated into shifts. > > Thanks, > Marc This is a nice improvement! I'm glad to see that the limitation of Store/Shift folding requiring both shifts to have the same constant is being fixed. I've left some comments here. I also noticed that in `LShiftINode::Ideal`, there is a transform that mentions not breaking `i2s` and `i2b` patterns. It might be worth looking separately to see if this condition can be relaxed because of the change to `StoreNode::Ideal_sign_extended_input`. src/hotspot/share/opto/memnode.cpp line 3574: > 3572: // StoreB ... (RShiftI _ (LShiftI _ (LShiftI _ valIn (conIL - conIR)) conIR ) conIR) > 3573: Node* StoreNode::Ideal_sign_extended_input(PhaseGVN* phase, int num_rejected_bits) { > 3574: Node *val = in(MemNode::ValueIn); It might be good to clean up the other 4 lines in this function to match current style guidelines while you're updating it. src/hotspot/share/opto/mulnode.cpp line 981: > 979: // con0 is assumed to be masked already (as computed by maskShiftAmount) and non-zero > 980: // bt must be T_LONG or T_INT. > 981: static Node* collapseDoubleShiftLeft(PhaseGVN* phase, Node* outer_shift, int con0, BasicType bt) { From the style guide, functions and local variables are named with `snake_case`.
Maybe it could be named `collapse_left_shifts`. src/hotspot/share/opto/mulnode.cpp line 986: > 984: Node* inner_shift = outer_shift->in(1); > 985: int inner_shift_op = inner_shift->Opcode(); > 986: if (inner_shift_op != Op_LShift(bt)) { Since the local variable is otherwise unused, it'd be simpler to do: Suggestion: if (inner_shift->Opcode() != Op_LShift(bt)) { src/hotspot/share/opto/mulnode.cpp line 996: > 994: > 995: if (con0 + con1 >= nbits) { > 996: return ConNode::make(TypeInteger::zero(bt)); It'd be clearer to do this, which is more equivalent but more concise: Suggestion: return phase->zerocon(bt); src/hotspot/share/opto/mulnode.cpp line 1018: > 1016: // constant, flatten the tree: (X+con1)<<con0 ==> X<<con0 + con1<<con0 > 1017: // > 1018: // (X << con1) << con2 ==> X << (con1 + con2) (see collapseDoubleShiftLeft for details) I think it would be better to move this comment where `collapseDoubleShiftLeft` is called in the Ideal function, and same for `LShiftLNode::Ideal`. ------------- PR Review: https://git.openjdk.org/jdk/pull/23728#pullrequestreview-2633935484 PR Review Comment: https://git.openjdk.org/jdk/pull/23728#discussion_r1966017369 PR Review Comment: https://git.openjdk.org/jdk/pull/23728#discussion_r1966021220 PR Review Comment: https://git.openjdk.org/jdk/pull/23728#discussion_r1966018379 PR Review Comment: https://git.openjdk.org/jdk/pull/23728#discussion_r1966004662 PR Review Comment: https://git.openjdk.org/jdk/pull/23728#discussion_r1966006840 From dlong at openjdk.org Fri Feb 21 20:53:53 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 21 Feb 2025 20:53:53 GMT Subject: RFR: 8350258: AArch64: Client build fails after JDK-8347917 [v2] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 08:43:30 GMT, Dmitry Chuyko wrote: >> The location for rfp should be set in the register map. In particular, it wasn't set in frame::sender_for_interpreter_frame() if neither C2 nor JVMCI were included.
>> >> COMPILER1_OR_COMPILER2 condition is used instead of COMPILER2_OR_JVMCI, which also covers INCLUDE_JVMCI case. > > Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: > > Full #if condition Marked as reviewed by dlong (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23682#pullrequestreview-2634192503 From kvn at openjdk.org Fri Feb 21 20:58:54 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 21 Feb 2025 20:58:54 GMT Subject: RFR: 8350258: AArch64: Client build fails after JDK-8347917 [v2] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 08:43:30 GMT, Dmitry Chuyko wrote: >> The location for rfp should be set in the register map. In particular, it wasn't set in frame::sender_for_interpreter_frame() if neither C2 nor JVMCI were included. >> >> COMPILER1_OR_COMPILER2 condition is used instead of COMPILER2_OR_JVMCI, which also covers INCLUDE_JVMCI case. > > Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: > > Full #if condition Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23682#pullrequestreview-2634201847 From dchuyko at openjdk.org Fri Feb 21 21:46:56 2025 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Fri, 21 Feb 2025 21:46:56 GMT Subject: RFR: 8350258: AArch64: Client build fails after JDK-8347917 [v2] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 08:43:30 GMT, Dmitry Chuyko wrote: >> The location for rfp should be set in the register map. In particular, it wasn't set in frame::sender_for_interpreter_frame() if neither C2 nor JVMCI were included. >> >> COMPILER1_OR_COMPILER2 condition is used instead of COMPILER2_OR_JVMCI, which also covers INCLUDE_JVMCI case. > > Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision: > > Full #if condition Dean, Vladimir, thanks for reviews. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23682#issuecomment-2675608188 From dchuyko at openjdk.org Fri Feb 21 21:46:57 2025 From: dchuyko at openjdk.org (Dmitry Chuyko) Date: Fri, 21 Feb 2025 21:46:57 GMT Subject: Integrated: 8350258: AArch64: Client build fails after JDK-8347917 In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 22:42:18 GMT, Dmitry Chuyko wrote: > The location for rfp should be set in in the register map. In particular, it wasn't set in frame::sender_for_interpreter_frame() if neither C2 nor JVMCI were included. > > COMPILER1_OR_COMPILER2 condition is used instead of COMPILER2_OR_JVMCI, which also covers INCLUDE_JVMCI case. This pull request has now been integrated. Changeset: 25322aae Author: Dmitry Chuyko URL: https://git.openjdk.org/jdk/commit/25322aae8e224680db376098d2e45f26cf3334a0 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8350258: AArch64: Client build fails after JDK-8347917 Reviewed-by: dlong, kvn ------------- PR: https://git.openjdk.org/jdk/pull/23682 From chagedorn at openjdk.org Fri Feb 21 22:31:15 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 21 Feb 2025 22:31:15 GMT Subject: RFR: 8349032: C2: Parse Predicate refactoring in Loop Unswitching broke fix for JDK-8290850 [v2] In-Reply-To: References: Message-ID: > In the refactoring for [JDK-8344035](https://bugs.openjdk.org/browse/JDK-8344035), the value passed for the `rewire_uncommon_proj_phi_inputs` parameter in `PhaseIdealLoop::create_new_if_for_predicate()` during Loop Unswitching was accidentally flipped. It should only be set to `true` when calling it for a false-path loop, which is the cloned loop. This is currently not the case and leads to a bad graph due to folding nodes wrongly: > https://github.com/openjdk/jdk/blob/735805d9259037ae594eb4f75e96860d43feea5d/src/hotspot/share/opto/predicates.cpp#L84-L88 > > I fixed this by just flipping the parameter from `is_true_path_loop` to `is_false_path_loop` to avoid a negation. 
I added an additional comment to `PhaseIdealLoop::create_new_if_for_predicate()` about `rewire_uncommon_proj_phi_inputs`. > > More background about why we need `rewire_uncommon_proj_phi_inputs` in the first place can be found in the corresponding fix for [JDK-8290850](https://bugs.openjdk.org/browse/JDK-8290850): https://github.com/openjdk/jdk/pull/11452, and additionally in https://github.com/openjdk/jdk/pull/5185 > > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: restore comment line length ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23712/files - new: https://git.openjdk.org/jdk/pull/23712/files/b225477e..8c0f57b3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23712&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23712&range=00-01 Stats: 6 lines in 1 file changed: 2 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23712.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23712/head:pull/23712 PR: https://git.openjdk.org/jdk/pull/23712 From chagedorn at openjdk.org Fri Feb 21 22:31:15 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 21 Feb 2025 22:31:15 GMT Subject: RFR: 8349032: C2: Parse Predicate refactoring in Loop Unswitching broke fix for JDK-8290850 [v2] In-Reply-To: <02kXOy7t4Xgo14xQVlrBj_HsBuGapMnXVSZ-rLlJ4S4=.6802a90e-8ca8-4f7c-8a88-b0c3b9cd9403@github.com> References: <02kXOy7t4Xgo14xQVlrBj_HsBuGapMnXVSZ-rLlJ4S4=.6802a90e-8ca8-4f7c-8a88-b0c3b9cd9403@github.com> Message-ID: On Thu, 20 Feb 2025 17:53:28 GMT, Vladimir Kozlov wrote: >> Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: >> >> restore comment line length > > src/hotspot/share/opto/loopPredicate.cpp line 100: > >> 98: // new_iff is returned which is an IfTrue projection. This code is also used to clone predicates to cloned loops. 
>> 99: // 'rewire_uncommon_proj_phi_inputs' should be set to the non-default value 'true' when called for a false-path loop >> 100: // during Loop Unswitching. > > Just nitpick. Can you return back length of comment's lines? Yes sure, done in new commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23712#discussion_r1966278487 From dlong at openjdk.org Fri Feb 21 23:07:53 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 21 Feb 2025 23:07:53 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 15:57:30 GMT, Marc Chevalier wrote: > Care must be taken in the case con1 + con2 is bigger than the number of bits in the integer type. In this case, we must simplify to 0. So `1 << 33` and `1 << 30 << 3` are still treated differently, according to the JVM spec? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23728#issuecomment-2675792350 From kvn at openjdk.org Fri Feb 21 23:20:58 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 21 Feb 2025 23:20:58 GMT Subject: RFR: 8349032: C2: Parse Predicate refactoring in Loop Unswitching broke fix for JDK-8290850 [v2] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 22:31:15 GMT, Christian Hagedorn wrote: >> In the refactoring for [JDK-8344035](https://bugs.openjdk.org/browse/JDK-8344035), the value passed for the `rewire_uncommon_proj_phi_inputs` parameter in `PhaseIdealLoop::create_new_if_for_predicate()` during Loop Unswitching was accidentally flipped. It should only be set to `true` when calling it for a false-path loop, which is the cloned loop. This is currently not the case and leads to a bad graph due to folding nodes wrongly: >> https://github.com/openjdk/jdk/blob/735805d9259037ae594eb4f75e96860d43feea5d/src/hotspot/share/opto/predicates.cpp#L84-L88 >> >> I fixed this by just flipping the parameter from `is_true_path_loop` to `is_false_path_loop` to avoid a negation. 
I added an additional comment to `PhaseIdealLoop::create_new_if_for_predicate()` about `rewire_uncommon_proj_phi_inputs`. >> >> More background about why we need `rewire_uncommon_proj_phi_inputs` in the first place can be found in the corresponding fix for [JDK-8290850](https://bugs.openjdk.org/browse/JDK-8290850): https://github.com/openjdk/jdk/pull/11452, and additionally in https://github.com/openjdk/jdk/pull/5185 >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > restore comment line length Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23712#pullrequestreview-2634459527 From fyang at openjdk.org Sat Feb 22 10:19:06 2025 From: fyang at openjdk.org (Fei Yang) Date: Sat, 22 Feb 2025 10:19:06 GMT Subject: RFR: 8350480: RISC-V: Relax assertion about registers in C2_MacroAssembler::minmax_fp In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 10:33:04 GMT, Hamlin Li wrote: >> Hi, please review this trivial change. >> The current assertion about the registers is more than needed. >> It requires that `dst`, `src1` and `src2` must be different from each other. >> But the code only required that `dst` must be different from `src1` and `src2`. >> Patch simply relaxes the assersion removing the unneeded constraint. >> fastdebug builds OK with change. > > Looks good. @Hamlin-Li @feilongjiang : Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23723#issuecomment-2676133745 From fyang at openjdk.org Sat Feb 22 10:19:06 2025 From: fyang at openjdk.org (Fei Yang) Date: Sat, 22 Feb 2025 10:19:06 GMT Subject: Integrated: 8350480: RISC-V: Relax assertion about registers in C2_MacroAssembler::minmax_fp In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 06:36:52 GMT, Fei Yang wrote: > Hi, please review this trivial change. > The current assertion about the registers is more than needed. 
> It requires that `dst`, `src1` and `src2` must be different from each other. > But the code only requires that `dst` must be different from `src1` and `src2`. > Patch simply relaxes the assertion, removing the unneeded constraint. > fastdebug builds OK with change. This pull request has now been integrated. Changeset: a8916308 Author: Fei Yang URL: https://git.openjdk.org/jdk/commit/a891630817844c8c42994da3b3110925ca4595a0 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod 8350480: RISC-V: Relax assertion about registers in C2_MacroAssembler::minmax_fp Reviewed-by: mli, fjiang ------------- PR: https://git.openjdk.org/jdk/pull/23723 From duke at openjdk.org Sun Feb 23 04:14:04 2025 From: duke at openjdk.org (kuaiwei) Date: Sun, 23 Feb 2025 04:14:04 GMT Subject: RFR: 8347405: MergeStores with reverse bytes order value [v11] In-Reply-To: References: Message-ID: On Tue, 28 Jan 2025 03:09:28 GMT, kuaiwei wrote: >> This patch enhances the MergeStores optimization to support merging values with reversed byte order. 
>> >> Below are the benchmark results before and after the patch: >> >> On aliyun g8y (aarch64)
>> |name | before | after | ratio |
>> |---|---|---|---|
>> |MergeStoreBench.setCharBS |5669.655000 |5669.566000 | 0.00 %|
>> |MergeStoreBench.setCharBV |5516.911000 |5516.273000 | 0.01 %|
>> |MergeStoreBench.setCharC |5578.644000 |5552.809000 | 0.47 %|
>> |MergeStoreBench.setCharLS |5782.140000 |5779.264000 | 0.05 %|
>> |MergeStoreBench.setCharLV |5496.403000 |5499.195000 | -0.05 %|
>> |MergeStoreBench.setIntB |6087.703000 |2768.385000 | 119.90 %|
>> |MergeStoreBench.setIntBU |6733.813000 |2950.240000 | 128.25 %|
>> |MergeStoreBench.setIntBV |1362.233000 |1361.821000 | 0.03 %|
>> |MergeStoreBench.setIntL |2834.785000 |2833.042000 | 0.06 %|
>> |MergeStoreBench.setIntLU |2947.145000 |2946.874000 | 0.01 %|
>> |MergeStoreBench.setIntLV |5506.791000 |5506.229000 | 0.01 %|
>> |MergeStoreBench.setIntRB |7634.279000 |5611.058000 | 36.06 %|
>> |MergeStoreBench.setIntRBU |7766.737000 |5551.281000 | 39.91 %|
>> |MergeStoreBench.setIntRL |5689.793000 |5689.385000 | 0.01 %|
>> |MergeStoreBench.setIntRLU |5628.287000 |5628.789000 | -0.01 %|
>> |MergeStoreBench.setIntRU |5536.039000 |5534.910000 | 0.02 %|
>> |MergeStoreBench.setIntU |5595.363000 |5567.810000 | 0.49 %|
>> |MergeStoreBench.setLongB |13722.671000 |6811.098000 | 101.48 %|
>> |MergeStoreBench.setLongBU |13728.844000 |4280.240000 | 220.75 %|
>> |MergeStoreBench.setLongBV |2785.255000 |2785.949000 | -0.02 %|
>> |MergeStoreBench.setLongL |5714.615000 |5710.402000 | 0.07 %|
>> |MergeStoreBench.setLongLU |4128.746000 |4129.324000 | -0.01 %|
>> |MergeStoreBench.setLongLV |2793.125000 |2794.438000 | -0.05 %|
>> |MergeStoreBench.setLongRB |14465.223000 |7015.050000 | 106.20 %|
>> |MergeStoreBench.setLongRBU |14546.954000 |6173.210000 | 135.65 %|
>> |MergeStoreBench.setLongRL |6816.145000 |6813.348000 | 0.04 %|
>> |MergeStoreBench.setLongRLU |4289.445000 |4284.239000 | 0.12 %|
>> |MergeStoreBench.setLongRU |3132.471000 |3133.093000 | -0.02 %|
>> |MergeStoreBench.setLongU |3086.779000 |3087.298000 | -0.02 %|
>> >> AMD EPYC 9T24 >> ... > > kuaiwei has updated the pull request incrementally with three additional commits since the last revision: > > - Allow ValueOrder::Reverse on big-endian platforms > - Revert "Merge more stores" > > This reverts commit 1e1113ed02ec5a9fe181f215d5667e8de487fe47. > - Revert "Fix test502aBE" > > This reverts commit f773fa368577c4f67957c4d40968c5c45e3ae205. Keep it alive. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23030#issuecomment-2676554804 From duke at openjdk.org Sun Feb 23 09:29:52 2025 From: duke at openjdk.org (Tobias Hotz) Date: Sun, 23 Feb 2025 09:29:52 GMT Subject: RFR: 8349563: Improve AbsNode::Value() for integer types In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 05:10:04 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a small patch that improves the implementation of Value() for `AbsINode` and `AbsLNode` by returning the absolute value of the input range. Most of the logic is trivial except for the special case where `_lo == jint_min/jlong_min` which must return the entire type range when encountered, for which I've added a small proof in the comments. I've also added some unit tests and updated the file to limit IR check platforms with more granularity. > > Thoughts and reviews would be appreciated! test/hotspot/jtreg/compiler/c2/irTests/TestIRAbs.java line 295: > 293: public boolean testIntRange3(int in) { > 294: // [-9, -2] => [2, 9] > 295: return Math.abs(-((in & 7) + 2)) < 2; Not sure if this is in scope for this PR, but `abs(x)` should be idealized into `0 - x` if x <= 0. This seems to be missing at the moment.
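To make the suggestion above concrete, here is a minimal Java sketch. It is not part of the PR, and the class and method names are made up for illustration. It only checks, at the Java level, that `Math.abs(x)` and `0 - x` agree whenever `x <= 0`, using the same `[-9, -2]` input shape as `testIntRange3`; it says nothing about whether or how C2 would perform such a transform.

```java
// Hypothetical illustration class; not taken from the JDK test suite.
class AbsRangeSketch {
    // Same shape as the expression in testIntRange3: (in & 7) is in [0, 7],
    // so the result is always in [-9, -2].
    static int boundedNegative(int in) {
        return -((in & 7) + 2);
    }

    public static void main(String[] args) {
        for (int in = -16; in <= 16; in++) {
            int x = boundedNegative(in);
            // For any x <= 0, Math.abs(x) and 0 - x compute the same value.
            if (Math.abs(x) != 0 - x) {
                throw new AssertionError("abs and negation disagree for " + x);
            }
            // The absolute value of a [-9, -2] input lies in [2, 9].
            if (Math.abs(x) < 2 || Math.abs(x) > 9) {
                throw new AssertionError("unexpected range for " + x);
            }
        }
        // The identity also holds at Integer.MIN_VALUE, where both sides wrap.
        if (Math.abs(Integer.MIN_VALUE) != 0 - Integer.MIN_VALUE) {
            throw new AssertionError("identity fails at Integer.MIN_VALUE");
        }
        System.out.println("abs(x) == 0 - x held for all tested x <= 0");
    }
}
```

Because the identity also holds for the wrapping `Integer.MIN_VALUE` case, a rewrite of this kind would not need a special case for it, assuming the input type's upper bound is known to be at most 0.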
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23685#discussion_r1966721899 From fyang at openjdk.org Mon Feb 24 03:34:57 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 24 Feb 2025 03:34:57 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v3] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 14:48:12 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch? >> >> Currently, the `string_compare` code is a bit complicated. The main reasons are: >> 1. it has 2 separate pieces of code for the LU and UL cases, which is unnecessary because LU and UL behave quite similarly. >> 2. it mixes the LL/UU and LU/UL cases together; it is better to separate them, as they are quite different from each other. >> >> This is not good for reading and maintaining the code. >> >> >> So, this patch does the following refactoring: >> 1. merge the LU and UL code into one, i.e. remove the UL code. >> 2. separate the code into 2 methods: LL/UU and LU/UL. >> 3. some other misc improvements. >> >> I could do the following refactoring in a follow-up PR, as in this PR I'm just moving and removing code, which is easier to do and review. In particular the first one, as it needs to rewrite the existing code for the UL/LU case. >> 1. move alignment code of `generate_compare_long_string_different_encoding` upwards to `string_compare_long_LU`. >> 2. make `SHORT_STRING` case simpler. >> >> >> >> Thanks > > Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 11 additional commits since the last revision: > > - fix UL and test > - Merge branch 'master' into refactor-string-compare > - minor > - fix temp registers; move code > - blank lines > - simplify > - clean > - merge UL and LU > - move to functions > - move alignment code of LL&UU down from common code path > - ... 
and 1 more: https://git.openjdk.org/jdk/compare/70ff57cd...4f5ae272 Thanks for the update. A couple of comments remain. src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1605: > 1603: string_compare_long_same_encoding(result, str1, str2, > 1604: cnt1, cnt2, tmp1, tmp2, tmp3, > 1605: isLL, isLL ? base_offset1 : base_offset2, minCharsInWord, `base_offset` is only used by the two new assembler subroutines, so it's more reasonable to calculate there. And `minCharsInWord` is simple and I think this param could also be saved. This way the argument list will be shorter. src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1611: > 1609: isLU ? str1 : str2, > 1610: isLU ? str2 : str1, > 1611: isUL, Both `isLU` and `isUL` are used to prepare the params. Personally, I prefer to keep it simple and only use `isLU` here. Using `isLU` will also be kind of consistent with the order of the second and third params of this subroutine (renamed param `isUL` to `IsLU`): void C2_MacroAssembler::string_compare_long_different_encoding(Register result, Register strL, Register strU, bool isLU, Register cnt1, Register cnt2, src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1618: > 1616: } > 1617: > 1618: bind(STUB); Consider moving this STUB generation code to the new subroutines at the same time. 
------------- PR Review: https://git.openjdk.org/jdk/pull/23633#pullrequestreview-2635951599 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1967002792 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1967004147 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1967006761 From jbhateja at openjdk.org Mon Feb 24 05:24:00 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 24 Feb 2025 05:24:00 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v26] In-Reply-To: References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> Message-ID: <-MTKNvpPIWaNfR5bIgKVFs8zG3ulpUWPZQf6P5EhV5M=.3ed20f4b-6701-4e42-8299-7b805089b146@github.com> On Thu, 20 Feb 2025 09:47:50 GMT, Jatin Bhateja wrote: >> Patch promotes the sharing of commutative vector IR with the same inputs but different input ordering. >> Similar to scalar IR where we perform edge swapping by [sorting inputs](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L122) based on node indices during IR idealization. >> >> Following are the performance stats for JMH micro included with the patch. 
>> >> >> Granite Rapids (P-core Xeon Server) >> Baseline : >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 8982.549 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 6072.773 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2368.856 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 15215.087 ops/ms >> >> Withopt: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 11963.554 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 7036.088 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2906.731 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 17148.131 ops/ms >> >> Sierra Forest (E-core Xeon Server) >> Baseline: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 2444.359 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 1710.256 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 308.766 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 3902.179 ops/ms >> >> Withopt: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.com... > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 20 commits: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342393 > - Safety assertion added > - Review resolutions > - Lowering feature check to IR annotation level > - Adding missed feature check > - Review comments resolutions. 
> - Modifed scheme not based over fragile node level flags base solution. > - Updating comments for clarity > - Adding a missed check to skip over commoning of predicated vector operations > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342393 > - ... and 10 more: https://git.openjdk.org/jdk/compare/1e87ff01...acb613da Hi @TobiHartmann , @iwanowww , Could you kindly approve the patch? It's waiting for one more approval. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22863#issuecomment-2677461426 From thartmann at openjdk.org Mon Feb 24 06:07:58 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 24 Feb 2025 06:07:58 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v2] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 15:41:34 GMT, Marc Chevalier wrote: >> Remove frem and drem macro nodes when the result is not used. These nodes have other outputs (like memory) that are not meaningful but prevent the nodes from being dropped easily. This patch removes the useless frem/drem nodes by rewiring the inputs to the outputs. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Fix and use replace_with_con Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23694#pullrequestreview-2636068801 From vlivanov at openjdk.org Mon Feb 24 06:34:05 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Mon, 24 Feb 2025 06:34:05 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v26] In-Reply-To: References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> Message-ID: On Thu, 20 Feb 2025 09:47:50 GMT, Jatin Bhateja wrote: >> Patch promotes the sharing of commutative vector IR with the same inputs but different input ordering. 
>> Similar to scalar IR where we perform edge swapping by [sorting inputs](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L122) based on node indices during IR idealization. >> >> Following are the performance stats for JMH micro included with the patch. >> >> >> Granite Rapids (P-core Xeon Server) >> Baseline : >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 8982.549 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 6072.773 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2368.856 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 15215.087 ops/ms >> >> Withopt: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 11963.554 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 7036.088 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2906.731 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 17148.131 ops/ms >> >> Sierra Forest (E-core Xeon Server) >> Baseline: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 2444.359 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 1710.256 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 308.766 ops/ms >> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 3902.179 ops/ms >> >> Withopt: >> Benchmark (size) Mode Cnt Score Error Units >> VectorCommutativeOperSharingBenchmark.com... 
> > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 20 commits: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342393 > - Safety assertion added > - Review resolutions > - Lowering feature check to IR annotation level > - Adding missed feature check > - Review comments resolutions. > - Modifed scheme not based over fragile node level flags base solution. > - Updating comments for clarity > - Adding a missed check to skip over commoning of predicated vector operations > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342393 > - ... and 10 more: https://git.openjdk.org/jdk/compare/1e87ff01...acb613da Looks even better! src/hotspot/share/opto/vectornode.cpp line 1101: > 1099: > 1100: // Sort inputs of commutative non-predicated vector operations to help value numbering. > 1101: if (should_swap_inputs_to_help_global_value_numbering()) { It reads way too verbose to me. I'd just shape it as: // Sort inputs of commutative vector operations to help value numbering. if (is_commutative()) { if (in(1)->_idx > in(2)->_idx) { swap_edges(1, 2); } } ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22863#pullrequestreview-2636095051 PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1967092266 From chagedorn at openjdk.org Mon Feb 24 06:48:53 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 24 Feb 2025 06:48:53 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v2] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 15:41:34 GMT, Marc Chevalier wrote: >> Remove frem and drem macros nodes when the result is not used. These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs. 
>> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Fix and use replace_with_con Would be nice to also have an IR test where `drem/frem` only becomes useless in a later compile phase (i.e. first used after parsing and then becomes unused at a later point). Otherwise, looks good to me, too. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23694#pullrequestreview-2636112098 From epeter at openjdk.org Mon Feb 24 07:25:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 07:25:59 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 16:14:09 GMT, Vladimir Kozlov wrote: >> @vnkozlov I suggest that I change the probability to something quite low now, just to make sure that the fast-loop is placed nicely. When I do the experiments for aliasing-analysis runtime-checks, then I will be able to benchmark much better for both cases, since it is much easier to create many different cases. At that point, I could still adapt the probabilities to a different constant. Or maybe I can somehow adjust the probabilities in the chain such that they are balanced. Like if there is 1 condition, give it `0.5`, if there are 2 give them each `sqrt(0.5)`, if there are `n` then `pow(0.5, 1/n)`, so that once you multiply them you get `pow(pow(0.5, 1/n),n) = 0.5`. We could also set another "target" probability than `0.5`. The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesting benchmark results there. >> >> Does that sound ok? >> >>> Can we profile alignment in Interpreter (and C1)? 
>> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? > >> > Can we profile alignment in Interpreter (and C1)? >> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? > > You should not worry about `-Xcomp` it is testing flag - we can use some default there. > I am fine if you think profiling will not bring us much benefits. Note, I am not asking create counters - just a bit to indicate if we had unaligned access to native memory in a method. In such case we may skip predicate and generate multi versions loop during compilation. On other hand, we may have unaligned access only during startup and not later when we compile method. Anyway, it does not affect these changes. > > I will look on changes more later. @vnkozlov I'll think about the "stall" vs "delay" suggestion. > How profitable (performance wise) to optimize slow path loop? Can we skip any optimizations for it - treat it as not-Counted? I suppose that depends on if the slow path loop will be taken. Imagine we are working on some unaligned MemorySegment (or with aliasing runtime-checks failing). In these cases without optimizing we would for example not unroll. 
But unrolling can give quite the speedup, of course at the cost of more compile time and code size. Also some RangeCheck eliminations only happen if you have a pre-main-post loop structure. There are probably other optimizations as well. So yes, if the slow path loop is taken often, then optimizing is probably worth it. What do you think? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2677607527 From chagedorn at openjdk.org Mon Feb 24 07:30:58 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 24 Feb 2025 07:30:58 GMT Subject: Integrated: 8349032: C2: Parse Predicate refactoring in Loop Unswitching broke fix for JDK-8290850 In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 12:18:57 GMT, Christian Hagedorn wrote: > In the refactoring for [JDK-8344035](https://bugs.openjdk.org/browse/JDK-8344035), the value passed for the `rewire_uncommon_proj_phi_inputs` parameter in `PhaseIdealLoop::create_new_if_for_predicate()` during Loop Unswitching was accidentally flipped. It should only be set to `true` when calling it for a false-path loop, which is the cloned loop. This is currently not the case and leads to a bad graph due to folding nodes wrongly: > https://github.com/openjdk/jdk/blob/735805d9259037ae594eb4f75e96860d43feea5d/src/hotspot/share/opto/predicates.cpp#L84-L88 > > I fixed this by just flipping the parameter from `is_true_path_loop` to `is_false_path_loop` to avoid a negation. I added an additional comment to `PhaseIdealLoop::create_new_if_for_predicate()` about `rewire_uncommon_proj_phi_inputs`. > > More background about why we need `rewire_uncommon_proj_phi_inputs` in the first place can be found in the corresponding fix for [JDK-8290850](https://bugs.openjdk.org/browse/JDK-8290850): https://github.com/openjdk/jdk/pull/11452, and additionally in https://github.com/openjdk/jdk/pull/5185 > > Thanks, > Christian This pull request has now been integrated. 
Changeset: a5c9a4db Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/a5c9a4dbde410c687f05951b8f1d3cf72fcaedc0 Stats: 88 lines in 5 files changed: 72 ins; 0 del; 16 mod 8349032: C2: Parse Predicate refactoring in Loop Unswitching broke fix for JDK-8290850 Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/23712 From chagedorn at openjdk.org Mon Feb 24 07:30:57 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 24 Feb 2025 07:30:57 GMT Subject: RFR: 8349032: C2: Parse Predicate refactoring in Loop Unswitching broke fix for JDK-8290850 [v2] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 22:31:15 GMT, Christian Hagedorn wrote: >> In the refactoring for [JDK-8344035](https://bugs.openjdk.org/browse/JDK-8344035), the value passed for the `rewire_uncommon_proj_phi_inputs` parameter in `PhaseIdealLoop::create_new_if_for_predicate()` during Loop Unswitching was accidentally flipped. It should only be set to `true` when calling it for a false-path loop, which is the cloned loop. This is currently not the case and leads to a bad graph due to folding nodes wrongly: >> https://github.com/openjdk/jdk/blob/735805d9259037ae594eb4f75e96860d43feea5d/src/hotspot/share/opto/predicates.cpp#L84-L88 >> >> I fixed this by just flipping the parameter from `is_true_path_loop` to `is_false_path_loop` to avoid a negation. I added an additional comment to `PhaseIdealLoop::create_new_if_for_predicate()` about `rewire_uncommon_proj_phi_inputs`. 
>> >> More background about why we need `rewire_uncommon_proj_phi_inputs` in the first place can be found in the corresponding fix for [JDK-8290850](https://bugs.openjdk.org/browse/JDK-8290850): https://github.com/openjdk/jdk/pull/11452, and additionally in https://github.com/openjdk/jdk/pull/5185 >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > restore comment line length Thanks Tobias and Vladimir for your reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23712#issuecomment-2677614598 From xgong at openjdk.org Mon Feb 24 07:42:57 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 24 Feb 2025 07:42:57 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 08:49:16 GMT, Bhavana Kilambi wrote: >> @Bhavana-Kilambi , left shift can not get right indexes here as values `0x2, 0x4` is landed in each B lane. Maybe we can just try with `bsl` for D size types, as it has only two lanes for long/double types with 128-bit vector length. > > Hi @XiaohongGong , thanks but bsl instruction only has 8B/16B types. not D type. I'll see how I can do this with bsl. Yes, `bsl` only accepts 8B/16B, but it can also work for other types. We need to keep all bits of the lane to 1/0 (e.g. `[0xffffffffffffffff, 0x0000000000000000]` for `T2D` type). You can take the implementation of `VectorBlend` as a reference. BTW, I'm currently working on adding the vector rearrange support for 2D (i.e. 128-bit long/double vector) types, and I met the same issues. I have tested that using a pattern with `bsl` can implement the op. The main idea is 1) compare the shuffle input with an iota index vector, and 2) choose `src` input or `swap two elements in src` based on the comparing result with `bsl`. Hope this could help you! 
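A scalar sketch of the two-step idea described above (compare the shuffle with an iota vector, then bit-select between `src` and its swapped copy) may help. It only models the bit-level behaviour in plain Java; the class and method names are hypothetical, and it is not the AArch64 backend code from either PR.

```java
import java.util.Arrays;

public class BslRearrangeSketch {
    // Scalar model of a bitwise select: take each bit from a where the
    // mask bit is 1, otherwise from b. The mask must be all ones or all
    // zeros per lane for a whole-lane select.
    static long bitSelect(long mask, long a, long b) {
        return (mask & a) | (~mask & b);
    }

    // Hypothetical helper: rearrange a 2-lane long vector by a shuffle
    // whose entries are lane indices (0 or 1).
    static long[] rearrange2(long[] src, int[] shuffle) {
        long[] swapped = { src[1], src[0] };   // "swap two elements in src"
        long[] result = new long[2];
        for (int lane = 0; lane < 2; lane++) {
            // Compare with the iota vector [0, 1]: all ones if the lane
            // keeps its own element, all zeros if it takes the other one.
            long mask = (shuffle[lane] == lane) ? -1L : 0L;
            result[lane] = bitSelect(mask, src[lane], swapped[lane]);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] src = { 10L, 20L };
        System.out.println(Arrays.toString(rearrange2(src, new int[] { 1, 0 })));
        System.out.println(Arrays.toString(rearrange2(src, new int[] { 0, 0 })));
    }
}
```

Because every mask lane is either `0xffffffffffffffff` or `0`, a byte-granular select gives the same result as a lane-granular one, which is why the 8B/16B-only restriction does not matter here.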
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1967146405 From jbhateja at openjdk.org Mon Feb 24 07:43:06 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 24 Feb 2025 07:43:06 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v18] In-Reply-To: References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> <90MwDac7Q83dK8KDagHOst15xV-quGZKVE8n2tP9dsk=.351ed042-9a69-4186-b134-8c3cb6fef6cd@github.com> Message-ID: On Thu, 13 Feb 2025 09:23:54 GMT, Emanuel Peter wrote: >> Hi @eme64 , All comments addressed, looking forward to your approval > > @jatin-bhateja Perfect, it looks good now. Let me run testing one more time just to be sure. Please ping me in a day or so for the results! Thanks @eme64 , @iwanowww , @sviswa7 for your approvals. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22863#issuecomment-2677635442 From jbhateja at openjdk.org Mon Feb 24 07:43:09 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 24 Feb 2025 07:43:09 GMT Subject: RFR: 8342393: Promote commutative vector IR node sharing [v26] In-Reply-To: References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> Message-ID: On Mon, 24 Feb 2025 06:31:07 GMT, Vladimir Ivanov wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 20 commits: >> >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342393 >> - Safety assertion added >> - Review resolutions >> - Lowering feature check to IR annotation level >> - Adding missed feature check >> - Review comments resolutions. >> - Modifed scheme not based over fragile node level flags base solution. 
>> - Updating comments for clarity >> - Adding a missed check to skip over commoning of predicated vector operations >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342393 >> - ... and 10 more: https://git.openjdk.org/jdk/compare/1e87ff01...acb613da > > src/hotspot/share/opto/vectornode.cpp line 1101: > >> 1099: >> 1100: // Sort inputs of commutative non-predicated vector operations to help value numbering. >> 1101: if (should_swap_inputs_to_help_global_value_numbering()) { > > It reads way too verbose to me. > > I'd just shape it as: > > // Sort inputs of commutative vector operations to help value numbering. > if (is_commutative()) { > if (in(1)->_idx > in(2)->_idx) { > swap_edges(1, 2); > } > } It was modified based on @eme64 suggestion, https://github.com/openjdk/jdk/pull/22863#discussion_r1942531523 I am inclined to go with what we have to save another approval cycle :-) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1967145676 From jbhateja at openjdk.org Mon Feb 24 07:43:11 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 24 Feb 2025 07:43:11 GMT Subject: Integrated: 8342393: Promote commutative vector IR node sharing In-Reply-To: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> References: <7TzEoPWnq71MZZOzF_HBXr59hMAX_eNgu12ouhjalm8=.0dd0ce75-ecdb-4e58-86b4-82fb04eceea8@github.com> Message-ID: On Mon, 23 Dec 2024 11:28:46 GMT, Jatin Bhateja wrote: > Patch promotes the sharing of commutative vector IR with the same inputs but different input ordering. > Similar to scalar IR where we perform edge swapping by [sorting inputs](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L122) based on node indices during IR idealization. > > Following are the performance stats for JMH micro included with the patch. 
> > > Granite Rapids (P-core Xeon Server) > Baseline : > Benchmark (size) Mode Cnt Score Error Units > VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 8982.549 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 6072.773 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2368.856 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 15215.087 ops/ms > > Withopt: > Benchmark (size) Mode Cnt Score Error Units > VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 11963.554 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 7036.088 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 2906.731 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 17148.131 ops/ms > > Sierra Forest (E-core Xeon Server) > Baseline: > Benchmark (size) Mode Cnt Score Error Units > VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 2444.359 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing 1024 thrpt 2 1710.256 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing 1024 thrpt 2 308.766 ops/ms > VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing 1024 thrpt 2 3902.179 ops/ms > > Withopt: > Benchmark (size) Mode Cnt Score Error Units > VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing 1024 thrpt 2 3352.839 ... This pull request has now been integrated. 
Changeset: e410af00 Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/e410af00e69587b86536b298b869ddc898fd9862 Stats: 789 lines in 4 files changed: 788 ins; 0 del; 1 mod 8342393: Promote commutative vector IR node sharing Reviewed-by: vlivanov, epeter, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/22863 From epeter at openjdk.org Mon Feb 24 08:03:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 08:03:59 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 16:14:09 GMT, Vladimir Kozlov wrote: >> @vnkozlov I suggest that I change the probability to something quite low now, just to make sure that the fast-loop is placed nicely. When I do the experiments for aliasing-analysis runtime-checks, then I will be able to benchmark much better for both cases, since it is much easier to create many different cases. At that point, I could still adapt the probabilities to a different constant. Or maybe I can somehow adjust the probabilities in the chain such that they are balanced. Like if there is 1 condition, give it `0.5`, if there are 2 give them each `sqrt(0.5)`, if there are `n` then `pow(0.5, 1/n)`, so that once you multiply them you get `pow(pow(0.5, 1/n),n) = 0.5`. We could also set another "target" probability than `0.5`. The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesting benchmark results there. >> >> Does that sound ok? >> >>> Can we profile alignment in Interpreter (and C1)? >> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there.
We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? > >> > Can we profile alignment in Interpreter (and C1)? >> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? > > You should not worry about `-Xcomp` it is testing flag - we can use some default there. > I am fine if you think profiling will not bring us much benefits. Note, I am not asking create counters - just a bit to indicate if we had unaligned access to native memory in a method. In such case we may skip predicate and generate multi versions loop during compilation. On other hand, we may have unaligned access only during startup and not later when we compile method. Anyway, it does not affect these changes. > > I will look on changes more later. @vnkozlov I mean the issue this: once I implement aliasing-analysis runtime-checks with this multiversion approach, then we'd get regressions if we do not optimize the slow path loop. Currently, we would not vectorize (because we have to be ready for aliasing cases), but we at least unroll, and whatever else we can except vectorization. But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. I think we need to avoid that - would you agree? 
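The probability-balancing idea quoted earlier in this message (give each of `n` chained checks `pow(0.5, 1/n)` so the product stays at the target) can be checked numerically. This is only an arithmetic sketch with hypothetical names; it says nothing about how C2 would actually assign branch probabilities.

```java
public class BalancedProbabilitySketch {
    // Per-check probability such that n chained checks multiply to target.
    static double perCheckProbability(double target, int n) {
        return Math.pow(target, 1.0 / n);
    }

    // Product of n identical per-check probabilities.
    static double combined(double perCheck, int n) {
        double product = 1.0;
        for (int i = 0; i < n; i++) {
            product *= perCheck;
        }
        return product;
    }

    public static void main(String[] args) {
        double target = 0.5; // the "target" probability from the discussion
        for (int n = 1; n <= 4; n++) {
            double p = perCheckProbability(target, n);
            System.out.printf("n=%d per-check=%.6f combined=%.6f%n",
                              n, p, combined(p, n));
        }
    }
}
```

With two checks each one gets roughly `0.7071` (that is, `sqrt(0.5)`), and the combined value stays at the target up to floating-point rounding regardless of `n`.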
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2677667789 From bkilambi at openjdk.org Mon Feb 24 09:23:53 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 24 Feb 2025 09:23:53 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: Message-ID: <-exSdNf1CuxqYL--Mi4-L1m2Gop9bPIvdgqQEpAUIeM=.5f4936a7-31d4-45b7-bddf-e973b3687c18@github.com> On Mon, 24 Feb 2025 07:40:44 GMT, Xiaohong Gong wrote: >> Hi @XiaohongGong , thanks but bsl instruction only has 8B/16B types. not D type. I'll see how I can do this with bsl. > > Yes, `bsl` only accepts 8B/16B, but it can also work for other types. We need to keep all bits of the lane to 1/0 (e.g. `[0xffffffffffffffff, 0x0000000000000000]` for `T2D` type). You can take the implementation of `VectorBlend` as a reference. > > BTW, I'm currently working on adding the vector rearrange support for 2D (i.e. 128-bit long/double vector) types, and I met the same issues. I have tested that using a pattern with `bsl` can implement the op. The main idea is 1) compare the shuffle input with an iota index vector, and 2) choose `src` input or `swap two elements in src` based on the comparing result with `bsl`. Hope this could help you! Thank you for your inputs. I'll look into this. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1967262718 From thartmann at openjdk.org Mon Feb 24 09:37:04 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 24 Feb 2025 09:37:04 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24] In-Reply-To: <83jq5rJ72L30aVfKQB9eWgyFlsTz6wGFTr8uW7hV8AE=.00b5c762-0bf7-4bb7-b0ae-4da63a0703f6@github.com> References: <83jq5rJ72L30aVfKQB9eWgyFlsTz6wGFTr8uW7hV8AE=.00b5c762-0bf7-4bb7-b0ae-4da63a0703f6@github.com> Message-ID: <4Ui7r42Q0l-rPn-fGfa1NFcMWqu_2AcAdlPq1Sm7eCg=.8b2f98cc-b1ca-4412-ac20-31086aad5985@github.com> On Mon, 17 Feb 2025 12:18:23 GMT, Matthias Ernst wrote: >> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: >> >> incorporate @eme64's comment suggestions > > Super, glad this worked out! I want to return the compliment, thanks for sticking with me :-) This caused a regression in CCP: [JDK-8350563](https://bugs.openjdk.org/browse/JDK-8350563) @mernst-github, could you please have a look? Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2677861258 From mli at openjdk.org Mon Feb 24 10:05:33 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 24 Feb 2025 10:05:33 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v4] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch? > > Currently, the `string_compare` code is a bit complicated; the main reasons include: > 1. it has 2 pieces of code, one each for the LU and UL cases; this is not necessary, as LU and UL behave quite similarly. > 2. it mixes the LL/UU and LU/UL cases together; it is better to separate them, as they are quite different from each other. > > This is not good for code reading and maintaining. > > > So, this patch does the following refactoring: > 1. merge LU and UL code into one, i.e. remove UL code. > 2. separate the code into 2 methods: LL/UU and LU/UL. > 3.
some other misc improvement. > > I could do the following refactoring in another following pr, as in this pr I'm just moving code and removing code, it's easier to do it and review it. In particular the first one, as it needs to rewrite the existing code for UL/LU case. > 1. move alignment code of `generate_compare_long_string_different_encoding` upwards to `string_compare_long_LU`. > 2. make `SHORT_STRING` case simpler. > > > > Thanks Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: clean ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23633/files - new: https://git.openjdk.org/jdk/pull/23633/files/4f5ae272..fe0efa0e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23633&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23633&range=02-03 Stats: 52 lines in 2 files changed: 20 ins; 18 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/23633.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23633/head:pull/23633 PR: https://git.openjdk.org/jdk/pull/23633 From mli at openjdk.org Mon Feb 24 10:05:35 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 24 Feb 2025 10:05:35 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v3] In-Reply-To: References: Message-ID: <5Gyhh_ooTgJ0r_u1ihnDtVu_PPgkWsZXgNDyKfq4aEw=.73fc06cb-b8dc-470a-9d19-1734d3395738@github.com> On Mon, 24 Feb 2025 03:17:03 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 11 additional commits since the last revision: >> >> - fix UL and test >> - Merge branch 'master' into refactor-string-compare >> - minor >> - fix temp registers; move code >> - blank lines >> - simplify >> - clean >> - merge UL and LU >> - move to functions >> - move alignment code of LL&UU down from common code path >> - ... 
and 1 more: https://git.openjdk.org/jdk/compare/e85abfd8...4f5ae272 > > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1605: > >> 1603: string_compare_long_same_encoding(result, str1, str2, >> 1604: cnt1, cnt2, tmp1, tmp2, tmp3, >> 1605: isLL, isLL ? base_offset1 : base_offset2, minCharsInWord, > > `base_offset` is only used by the two new assembler subroutines, so it's more reasonable to calculate it there. And `minCharsInWord` is simple, so I think this param could also be omitted. This way the argument list will be shorter. OK, will fix. > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1611: > >> 1609: isLU ? str1 : str2, >> 1610: isLU ? str2 : str1, >> 1611: isUL, > > Both `isLU` and `isUL` are used to prepare the params. Personally, I prefer to keep it simple and only use `isLU` here. Using `isLU` will also be kind of consistent with the order of the second and third params of this subroutine (renamed param `isUL` to `isLU`): > > void C2_MacroAssembler::string_compare_long_different_encoding(Register result, Register strL, Register strU, > bool isLU, Register cnt1, Register cnt2, > > > -------- > PS: Why not pass `str1` and `str2` directly like you do for `string_compare_long_same_encoding`? We can distinguish `strL` and `strU` in `string_compare_long_different_encoding` with param `isLU`. Seems to me that it will be simpler at the interface level then. OK, will fix. > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1618: > >> 1616: } >> 1617: >> 1618: bind(STUB); > > Consider moving this STUB generation code to the new subroutines at the same time. It's better to keep it as it is: the logic is different, and if we move it then the 2 new subroutines will need to access the SHORT_* Labels by passing more parameters.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1967323976 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1967324194 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1967326015 From duke at openjdk.org Mon Feb 24 10:08:08 2025 From: duke at openjdk.org (Matthias Ernst) Date: Mon, 24 Feb 2025 10:08:08 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 07:18:52 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. >> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: >> >> >> ((base + index << 8) + 256) & 255 >> => MulNode (this PR) >> (base + index << 8) & 255 >> => MulNode (PR #6697) >> base & 255 (loop invariant) >> >> >> Implementation notes: >> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. >> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ >> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). 
Let me know if would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > incorporate @eme64's comment suggestions Sorry for that. Super-quick look: I have no idea what CCP is, but it appears that it contains redundant logic and knows what kind of optimizations are performed elsewhere. https://github.com/openjdk/jdk21u-dev/blob/fa896c742b494520e6c652539e8218b7127e877a/src/hotspot/share/opto/phaseX.cpp#L1972 looks salient, just from pattern matching I wonder if it's sufficient to add Op_ConI/L to this list. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2677936991 From duke at openjdk.org Mon Feb 24 10:12:05 2025 From: duke at openjdk.org (Matthias Ernst) Date: Mon, 24 Feb 2025 10:12:05 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24] In-Reply-To: References: Message-ID: <3HA22sflRRtxeShTXizFQaOvEec3lwiw4O_H5BrTH30=.4afd9e4e-ab64-4098-b76e-a5778e8c7776@github.com> On Fri, 14 Feb 2025 07:18:52 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. >> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. 
This PR enables further reduction: >> >> >> ((base + index << 8) + 256) & 255 >> => MulNode (this PR) >> (base + index << 8) & 255 >> => MulNode (PR #6697) >> base & 255 (loop invariant) >> >> >> Implementation notes: >> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. >> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ >> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > incorporate @eme64's comment suggestions Another place that seems to understand the Shift-And pattern: https://github.com/openjdk/jdk21u-dev/blob/master/src/hotspot/share/opto/phaseX.cpp#L1611 ------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2677949084 From duke at openjdk.org Mon Feb 24 10:44:53 2025 From: duke at openjdk.org (Marc Chevalier) Date: Mon, 24 Feb 2025 10:44:53 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 23:05:26 GMT, Dean Long wrote: >> This collapses double shift lefts by constants in a single constant: (x << con1) << con2 => x << (con1 + con2). Care must be taken in the case con1 + con2 is bigger than the number of bits in the integer type. In this case, we must simplify to 0. >> >> Moreover, the simplification logic of the sign extension trick had to be improved. For instance, we use `(x << 16) >> 16` to convert a 32 bits into a 16 bits integer, with sign extension. When storing this into a 16-bit field, this can be simplified into simple `x`. 
But in the case where `x` is itself a left-shift expression, say `y << 3`, this PR makes the IR look like `(y << 19) >> 16` instead of the old `((y << 3) << 16) >> 16`. The former logic didn't handle the case where the left and the right shift have different magnitudes. In this PR, I generalize this simplification to cases where the left shift has a larger magnitude than the right shift. This improvement was needed not to miss vectorization opportunities: without the simplification, we have a left shift and a right shift instead of a single left shift, which confuses the type inference. >> >> This also works for multiplications by powers of 2 since they are already translated into shifts. >> >> Thanks, >> Marc > >> Care must be taken in the case con1 + con2 is bigger than the number of bits in the integer type. In this case, we must simplify to 0. > > So `1 << 33` and `1 << 30 << 3` are still treated differently, according to the JVM spec? @dean-long Yes! 1 << 33 was already, and is still, transformed into 1 << 1, while 1 << 30 << 3 is NOT transformed into 1 << 33 but directly into 0. The second part is exhibited by this test:
    @Test
    @IR(failOn = {IRNode.LSHIFT})
    // Checks (x << 31) << 1 => 0
    public int testDoubleShift3(int x) {
        return (x << 31) << 1;
    }
(and a few other similar ones). I didn't add a test for the simple `1 << 33` since my code doesn't kick in unless there are 2 shifts, so nothing should have changed here. I can add such a test if you think it's needed.
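The difference between a single oversized shift and a chain of shifts follows from Java's `int` shift semantics, where only the low five bits of the shift count are used. A minimal sketch, with hypothetical class and method names, that reproduces the cases discussed above at the language level (it does not exercise C2's IR transformations directly):

```java
public class ShiftFoldingSketch {
    // Single shift: the count 33 is masked to 33 & 31 == 1.
    static int singleShift33(int x) {
        return x << 33;
    }

    // Chained shifts: 30 + 3 = 33 bits are shifted out in total, so the
    // result is 0 for every 32-bit input.
    static int chainedShift30Then3(int x) {
        return (x << 30) << 3;
    }

    // The sign-extension idiom: keeps the low 16 bits, sign-extended.
    static int signExtend16(int x) {
        return (x << 16) >> 16;
    }

    public static void main(String[] args) {
        System.out.println(singleShift33(1));        // same as 1 << 1
        System.out.println(chainedShift30Then3(1));  // all bits shifted out
        int y = 12345;
        // Collapsing (y << 3) << 16 into y << 19 is safe because 3 + 16 < 32.
        System.out.println((((y << 3) << 16) >> 16) == ((y << 19) >> 16));
    }
}
```

This is why `(x << con1) << con2` can only be collapsed to `x << (con1 + con2)` while the sum stays below the bit width; beyond that the folded result has to be the constant 0 rather than a shift by the masked count.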
------------- PR Comment: https://git.openjdk.org/jdk/pull/23728#issuecomment-2678032006 From fyang at openjdk.org Mon Feb 24 10:52:52 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 24 Feb 2025 10:52:52 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v3] In-Reply-To: <5Gyhh_ooTgJ0r_u1ihnDtVu_PPgkWsZXgNDyKfq4aEw=.73fc06cb-b8dc-470a-9d19-1734d3395738@github.com> References: <5Gyhh_ooTgJ0r_u1ihnDtVu_PPgkWsZXgNDyKfq4aEw=.73fc06cb-b8dc-470a-9d19-1734d3395738@github.com> Message-ID: On Mon, 24 Feb 2025 10:01:25 GMT, Hamlin Li wrote: >> src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1618: >> >>> 1616: } >>> 1617: >>> 1618: bind(STUB); >> >> Consider moving this STUB generation code to the new subroutines at the same time. > > It's better to keep it as it is, they are different logics, and if moving it then the 2 new subroutines will need to access the SHORT_* Labels by passing more parameters. Make sense. Will take a look at the latest version. Thanks for the update. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1967400128 From roland at openjdk.org Mon Feb 24 12:54:56 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 24 Feb 2025 12:54:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Thu, 20 Feb 2025 09:44:16 GMT, Roland Westrelin wrote: >>> Do you see any better way than having the 2x code size if we need both a slow and fast loop? >> >> No but I was confused by your comment about 3x and 4x which is why I asked for clarification. 
>> Compiled code size affects inlining decisions: if a callee has compiled code and it's larger than some threshold, then the callee is considered too expensive to inline. With your change, some method that was considered ok to inline could now be considered too big. I think that's what Vladimir is concerned by. I don't see what you can do about it, this said. > >> @rwestrel I think I had tried some verifications above, but I could not even get it to work in all cases in `SuperWord`. >> >> In `VLoop::check_preconditions_helper`, I try to find either the predicate or the multiversioning if. But I cannot always find it, and I think that one reason was that the pre-loop can be lost. At least that is what I remember from 4+ weeks ago. > > Do you understand when that happens? It doesn't feel right that the pre loop can be lost. > @rwestrel Do you want me to find examples for the pre-loop disappearing? I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above. Yes, if not too much work. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2678332801 From roland at openjdk.org Mon Feb 24 13:08:21 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 24 Feb 2025 13:08:21 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v11] In-Reply-To: References: Message-ID: > To optimize a long counted loop and long range checks in a long or int > counted loop, the loop is turned into a loop nest. When the loop has > few iterations, the overhead of having an outer loop whose backedge is > never taken, has a measurable cost. Furthermore, creating the loop > nest usually causes one iteration of the loop to be peeled so > predicates can be set up. If the loop is short running, then it's an > extra iteration that's run with range checks (compared to an int > counted loop with int range checks). 
> > This change doesn't create a loop nest when: > > 1- it can be determined statically at loop nest creation time that the > loop runs for a short enough number of iterations > > 2- profiling reports that the loop runs for no more than ShortLoopIter > iterations (1000 by default). > > For 2-, a guard is added which is implemented as yet another predicate. > > While this change is in principle simple, I ran into a few > implementation issues: > > - while c2 has a way to compute the number of iterations of an int > counted loop, it doesn't have that for long counted loops. The > existing logic for int counted loops promotes values to long to > avoid overflows. I reworked it so it now works for both long and int > counted loops. > > - I added a new deoptimization reason (Reason_short_running_loop) for > the new predicate. Given the number of iterations is narrowed down > by the predicate, the limit of the loop after transformation is a > cast node that's control dependent on the short running loop > predicate. Because once the counted loop is transformed, it is > likely that range check predicates will be inserted and they will > depend on the limit, the short running loop predicate has to be the > one that's further away from the loop entry. Now it is also possible > that the limit before transformation depends on a predicate > (TestShortRunningLongCountedLoopPredicatesClone is an example), we > can have: new predicates inserted after the transformation that > depend on the casted limit that itself depends on old predicates > added before the transformation. To solve this circular dependency, > parse and assert predicates are cloned between the old predicates > and the loop head. The cloned short running loop parse predicate is > the one that's used to insert the short running loop predicate.
> > - In the case of a long counted loop, the loop is transformed into a > regular loop with a new limit and transformed range checks that's > later turned into an int counted loop. The int ... Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 35 commits: - Merge branch 'master' into JDK-8342692 - whitespace - Merge branch 'master' into JDK-8342692 - TestMemorySegment test fix - test wip - Merge branch 'master' into JDK-8342692 - refactor - Merge branch 'master' into JDK-8342692 - Merge branch 'master' into JDK-8342692 - Merge branch 'master' into JDK-8342692 - ... and 25 more: https://git.openjdk.org/jdk/compare/e1d0a9c8...f68a7ca6 ------------- Changes: https://git.openjdk.org/jdk/pull/21630/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21630&range=10 Stats: 1316 lines in 25 files changed: 1254 ins; 16 del; 46 mod Patch: https://git.openjdk.org/jdk/pull/21630.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21630/head:pull/21630 PR: https://git.openjdk.org/jdk/pull/21630 From roland at openjdk.org Mon Feb 24 13:09:55 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 24 Feb 2025 13:09:55 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v7] In-Reply-To: References: <3cT_HJ9dj5J4NFrLzmvYUdUy4uee6Ltcm6d20YP3jm0=.aa20c25e-c097-4e59-9d82-12aa2c3b4422@github.com> Message-ID: <7_L1cQ2Vfw8Oo7K7m7Mbh1bIhvYgruFJPbMi5WLhZR8=.fe128071-33af-472f-a22b-c8c949945f49@github.com> On Thu, 13 Feb 2025 10:46:25 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> review > > @rwestrel nice work, looks like a good step to unify the code a little! > > I left some comments / suggestions. > > I'm also wondering about testing. How good do you think test coverage is? Are all cases covered? How about the edge-cases? 
Could we improve the coverage with randomization somehow? @eme64 can you take another look? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23438#issuecomment-2678367268 From thartmann at openjdk.org Mon Feb 24 13:49:04 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 24 Feb 2025 13:49:04 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24] In-Reply-To: References: Message-ID: <8hnzze01DGFPCaM_JYa2Yek_RDC-O5b6eAZMkU_ICTY=.08844769-0a51-405a-9fef-a76568817625@github.com> On Fri, 14 Feb 2025 07:18:52 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. >> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: >> >> >> ((base + index << 8) + 256) & 255 >> => MulNode (this PR) >> (base + index << 8) & 255 >> => MulNode (PR #6697) >> base & 255 (loop invariant) >> >> >> Implementation notes: >> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. >> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ >> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). 
Let me know if you would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > incorporate @eme64's comment suggestions Yes, both CCP and IGVN need special logic to make sure that a node A is added to the worklist if another node B that is not a direct input to A changed such that node A could now be further optimized. You added such an optimization but the code to add the node to the worklist is missing. It's exactly the two places you referred to. I think it should be possible to create a test for IGVN as well, i.e. come up with a case where the optimization is not performed although it could be. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2678462872 From epeter at openjdk.org Mon Feb 24 14:32:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 14:32:57 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Mon, 24 Feb 2025 12:52:42 GMT, Roland Westrelin wrote: > > @rwestrel Do you want me to find examples for the pre-loop disappearing? I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above. > > Yes, if not too much work. Ok, let's add this: diff --git a/src/hotspot/share/opto/vectorization.cpp b/src/hotspot/share/opto/vectorization.cpp index e607a1065dd..290ee249a42 100644 --- a/src/hotspot/share/opto/vectorization.cpp +++ b/src/hotspot/share/opto/vectorization.cpp @@ -98,6 +98,7 @@ VStatus VLoop::check_preconditions_helper() { // the pre-loop limit. 
CountedLoopEndNode* pre_end = _cl->find_pre_loop_end(); if (pre_end == nullptr) { + assert(false, "found no pre-loop"); return VStatus::make_failure(VLoop::FAILURE_PRE_LOOP_LIMIT); } Node* pre_opaq1 = pre_end->limit(); And run that: rr /oracle-work/jdk-fork7/build/linux-x64-slowdebug/jdk/bin/java -Xcomp -XX:+TraceLoopOpts -XX:CompileCommand=compileonly,jdk.internal.classfile.impl.StackMapGenerator::processBlock --version .... PreMainPost Loop: N7127/N4014 limit_check profile_predicated predicated counted [0,int),+1 (2147483648 iters) rc has_sfpt strip_mined Unroll 2 Loop: N7127/N4014 counted [int,int),+1 (2147483648 iters) main rc has_sfpt strip_mined Loop: N0/N0 has_call has_sfpt Loop: N7453/N7460 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre rc has_sfpt Loop: N7126/N7125 sfpts={ 7128 } Loop: N7508/N4014 counted [int,int),+2 (2147483648 iters) main rc has_sfpt strip_mined Loop: N7409/N7416 counted [int,int),+1 (4 iters) post rc has_sfpt Parallel IV: 7728 Loop: N7453/N7460 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre has_sfpt Parallel IV: 7725 Loop: N7508/N4014 counted [int,int),+2 (2147483648 iters) main has_sfpt strip_mined Parallel IV: 7718 Loop: N7409/N7416 counted [int,int),+1 (4 iters) post has_sfpt Loop: N0/N0 has_call has_sfpt Loop: N7453/N7460 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre has_sfpt Loop: N7126/N7125 sfpts={ 7128 } Loop: N7508/N4014 counted [int,int),+2 (2147483648 iters) main has_sfpt strip_mined Loop: N7409/N7416 counted [int,int),+1 (4 iters) post has_sfpt RangeCheck Loop: N7508/N4014 counted [int,int),+2 (2147483648 iters) main has_sfpt rce strip_mined Unroll 4 Loop: N7508/N4014 limit_check counted [int,int),+2 (2147483648 iters) main has_sfpt rce strip_mined Loop: N0/N0 has_call has_sfpt Loop: N7453/N7460 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre rc has_sfpt Loop: N7126/N7125 limit_check sfpts={ 7128 } Loop: 
N8146/N4014 limit_check counted [int,int),+4 (2147483648 iters) main has_sfpt strip_mined Loop: N7409/N7416 counted [int,int),+1 (4 iters) post rc has_sfpt ... # Internal Error (/oracle-work/jdk-fork7/open/src/hotspot/share/opto/vectorization.cpp:101), pid=1381339, tid=1381348 # assert(false) failed: found no pre-loop The pre-loop node is not dead actually. The issue is with the main-loop in `CountedLoopNode::is_canonical_loop_entry`. We skip through some predicates, but then we cannot find the ZeroTripGuard, rather I'm seeing this: (rr) p ctrl->dump_bfs(2,0,"#cd") dist dump --------------------------------------------- 2 974 ConI === 0 [[ ... ]] #int:1 2 8060 IfTrue === 8056 [[ 8073 ]] #1 1 8073 If === 8060 974 [[ 8074 8077 ]] #Last Value Assertion Predicate P=0.999999, C=-1.000000 0 8077 IfTrue === 8073 [[ 8103 ]] #1 The pre-loop is further up though: (rr) p this->dump_bfs(26,0,"#c") dist dump --------------------------------------------- 26 7453 CountedLoop === 7453 4015 7460 [[ 7452 7453 7454 7455 ]] inner stride: 1 pre of N7127 !orig=[7127],[7118],[2645] !jvms: StackMapGenerator::processBlock @ bci:2677 (line 671) 25 7455 If === 7453 7441 [[ 7456 7464 ]] P=0.000001, C=-1.000000 !orig=[2686] !jvms: StackMapGenerator$Frame::popStack @ bci:5 (line 1001) StackMapGenerator::processBlock @ bci:2681 (line 671) 24 7456 IfFalse === 7455 [[ 7448 7457 ]] #0 !orig=[2631],[2628] !jvms: StackMapGenerator$Frame::popStack @ bci:5 (line 1001) StackMapGenerator::processBlock @ bci:2681 (line 671) 23 7457 RangeCheck === 7456 7446 [[ 7458 7467 ]] P=0.999999, C=-1.000000 !orig=[1189] !jvms: StackMapGenerator$Frame::popStack @ bci:33 (line 1002) StackMapGenerator::processBlock @ bci:2681 (line 671) 22 7458 IfTrue === 7457 [[ 7459 ]] #1 !orig=[777],385 !jvms: StackMapGenerator$Frame::popStack @ bci:33 (line 1002) StackMapGenerator::processBlock @ bci:2681 (line 671) 21 7459 CountedLoopEnd === 7458 7443 [[ 7460 7482 ]] [lt] P=0.900000, C=-1.000000 !orig=7122,[5398] !jvms: 
StackMapGenerator::processBlock @ bci:2674 (line 670) 20 7482 IfFalse === 7459 [[ 7486 ]] #0 19 7486 If === 7482 7485 [[ 7461 7487 ]] P=0.999999, C=-1.000000 18 7487 IfTrue === 7486 [[ 7977 ]] #1 17 7977 If === 7487 974 [[ 7978 7981 ]] #Init Value Assertion Predicate P=0.999999, C=-1.000000 16 7981 IfTrue === 7977 [[ 7994 ]] #1 15 7994 If === 7981 974 [[ 7995 7998 ]] #Last Value Assertion Predicate P=0.999999, C=-1.000000 14 7998 IfTrue === 7994 [[ 8118 ]] #1 13 8118 If === 7998 8117 [[ 8119 8122 ]] #Last Value Assertion Predicate P=0.999999, C=-1.000000 12 8122 IfTrue === 8118 [[ 8007 ]] #1 11 8007 If === 8122 8006 [[ 8008 8011 ]] #Init Value Assertion Predicate P=0.999999, C=-1.000000 10 8011 IfTrue === 8007 [[ 8056 ]] #1 9 8056 If === 8011 974 [[ 8057 8060 ]] #Init Value Assertion Predicate P=0.999999, C=-1.000000 8 8060 IfTrue === 8056 [[ 8073 ]] #1 7 8073 If === 8060 974 [[ 8074 8077 ]] #Last Value Assertion Predicate P=0.999999, C=-1.000000 6 8077 IfTrue === 8073 [[ 8103 ]] #1 5 8173 IfFalse === 7122 [[ 7128 7129 ]] #0 !orig=[7524],[7123],[5442] !jvms: StackMapGenerator::processBlock @ bci:2674 (line 670) 5 8103 If === 8077 8102 [[ 8104 8107 ]] #Last Value Assertion Predicate P=0.999999, C=-1.000000 4 7128 SafePoint === 8173 1 778 1 1 7129 780 1 1 781 781 782 783 784 1 1 1 785 786 [[ 7124 ]] SafePoint !orig=385 !jvms: StackMapGenerator::processBlock @ bci:2688 (line 670) 4 8107 IfTrue === 8103 [[ 8086 ]] #1 3 7124 OuterStripMinedLoopEnd === 7128 781 [[ 7125 7471 ]] P=0.900000, C=-1.000000 3 8086 If === 8107 8085 [[ 8087 8090 ]] #Init Value Assertion Predicate P=0.999999, C=-1.000000 2 7122 CountedLoopEnd === 8146 7121 [[ 8173 4014 ]] [lt] P=0.900000, C=-1.000000 !orig=[5398] !jvms: StackMapGenerator::processBlock @ bci:2674 (line 670) 2 7125 IfTrue === 7124 [[ 7126 ]] #1 2 8090 IfTrue === 8086 [[ 7126 ]] #1 1 4014 IfTrue === 7122 [[ 8146 ]] #1 !jvms: StackMapGenerator::processBlock @ bci:2674 (line 670) 1 7126 OuterStripMinedLoop === 7126 8090 7125 [[ 7126 
8146 ]] 0 8146 CountedLoop === 8146 7126 4014 [[ 8146 1191 8157 8158 7122 7503 ]] inner stride: 4 main of N8146 strip mined !orig=[7508],[7127],[7118],[2645] !jvms: StackMapGenerator::processBlock @ bci:2677 (line 671) It looks like we are skipping some predicates, but not enough of them maybe? In `AssertionPredicates::find_entry` we see: - `8090 IfTrue === 8086 [[ 7126 ]] #1`: `is_predicate` returns `true`. - `8107 IfTrue === 8103 [[ 8086 ]] #1`: `is_predicate` returns `true`. - `8077 IfTrue === 8073 [[ 8103 ]] #1`: `is_predicate` returns `false`. The reason is that the assertion predicate Opaque nodes have already disappeared. I talked with @chhagedorn and he says that there are some "dying" initialized assertion predicates from unrolling that can be in the way. They would be cleaned out by IGVN later, and then we can see through. But at this point they are in the way and we cannot see through and find the ZeroTripGuard, the predicate iterator is not good enough yet. But @chhagedorn is working on that. https://bugs.openjdk.org/browse/JDK-8350579 The implication is that the ZeroTripGuard can be temporarily not be found, and so we cannot even find the pre-loop, and also not the multiversion-if. So I cannot really add an assert now. And who knows, there may be other blocking reasons on top of that. @rwestrel Does that make sense? What do you think we should do? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2678602660 From fyang at openjdk.org Mon Feb 24 14:51:01 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 24 Feb 2025 14:51:01 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v4] In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 10:05:33 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch? >> >> Currently, `string_compare` code is a bit complicated, main reasons include: >> 1. 
it has 2 pieces of code, one each for the LU and UL cases; this is not necessary, as LU and UL behave quite similarly. >> 2. it mixes the LL/UU and LU/UL cases together; better to separate them, as they are quite different from each other. >> >> This is not good for code reading and maintenance. >> >> >> So, this patch does the following refactoring: >> 1. merge LU and UL code into one, i.e. remove UL code. >> 2. separate the code into 2 methods: LL/UU and LU/UL. >> 3. some other misc improvements. >> >> I could do the following refactoring in a follow-up PR, as in this PR I'm just moving code and removing code, so it's easier to do and to review. In particular the first one, as it needs to rewrite the existing code for the UL/LU case. >> 1. move alignment code of `generate_compare_long_string_different_encoding` upwards to `string_compare_long_LU`. >> 2. make `SHORT_STRING` case simpler. >> >> >> >> Thanks > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > clean src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1398: > 1396: assert((base_offset2 % (UseCompactObjectHeaders ? 4 : > 1397: (UseCompressedClassPointers ? 8 : 4))) == 0, "Must be"); > 1398: const int base_offset = isLL ? base_offset1 : base_offset2; Since only one of `base_offset1` and `base_offset2` will be used, better to do: const int base_offset = isLL ? arrayOopDesc::base_offset_in_bytes(T_BYTE) : arrayOopDesc::base_offset_in_bytes(T_CHAR); assert((base_offset % (UseCompactObjectHeaders ? 4 : (UseCompressedClassPointers ? 8 : 4))) == 0, "Must be"); src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1452: > 1450: > 1451: bne(tmp1, tmp2, DIFFERENCE); > 1452: bltz(cnt2, NEXT_WORD); As these two instructions belong to the main loop, I think it's cleaner to add proper indentation for them and group them together with the other instructions in the loop. 
src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1526: > 1524: > 1525: bne(tmpL, tmpU, DIFFERENCE); > 1526: bltz(cnt2, NEXT_WORD); Similar here. Do you mind adding proper indentation for these instructions and grouping them together with the other instructions in the main loop? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1967474451 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1967478342 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1967480109 From mli at openjdk.org Mon Feb 24 15:11:29 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 24 Feb 2025 15:11:29 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v5] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch? > > Currently, `string_compare` code is a bit complicated, main reasons include: > 1. it has 2 pieces of code, one each for the LU and UL cases; this is not necessary, as LU and UL behave quite similarly. > 2. it mixes the LL/UU and LU/UL cases together; better to separate them, as they are quite different from each other. > > This is not good for code reading and maintenance. > > > So, this patch does the following refactoring: > 1. merge LU and UL code into one, i.e. remove UL code. > 2. separate the code into 2 methods: LL/UU and LU/UL. > 3. some other misc improvements. > > I could do the following refactoring in a follow-up PR, as in this PR I'm just moving code and removing code, so it's easier to do and to review. In particular the first one, as it needs to rewrite the existing code for the UL/LU case. > 1. move alignment code of `generate_compare_long_string_different_encoding` upwards to `string_compare_long_LU`. > 2. make `SHORT_STRING` case simpler. 
> > > > Thanks Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: clean 2 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23633/files - new: https://git.openjdk.org/jdk/pull/23633/files/fe0efa0e..d67a23e3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23633&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23633&range=03-04 Stats: 13 lines in 1 file changed: 0 ins; 5 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/23633.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23633/head:pull/23633 PR: https://git.openjdk.org/jdk/pull/23633 From epeter at openjdk.org Mon Feb 24 15:30:07 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 15:30:07 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Mon, 24 Feb 2025 12:52:42 GMT, Roland Westrelin wrote: >>> @rwestrel I think I had tried some verifications above, but I could not even get it to work in all cases in `SuperWord`. >>> >>> In `VLoop::check_preconditions_helper`, I try to find either the predicate or the multiversioning if. But I cannot always find it, and I think that one reason was that the pre-loop can be lost. At least that is what I remember from 4+ weeks ago. >> >> Do you understand when that happens? It doesn't feel right that the pre loop can be lost. > >> @rwestrel Do you want me to find examples for the pre-loop disappearing? I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above. > > Yes, if not too much work. @rwestrel I think we should just file an RFE to keep track of these assertions we would like to add once those issues are fixed. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2678803600 From epeter at openjdk.org Mon Feb 24 15:46:09 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 15:46:09 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v9] In-Reply-To: <4RO3ysBh6pWId8Na0pTdO9X5sBvCh2F5l-KO3OdHF4k=.31a599cc-195f-4330-a4ce-0618209635de@github.com> References: <4RO3ysBh6pWId8Na0pTdO9X5sBvCh2F5l-KO3OdHF4k=.31a599cc-195f-4330-a4ce-0618209635de@github.com> Message-ID: On Fri, 14 Feb 2025 15:57:54 GMT, Roland Westrelin wrote: >> This change refactors `RShiftI`/`RshiftL` `Ideal`, `Identity` and >> `Value` because the `int` and `long` versions are very similar and so >> there's no logic duplication. In the process, support for some extra >> transformations is added to `RShiftL`. I also added some new test >> cases. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 15 additional commits since the last revision: > > - review > - review > - review > - Merge branch 'master' into JDK-8349361 > - Update src/hotspot/share/opto/mulnode.cpp > > Co-authored-by: Emanuel Peter > - Update src/hotspot/share/opto/mulnode.cpp > > Co-authored-by: Emanuel Peter > - review > - Update src/hotspot/share/opto/mulnode.hpp > > Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> > - Update src/hotspot/share/opto/mulnode.cpp > > Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> > - Update src/hotspot/share/opto/mulnode.cpp > > Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> > - ... 
and 5 more: https://git.openjdk.org/jdk/compare/4ffbd68b...5b05d222 src/hotspot/share/opto/mulnode.cpp line 1351: > 1349: const Node* and_node = in(1); > 1350: if (and_node->Opcode() == Op_And(bt) && > 1351: (mask_t = phase->type(and_node->in(2))->isa_integer(bt)) && Is this an implicit null check? Style guide: > Do not use ints or pointers as (implicit) booleans with &&, ||, if, while. Instead, compare explicitly, i.e. if (x != 0) or if (ptr != nullptr), etc. src/hotspot/share/opto/mulnode.cpp line 1415: > 1413: > 1414: return nullptr; > 1415: } This is the same code as above, right? As commented above: it would be good to not move it and reduce the size of the diff. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1967856470 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1967904048 From epeter at openjdk.org Mon Feb 24 15:46:09 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 15:46:09 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v9] In-Reply-To: References: <4RO3ysBh6pWId8Na0pTdO9X5sBvCh2F5l-KO3OdHF4k=.31a599cc-195f-4330-a4ce-0618209635de@github.com> Message-ID: On Mon, 24 Feb 2025 15:18:43 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 15 additional commits since the last revision: >> >> - review >> - review >> - review >> - Merge branch 'master' into JDK-8349361 >> - Update src/hotspot/share/opto/mulnode.cpp >> >> Co-authored-by: Emanuel Peter >> - Update src/hotspot/share/opto/mulnode.cpp >> >> Co-authored-by: Emanuel Peter >> - review >> - Update src/hotspot/share/opto/mulnode.hpp >> >> Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> >> - Update src/hotspot/share/opto/mulnode.cpp >> >> Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> >> - Update src/hotspot/share/opto/mulnode.cpp >> >> Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> >> - ... and 5 more: https://git.openjdk.org/jdk/compare/4ffbd68b...5b05d222 > > src/hotspot/share/opto/mulnode.cpp line 1351: > >> 1349: const Node* and_node = in(1); >> 1350: if (and_node->Opcode() == Op_And(bt) && >> 1351: (mask_t = phase->type(and_node->in(2))->isa_integer(bt)) && > > Is this an implicit null check? > > Style guide: >> Do not use ints or pointers as (implicit) booleans with &&, ||, if, while. Instead, compare explicitly, i.e. if (x != 0) or if (ptr != nullptr), etc. Honestly, why not just split the `if` into 2 ifs. `const TypeInteger* mask_t;` That looks a little less than nice as well. 
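For illustration only, the "split the `if` into 2 ifs" idea above could look roughly like the sketch below. It is a minimal, standalone approximation: `IntegerType`, `Type`, `isa_integer()` without arguments, `is_con()` and `has_constant_mask()` are hypothetical stand-ins and not the HotSpot classes, and the quoted snippet does not show what the real condition checks after the assignment, so the constant check is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for C2's type lattice; these are NOT the HotSpot classes.
struct IntegerType {
  int64_t lo;
  int64_t hi;
  bool is_con() const { return lo == hi; }  // a single-value range is a constant
};

struct Type {
  const IntegerType* integer;  // nullptr when the type is not an integer type
  // Follows the "isa_" naming convention: returns nullptr on a mismatch.
  const IntegerType* isa_integer() const { return integer; }
};

// Style being questioned: an assignment used as an implicit boolean, e.g.
//   if (is_and && (mask_t = t->isa_integer()) && mask_t->is_con()) { ... }
//
// Sketch of the alternative: two separate checks and an explicit nullptr comparison.
bool has_constant_mask(bool is_and, const Type* t, int64_t* mask_out) {
  assert(t != nullptr && mask_out != nullptr);
  if (!is_and) {
    return false;
  }
  const IntegerType* mask_t = t->isa_integer();
  if (mask_t == nullptr || !mask_t->is_con()) {
    return false;
  }
  *mask_out = mask_t->lo;
  return true;
}
```

Declaring `mask_t` at the point of initialization also avoids the separate `const TypeInteger* mask_t;` declaration mentioned above, at the cost of an extra early return.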
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1967875020 From epeter at openjdk.org Mon Feb 24 15:46:09 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 15:46:09 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v9] In-Reply-To: References: <4RO3ysBh6pWId8Na0pTdO9X5sBvCh2F5l-KO3OdHF4k=.31a599cc-195f-4330-a4ce-0618209635de@github.com> Message-ID: <0eEPcfLV8p12YNXlG9it2o87jnSNrcMXhvZTEpliaTo=.361fa421-bb9f-4a05-b272-0d6845a930e9@github.com> On Mon, 24 Feb 2025 15:27:53 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/mulnode.cpp line 1351: >> >>> 1349: const Node* and_node = in(1); >>> 1350: if (and_node->Opcode() == Op_And(bt) && >>> 1351: (mask_t = phase->type(and_node->in(2))->isa_integer(bt)) && >> >> Is this an implicit null check? >> >> Style guide: >>> Do not use ints or pointers as (implicit) booleans with &&, ||, if, while. Instead, compare explicitly, i.e. if (x != 0) or if (ptr != nullptr), etc. > > Honestly, why not just split the `if` into 2 ifs. > `const TypeInteger* mask_t;` > That looks a little less than nice as well. Oh, it seems to me this whole code was not really changed, it just appears moved in the diff. 
Maybe you can rearrange the order so the diff looks smaller, and we don't have to argue about the (questionable) code style here ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1967898812 From roland at openjdk.org Mon Feb 24 15:49:01 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 24 Feb 2025 15:49:01 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Mon, 24 Feb 2025 12:52:42 GMT, Roland Westrelin wrote: >>> @rwestrel I think I had tried some verifications above, but I could not even get it to work in all cases in `SuperWord`. >>> >>> In `VLoop::check_preconditions_helper`, I try to find either the predicate or the multiversioning if. But I cannot always find it, and I think that one reason was that the pre-loop can be lost. At least that is what I remember from 4+ weeks ago. >> >> Do you understand when that happens? It doesn't feel right that the pre loop can be lost. > >> @rwestrel Do you want me to find examples for the pre-loop disappearing? I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above. > > Yes, if not too much work. > @rwestrel I think we should just file an RFE to keep track of these assertions we would like to add once those issues are fixed. That sounds reasonable to me. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2678873056 From epeter at openjdk.org Mon Feb 24 15:49:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 15:49:59 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v9] In-Reply-To: <4RO3ysBh6pWId8Na0pTdO9X5sBvCh2F5l-KO3OdHF4k=.31a599cc-195f-4330-a4ce-0618209635de@github.com> References: <4RO3ysBh6pWId8Na0pTdO9X5sBvCh2F5l-KO3OdHF4k=.31a599cc-195f-4330-a4ce-0618209635de@github.com> Message-ID: On Fri, 14 Feb 2025 15:57:54 GMT, Roland Westrelin wrote: >> This change refactors `RShiftI`/`RshiftL` `Ideal`, `Identity` and >> `Value` because the `int` and `long` versions are very similar and so >> there's no logic duplication. In the process, support for some extra >> transformations is added to `RShiftL`. I also added some new test >> cases. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 15 additional commits since the last revision: > > - review > - review > - review > - Merge branch 'master' into JDK-8349361 > - Update src/hotspot/share/opto/mulnode.cpp > > Co-authored-by: Emanuel Peter > - Update src/hotspot/share/opto/mulnode.cpp > > Co-authored-by: Emanuel Peter > - review > - Update src/hotspot/share/opto/mulnode.hpp > > Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> > - Update src/hotspot/share/opto/mulnode.cpp > > Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> > - Update src/hotspot/share/opto/mulnode.cpp > > Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> > - ... 
and 5 more: https://git.openjdk.org/jdk/compare/bdb3f04d...5b05d222 test/hotspot/jtreg/compiler/c2/irTests/RShiftLNodeIdealizationTests.java line 125: > 123: final static int test7Shift = RunInfo.getRandom().nextInt(32) + 32; > 124: final static long test7Min = -1L << (64 - test7Shift -1); > 125: final static long test7Max = ~test7Min; Would you mind adding a quick comment about why you chose the values the way you do? test/hotspot/jtreg/compiler/c2/irTests/RShiftLNodeIdealizationTests.java line 145: > 143: public long test9(long x) { > 144: x = Integer.max(Integer.min((int)x, (int)test7Max), (int)(test7Min-1)); > 145: return ((x << test7Shift) >> test7Shift); It could be nice to have some test cases where both shift values are completely randomized. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1967923729 PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1967922247 From kxu at openjdk.org Mon Feb 24 15:50:55 2025 From: kxu at openjdk.org (Kangcheng Xu) Date: Mon, 24 Feb 2025 15:50:55 GMT Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value [v2] In-Reply-To: References: Message-ID: <_QrrSpaW37aI0oSS4eJx0NE9KIdg-z3Fs-BUaKZIYD0=.50fe94ff-f8c2-487b-b17a-f4b2b70a330a@github.com> On Fri, 7 Feb 2025 00:17:16 GMT, Dean Long wrote: >> Kangcheng Xu has updated the pull request incrementally with two additional commits since the last revision: >> >> - use explicit argument types for overloaded java_shift_left() >> - use java_shift_left() > > It's not clear what result you are expecting from the new shift code when it overflows. It looks like signed overflow undefined behavior (UB) to me, which would ideally be caught by UBSAN, so you probably want to make sure your changes are ubsan-clean. If the desired result for overflow is 0, then I think java_shift_left() should be used. > @dean-long: It looks like signed overflow undefined behavior Yes, you are right. 
I shouldn't rely on this UB. Per your suggestion, I switched to `java_shift_left()` and `jlong()`. Hopefully, this makes my intention clearer. This PR is ready for review. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23506#issuecomment-2678877597 From never at openjdk.org Mon Feb 24 17:31:06 2025 From: never at openjdk.org (Tom Rodriguez) Date: Mon, 24 Feb 2025 17:31:06 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v3] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Wed, 19 Feb 2025 00:37:14 GMT, Dean Long wrote: >> When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. >> >> In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. >> >> Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. I need help testing those asserts for arm32, ppc, riscv, and s390. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > Stricter assertion on ppc64 src/hotspot/share/runtime/deoptimization.cpp line 650: > 648: // would need to get the size from the resolved method entry. Another exception would > 649: // be an invokedynamic with an adapter that is really a MethodHandle linker. 
> 650: caller_was_method_handle = true; This flag also controls the code at 711 that controls the computation of caller_adjustment. Is the new answer also correct for that code? This code might be a bit clearer if the computations of caller_was_method_handle, caller_adjustment and the new caller_actual_parameters were all closer together, though that might complicate a backport so maybe it should be deferred to some later cleanup. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23557#discussion_r1968100587 From duke at openjdk.org Mon Feb 24 18:22:59 2025 From: duke at openjdk.org (Johannes Graham) Date: Mon, 24 Feb 2025 18:22:59 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v27] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 08:43:39 GMT, Emanuel Peter wrote: >> Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: >> >> formatting, remove commented tests > > I also see that https://github.com/openjdk/jdk/pull/2776 and https://github.com/openjdk/jdk/pull/4136 were mentioned here. Both of those are related an have no IR tests of their own, yikes! We have to ensure that we cover those old cases, and then new ones here, so that we do not get any accidental regressions. > > Maybe that's all already covered in other existing tests or the tests you added. Can you please provide a summary of all tests and what cases they cover in the PR description? It would help a lot for reviewing. Hi, @eme64, do you have any additional comments on this PR? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23089#issuecomment-2679297323 From mbaesken at openjdk.org Mon Feb 24 18:35:31 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Mon, 24 Feb 2025 18:35:31 GMT Subject: RFR: 8350585: InlineSecondarySupersTest must be guarded on ppc64 by COMPILER2 Message-ID: In the minimal build we run into this error (e.g. 
on AIX) : src/hotspot/cpu/ppc/stubGenerator_ppc.cpp:4894:12: error: use of undeclared identifier 'InlineSecondarySupersTest' if (!InlineSecondarySupersTest) { The reason is the missing `COMPILER2` define check in the ppc64 coding. ------------- Commit messages: - JDK-8350585 Changes: https://git.openjdk.org/jdk/pull/23753/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23753&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350585 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23753.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23753/head:pull/23753 PR: https://git.openjdk.org/jdk/pull/23753 From lmesnik at openjdk.org Mon Feb 24 20:41:56 2025 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Mon, 24 Feb 2025 20:41:56 GMT Subject: RFR: 8339889: Several compiler tests ignore vm flags and not marked as flagless [v2] In-Reply-To: <7w3Xg7S_9ruBGIBd6sZa_9byn11QMPKXFpVPziooI5U=.3a957c35-2a04-4b2c-becb-24b4b0fe9175@github.com> References: <5acZ_FmW23VeDgOFMiEuUa60TLxaOcC3wWZVwHFh8EU=.95188fc9-2f54-47af-a91c-4855db76f399@github.com> <7w3Xg7S_9ruBGIBd6sZa_9byn11QMPKXFpVPziooI5U=.3a957c35-2a04-4b2c-becb-24b4b0fe9175@github.com> Message-ID: <5nHsp5GGwdU0BOnKsZT-LUOKJO7I7RrllpsR-6coTd8=.993e1beb-0414-41b7-83c3-9c7cfd117ca2@github.com> On Mon, 17 Feb 2025 08:27:28 GMT, Damon Fenacci wrote: > Thanks for "cleaning this up" @lmesnik. > > I just ran a quick grep on `test/hotspot/jtreg/compiler` and noticed that there are a few more tests that use `ProcessTools.createLimitedTestJavaProcessBuilder` but don't have `vm.flagless` and don't seem to be covered by other JBS issues (e.g. `compiler/codecache/CheckLargePages.java`, `compiler/onSpinWait/TestOnSpinWaitAArch64DefaultFlags.java`, `compiler/jvmci/TestUncaughtErrorInCompileMethod.java` or `compiler/jvmci/compilerToVM/GetFlagValueTest.java`). 
Their main method runs in a new VM (`@run main/othervm`) but then they run other processes with `ProcessTools.createLimitedTestJavaProcessBuilder `. As I understand, vm flags would only affect the main method (which supposedly is not what is being tested). So, I was wondering if it made sense to mark them flagless as well anyway. Thanks, we have a list of bugs to fix: I filed https://bugs.openjdk.org/browse/JDK-8350603 ------------- PR Comment: https://git.openjdk.org/jdk/pull/23224#issuecomment-2679592774 From dlong at openjdk.org Mon Feb 24 22:36:56 2025 From: dlong at openjdk.org (Dean Long) Date: Mon, 24 Feb 2025 22:36:56 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v3] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: <6I2PyXMG5jSH3dmfnmUvUOrtZ9ntwjkZEw2GqFFCsNg=.de6024a5-5c60-4842-afe5-3d878b65bb6c@github.com> On Mon, 24 Feb 2025 17:28:03 GMT, Tom Rodriguez wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> Stricter assertion on ppc64 > > src/hotspot/share/runtime/deoptimization.cpp line 650: > >> 648: // would need to get the size from the resolved method entry. Another exception would >> 649: // be an invokedynamic with an adapter that is really a MethodHandle linker. >> 650: caller_was_method_handle = true; > > This flag also controls the code at 711 that controls the computation of caller_adjustment. Is the new answer also correct for that code? > > This code might be a bit clearer if the computations of caller_was_method_handle, caller_adjustment and the new caller_actual_parameters were all closer together, though that might complicate a backport so maybe it should be deferred to some later cleanup. Yes, I have further cleanup that I want to do later, but I want to minimize changes in this one to simplify backports. Good catch about line 711. 
I left it in on purpose, again to simplify backports, but it could be safely removed. All it does here is over-estimate the adjustment, which is harmless. In future cleanups, I hope to make the adjustment exact rather than a possibly over-estimated increment. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23557#discussion_r1968511371 From dlong at openjdk.org Mon Feb 24 23:49:52 2025 From: dlong at openjdk.org (Dean Long) Date: Mon, 24 Feb 2025 23:49:52 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 10:42:25 GMT, Marc Chevalier wrote: > I didn't add a test for the simple `1 << 33` since my code doesn't kick in unless there are 2 shifts, so nothing should have changed here. I can add such a test if you think it's needed. I don't know if it's needed (we might have that covered in other tests), but adding now seems like a good idea for completeness. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23728#issuecomment-2679958314 From psandoz at openjdk.org Tue Feb 25 00:02:53 2025 From: psandoz at openjdk.org (Paul Sandoz) Date: Tue, 25 Feb 2025 00:02:53 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: <52HO_iL9asn1huCdJj82R1AwF1w8ON9HZetrdc9rQyQ=.28e137e0-a7f7-4839-a3e7-eda4f8a6c4f5@github.com> Message-ID: On Mon, 17 Feb 2025 06:40:11 GMT, Nicole Xu wrote: >> Please try with following command line >> `java -jar target/benchmarks.jar -f 1 -i 2 -wi 1 -w 30 -p ARRAYLEN=30 MaskedLogic` > > Thanks for pointing that out. Typically, ARRAYLEN is almost always a POT value, which is also assumed by many other benchmarks. Are we realistically going to test with an ARRAYLEN of 30? > > I think the POT assumption is reasonable for our purposes. It's a reasonable assumption. 
Since `ARRAYLEN` is a parameter of the benchmark, we should enforce that constraint in the benchmark initialization method, checking whether the value is a power of two (POT) and failing otherwise.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/22963#discussion_r1968595640

From kvn at openjdk.org Tue Feb 25 00:37:00 2025
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 25 Feb 2025 00:37:00 GMT
Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
In-Reply-To:
References:
Message-ID: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com>

On Mon, 24 Feb 2025 08:00:24 GMT, Emanuel Peter wrote:

> But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more.

Okay, we are back to our previous conversation: we will wait for your aliasing-analysis runtime-checks implementation and do performance runs to see if the "slow" path affects performance.

Okay.

PS: "slow" path implies that it is not taken frequently and that it should not affect the general performance of the application.
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2680031423 From dhanalla at openjdk.org Tue Feb 25 01:02:02 2025 From: dhanalla at openjdk.org (Dhamoder Nalla) Date: Tue, 25 Feb 2025 01:02:02 GMT Subject: RFR: 8350609: cleanup unknown unwind opcode (0xB) for windows Message-ID: This PR is to clean-up unknown unwind opcodes (0xB) in Windows intrinsic functions introduced in commit https://github.com/openjdk/jdk17u-dev/commit/9f05c411e6d6bdf612cf0cf8b9fe4ca9ecde50d1#diff-a024df6bcd94607260545e647922261703a652dee1afadb1fa758f6e74a568d1 ![image](https://github.com/user-attachments/assets/5b295365-ba8e-4fd6-8b8b-f7243f80a496) According to the Windows unwind Opcodes outlined at https://learn.microsoft.com/en-us/cpp/build/exception-handling-x64?view=msvc-170#unwind-operation-code, the opcode 0xB (1011) is not a valid Opcode, as the valid opcodes range from 0 to 10. ------------- Commit messages: - cleanup unknown unwind opcode (0xB) for windows Changes: https://git.openjdk.org/jdk/pull/23707/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23707&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350609 Stats: 112 lines in 22 files changed: 0 ins; 88 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/23707.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23707/head:pull/23707 PR: https://git.openjdk.org/jdk/pull/23707 From sviswanathan at openjdk.org Tue Feb 25 01:02:02 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 25 Feb 2025 01:02:02 GMT Subject: RFR: 8350609: cleanup unknown unwind opcode (0xB) for windows In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 03:58:17 GMT, Dhamoder Nalla wrote: > This PR is to clean-up unknown unwind opcodes (0xB) in Windows intrinsic functions introduced in commit https://github.com/openjdk/jdk17u-dev/commit/9f05c411e6d6bdf612cf0cf8b9fe4ca9ecde50d1#diff-a024df6bcd94607260545e647922261703a652dee1afadb1fa758f6e74a568d1 > > 
![image](https://github.com/user-attachments/assets/5b295365-ba8e-4fd6-8b8b-f7243f80a496) > > According to the Windows unwind Opcodes outlined at https://learn.microsoft.com/en-us/cpp/build/exception-handling-x64?view=msvc-170#unwind-operation-code, the opcode 0xB (1011) is not a valid Opcode, as the valid opcodes range from 0 to 10. Thanks for fixing this. It looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23707#pullrequestreview-2638832373 From xgong at openjdk.org Tue Feb 25 01:44:53 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 25 Feb 2025 01:44:53 GMT Subject: RFR: 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 01:47:10 GMT, Xiaohong Gong wrote: > Since PR [1] has added several new vector operations in VectorAPI and the X86 backend implementation for them, this patch adds the AArch64 backend part for NEON/SVE architectures. > > The performance of Vector API relative JMH micro benchmarks can improve about 70x ~ 95x on a NVIDIA Grace CPU, which is a 128-bit vector length sve2 architecture, with different UseSVE options. 
Here is the gain details: > > > Benchmark (size) Mode Cnt -XX:UseSVE=0 -XX:UseSVE=1 -XX:UseSVE=2 > ByteMaxVector.SADD 1024 thrpt 30 80.69x 79.70x 80.534x > ByteMaxVector.SADDMasked 1024 thrpt 30 84.08x 85.72x 85.901x > ByteMaxVector.SSUB 1024 thrpt 30 80.46x 80.27x 81.063x > ByteMaxVector.SSUBMasked 1024 thrpt 30 83.96x 85.26x 85.887x > ByteMaxVector.SUADD 1024 thrpt 30 80.43x 80.36x 81.761x > ByteMaxVector.SUADDMasked 1024 thrpt 30 83.40x 84.62x 85.199x > ByteMaxVector.SUSUB 1024 thrpt 30 79.93x 79.22x 79.714x > ByteMaxVector.SUSUBMasked 1024 thrpt 30 82.93x 85.02x 84.726x > ByteMaxVector.UMAX 1024 thrpt 30 78.73x 77.39x 78.220x > ByteMaxVector.UMAXMasked 1024 thrpt 30 82.62x 84.77x 85.531x > ByteMaxVector.UMIN 1024 thrpt 30 79.04x 77.80x 78.471x > ByteMaxVector.UMINMasked 1024 thrpt 30 83.11x 84.86x 86.126x > IntMaxVector.SADD 1024 thrpt 30 83.11x 83.07x 83.183x > IntMaxVector.SADDMasked 1024 thrpt 30 90.67x 91.80x 93.162x > IntMaxVector.SSUB 1024 thrpt 30 83.37x 82.82x 83.317x > IntMaxVector.SSUBMasked 1024 thrpt 30 90.85x 92.87x 94.201x > IntMaxVector.SUADD 1024 thrpt 30 82.76x 81.78x 82.679x > IntMaxVector.SUADDMasked 1024 thrpt 30 90.49x 91.93x 93.155x > IntMaxVector.SUSUB 1024 thrpt 30 82.92x 82.34x 82.525x > IntMaxVector.SUSUBMasked 1024 thrpt 30 90.60x 92.12x 92.951x > IntMaxVector.UMAX 1024 thrpt 30 82.40x 81.85x 82.242x > IntMaxVector.UMAXMasked 1024 thrpt 30 90.30x 92.10x 92.587x > IntMaxVector.UMIN 1024 thrpt 30 82.84x 81.43x 82.801x > IntMaxVector.UMINMasked 1024 thrpt 30 90.43x 91.49x 92.678x > LongMaxVector.SADD 1024 thrpt 30 82.01x 81.74x 82.153x > LongMaxVector... @theRealAph , could you please help take a look at this PR? Thanks a lot in advance! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23608#issuecomment-2680151515 From never at openjdk.org Tue Feb 25 02:03:57 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 25 Feb 2025 02:03:57 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v3] In-Reply-To: <6I2PyXMG5jSH3dmfnmUvUOrtZ9ntwjkZEw2GqFFCsNg=.de6024a5-5c60-4842-afe5-3d878b65bb6c@github.com> References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> <6I2PyXMG5jSH3dmfnmUvUOrtZ9ntwjkZEw2GqFFCsNg=.de6024a5-5c60-4842-afe5-3d878b65bb6c@github.com> Message-ID: On Mon, 24 Feb 2025 22:34:01 GMT, Dean Long wrote: >> src/hotspot/share/runtime/deoptimization.cpp line 650: >> >>> 648: // would need to get the size from the resolved method entry. Another exception would >>> 649: // be an invokedynamic with an adapter that is really a MethodHandle linker. >>> 650: caller_was_method_handle = true; >> >> This flag also controls the code at 711 that controls the computation of caller_adjustment. Is the new answer also correct for that code? >> >> This code might be a bit clearer if the computations of caller_was_method_handle, caller_adjustment and the new caller_actual_parameters were all closer together, though that might complicate a backport so maybe it should be deferred to some later cleanup. > > Yes, I have further cleanup that I want to do later, but I want to minimize changes in this one to simplify backports. > Good catch about line 711. I left it in on purpose, again to simplify backports, but it could be safely removed. All it does here is over-estimate the adjustment, which is harmless. In future cleanups, I hope to make the adjustment exact rather than a possibly over-estimated increment. Sounds good. I kind of assumed it was a benign oversizing. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23557#discussion_r1968699203 From never at openjdk.org Tue Feb 25 02:11:54 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 25 Feb 2025 02:11:54 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v3] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Wed, 19 Feb 2025 00:37:14 GMT, Dean Long wrote: >> When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. >> >> In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. >> >> Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. I need help testing those asserts for arm32, ppc, riscv, and s390. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > Stricter assertion on ppc64 The new asserts look good and the logic seems right. ------------- Marked as reviewed by never (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23557#pullrequestreview-2638933792

From dlong at openjdk.org Tue Feb 25 02:33:54 2025
From: dlong at openjdk.org (Dean Long)
Date: Tue, 25 Feb 2025 02:33:54 GMT
Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value [v2]
In-Reply-To:
References:
Message-ID: <3S81AfCTcqoDGUAStziUPAuo0UKFU28xY7lYBvz9cko=.ee1b07fc-8291-4b11-9765-1b478202663c@github.com>

On Tue, 18 Feb 2025 19:27:22 GMT, Kangcheng Xu wrote:

>> [JDK-8347555](https://bugs.openjdk.org/browse/JDK-8347555) is a redo of [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495), which was [first merged](https://git.openjdk.org/jdk/pull/20754) then backed out due to a regression. This patch redoes the feature and fixes the bit shift overflow problem. For more information please refer to the previous PR.
>>
>> When constantizing multiplications (possibly in the form of `lshifts`), the multiplier is upgraded to long and then later narrowed to int if needed. However, when a `lshift` operand is exactly `32`, overflowing an int, using long has an unexpected result (i.e., `(1 << 32) = 1` and `(int) (1L << 32) = 0`).
>>
>> The following was implemented to address this issue.
>>
>>     if (UseNewCode2) {
>>         *multiplier = bt == T_INT
>>             ? (jlong) (1 << con->get_int()) // loss of precision is expected for int as it overflows
>>             : ((jlong) 1) << con->get_int();
>>     } else {
>>         *multiplier = ((jlong) 1 << con->get_int());
>>     }
>>
>>
>> Two new bitshift overflow tests were added.
>
> Kangcheng Xu has updated the pull request incrementally with two additional commits since the last revision:
>
> - use explicit argument types for overloaded java_shift_left()
> - use java_shift_left()

You might want to have the reviewers from the original JDK-8325495 double check this.
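
The two values quoted in the description above, `(1 << 32) = 1` and `(int) (1L << 32) = 0`, follow from Java's shift rules: the shift distance is masked to 5 bits for an `int` left operand and to 6 bits for a `long` one. A minimal Java sketch that reproduces both results is given below. The class and method names are hypothetical, and the sketch only illustrates the Java-level semantics; it makes no claim about how `java_shift_left()` is implemented in HotSpot.

```java
// Hypothetical demo class; illustrates Java shift semantics only.
public final class ShiftOverflowDemo {
    private ShiftOverflowDemo() {}

    // int shift: only the low 5 bits of the distance are used,
    // so a distance of 32 behaves like a distance of 0.
    public static int intShift(int distance) {
        return 1 << distance;
    }

    // long shift (low 6 bits of the distance are used), then narrowed to int:
    // for a distance of 32 the set bit lands in bit 32 and is dropped by the cast.
    public static int longShiftNarrowed(int distance) {
        return (int) (1L << distance);
    }

    public static void main(String[] args) {
        System.out.println(intShift(32));          // 1
        System.out.println(longShiftNarrowed(32)); // 0
        System.out.println(intShift(31) == longShiftNarrowed(31)); // true
    }
}
```

For distances from 0 to 31 the two forms agree, so the mismatch only shows up once the shift distance reaches 32.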
@rwestrel @chhagedorn

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23506#issuecomment-2680235803

From amitkumar at openjdk.org Tue Feb 25 02:58:51 2025
From: amitkumar at openjdk.org (Amit Kumar)
Date: Tue, 25 Feb 2025 02:58:51 GMT
Subject: RFR: 8350585: InlineSecondarySupersTest must be guarded on ppc64 by COMPILER2
In-Reply-To:
References:
Message-ID:

On Mon, 24 Feb 2025 18:31:01 GMT, Matthias Baesken wrote:

> In the minimal build we run into this error (e.g. on AIX):
>
> src/hotspot/cpu/ppc/stubGenerator_ppc.cpp:4894:12: error: use of undeclared identifier 'InlineSecondarySupersTest'
> if (!InlineSecondarySupersTest) {
>
>
> The reason is the missing `COMPILER2` define check in the ppc64 code.

Looks good.

-------------

Marked as reviewed by amitkumar (Committer).

PR Review: https://git.openjdk.org/jdk/pull/23753#pullrequestreview-2639018146

From fyang at openjdk.org Tue Feb 25 03:58:54 2025
From: fyang at openjdk.org (Fei Yang)
Date: Tue, 25 Feb 2025 03:58:54 GMT
Subject: RFR: 8350095: RISC-V: Refactor string_compare [v5]
In-Reply-To:
References:
Message-ID:

On Mon, 24 Feb 2025 15:11:29 GMT, Hamlin Li wrote:

>> Hi,
>> Can you help to review this patch?
>>
>> Currently, the `string_compare` code is a bit complicated. The main reasons include:
>> 1. it has two separate pieces of code for the LU and UL cases; this is not necessary, as LU and UL behave quite similarly.
>> 2. it mixes the LL/UU and LU/UL cases together; it is better to separate them, as they are quite different from each other.
>>
>> This is not good for reading and maintaining the code.
>>
>>
>> So, this patch does the following refactoring:
>> 1. merge LU and UL code into one, i.e. remove UL code.
>> 2. separate the code into 2 methods: LL/UU and LU/UL.
>> 3. some other miscellaneous improvements.
>>
>> I could do the following refactoring in a follow-up PR, as in this PR I'm just moving and removing code, which makes it easier to do and review.
In particular the first one, as it needs to rewrite the existing code for UL/LU case. >> 1. move alignment code of `generate_compare_long_string_different_encoding` upwards to `string_compare_long_LU`. >> 2. make `SHORT_STRING` case simpler. >> >> >> >> Thanks > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > clean 2 Do you mind several more tweaks? Looks good otherwise. Thanks. src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1389: > 1387: const bool isLL, Register cnt1, Register cnt2, > 1388: Register tmp1, Register tmp2, Register tmp3, > 1389: const int STUB_THRESHOLD, Label *DONE, Label *STUB) { Personally I prefer to put together STUB_THRESHOLD and STUB, like : `const int STUB_THRESHOLD, Label *STUB, Label *DONE)`. Similar for `C2_MacroAssembler::string_compare_long_different_encoding`. src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1434: > 1432: add(str1, str1, cnt2); > 1433: sub(cnt2, zr, cnt2); > 1434: Unnecessary new line. src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1480: > 1478: void C2_MacroAssembler::string_compare_long_different_encoding(Register result, Register str1, Register str2, > 1479: bool isLU, Register cnt1, Register cnt2, > 1480: Register tmpL, Register tmpU, Register tmp3, There is a naming inconsistency for the params. The prototype in the header file says `Register tmp1, Register tmp2, Register tmp3,`. I think we can still use that naming here and create aliases like `tmpL` and `tmpU` as needed in this routine. Like: `Register tmpL = tmp1; Register tmpU = tmp2;` src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1505: > 1503: sub(cnt2, zr, cnt2); > 1504: addi(cnt1, cnt1, 4); > 1505: Unnecessary new line. src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1596: > 1594: // Load 4 bytes once to compare for alignment before main loop. Note that this > 1595: // is only possible for LL/UU case. 
We need to resort to load_long_misaligned > 1596: // for both LU and UL cases. This code comment needs to be moved to the proper place. src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 2706: > 2704: __ bltz(cnt2, TAIL); > 2705: __ bind(SMALL_LOOP); > 2706: compare_string_16_bytes_same(DIFF, DIFF2, result, str1, cnt1, str2, tmp1, tmp2, tmp4, tmp5); Witnessed quite some implicit register dependencies here. And there is only one callsite of `compare_string_16_bytes_same`. Seems more readable if we inline its code here. ------------- PR Review: https://git.openjdk.org/jdk/pull/23633#pullrequestreview-2638980565 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1968812819 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1968789299 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1968741111 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1968789215 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1968729207 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1968726664 From dlong at openjdk.org Tue Feb 25 04:05:05 2025 From: dlong at openjdk.org (Dean Long) Date: Tue, 25 Feb 2025 04:05:05 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v3] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Wed, 19 Feb 2025 00:37:14 GMT, Dean Long wrote: >> When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. >> >> In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. 
There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. >> >> Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. I need help testing those asserts for arm32, ppc, riscv, and s390. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > Stricter assertion on ppc64 Thanks, Tom, for the review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23557#issuecomment-2680380837 From duke at openjdk.org Tue Feb 25 06:13:08 2025 From: duke at openjdk.org (Matthias Ernst) Date: Tue, 25 Feb 2025 06:13:08 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 07:18:52 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. >> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: >> >> >> ((base + index << 8) + 256) & 255 >> => MulNode (this PR) >> (base + index << 8) & 255 >> => MulNode (PR #6697) >> base & 255 (loop invariant) >> >> >> Implementation notes: >> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. 
>> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ >> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > incorporate @eme64's comment suggestions Related: https://bugs.openjdk.org/browse/JDK-8288683 I'm still trying to understand: - why this is intermittent, it seems to me that it should trigger deterministically - what's the severity of this? do we need to back out or is this merely an indicator for a missed downstream optimization. 8288683 has a case where execution would be incorrect, also not sure I understand how that can happen. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2680722140 From thartmann at openjdk.org Tue Feb 25 07:05:53 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 25 Feb 2025 07:05:53 GMT Subject: RFR: 8339889: Several compiler tests ignore vm flags and not marked as flagless [v2] In-Reply-To: References: <5acZ_FmW23VeDgOFMiEuUa60TLxaOcC3wWZVwHFh8EU=.95188fc9-2f54-47af-a91c-4855db76f399@github.com> Message-ID: On Tue, 11 Feb 2025 22:43:35 GMT, Leonid Mesnik wrote: >> Tests >> compiler/c2/TestReduceAllocationAndHeapDump.java >> compiler/calls/NativeCalls.java >> compiler/debug/TestStress.java >> compiler/inlining/TestDuplicatedLateInliningOutput.java >> ignore vm flags using limited process builder and not marked as flagless. >> >> Please note that test >> compiler/inlining/TestDuplicatedLateInliningOutput.java >> is failing with some VM flags. See >> https://bugs.openjdk.org/browse/JDK-8348214 >> >> I haven't excluded test, since it fail with certain non-common flags only. 
> > Leonid Mesnik has updated the pull request incrementally with one additional commit since the last revision: > > test updated as suggested. Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23224#pullrequestreview-2639749781 From epeter at openjdk.org Tue Feb 25 07:11:55 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 07:11:55 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> References: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> Message-ID: On Tue, 25 Feb 2025 00:34:14 GMT, Vladimir Kozlov wrote: > > But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. > > Okay, we are back to our previous conversation - we will wait your aliasing-analysis runtime-checks implementation and do performance runs to see if "slow" path affects performance. > > Okay. Sounds good, we will revisit and write more benchmarks there. > > PS: "slow" path implies that it is not taking frequently and it should not affect general performance of application. For me "slow" just means less optimized, because some assumption does not hold. The "fast" path is faster, because it has more assumptions and can optimize more (i.e. vectorize in this case, or vectorize more instructions). Do you have a better name than "fast/slow"? 
-------------

PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2680885496

From epeter at openjdk.org Tue Feb 25 07:15:56 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 25 Feb 2025 07:15:56 GMT
Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
In-Reply-To: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com>
References: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com>
Message-ID:

On Tue, 25 Feb 2025 00:34:14 GMT, Vladimir Kozlov wrote:

>> @vnkozlov I mean the issue is this: once I implement aliasing-analysis runtime-checks with this multiversion approach, then we'd get regressions if we do not optimize the slow path loop. Currently, we would not vectorize (because we have to be ready for aliasing cases), but we at least unroll, and do whatever else we can except vectorization. But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. I think we need to avoid that - would you agree?
>
>> But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more.
>
> Okay, we are back to our previous conversation: we will wait for your aliasing-analysis runtime-checks implementation and do performance runs to see if the "slow" path affects performance.
>
> Okay.
>
> PS: "slow" path implies that it is not taken frequently and that it should not affect the general performance of the application.

@vnkozlov @rwestrel Let me summarize the tasks left to do here:
- Rename `stalled` -> `delayed`, and `unstall` -> `resume_optimizations` or the like. Improve some comments.
- File follow-up RFE for more verification (must find multiversion-if from multiversioned loop) - currently blocked by predicate traversal issue.
Maybe we can also assert that we can always find the pre-loop from the main-loop, at least during loop-opts. - When working on aliasing-analysis runtime-check, we have to do more performance analysis, and show the need of both the fast and slow path loops. Let me know if there is more ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2680894298 From epeter at openjdk.org Tue Feb 25 07:26:07 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 07:26:07 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 06:09:54 GMT, Matthias Ernst wrote: > Related: https://bugs.openjdk.org/browse/JDK-8288683 > > I'm still trying to understand: > > * why this is intermittent, it seems to me that it should trigger deterministically Sometimes that is due to timing, i.e. differences in profiling, what code gets inlined etc. That can change the shape of the IR and the order in which it is processed. That can change the order in which things get folded and propagated, and so in some cases the bug triggers and in others not. It would be good to extract a smaller reproducer where it triggers more reliably. Here `StressIGVN` and `StressCCP` with `RepeatCompilation` can be your friends. > * what's the severity of this? do we need to back out or is this merely an indicator for a missed downstream optimization. 8288683 has a case where execution would be incorrect, also not sure I understand how that can happen. Let me explain: - In IGVN, we only narrow types. If you miss an optimization, then the result is suboptimal, but still correct. - In CCP, we start with empty types and widen them until the types are correct. If we miss a step, then the type is too narrow and the result is incorrect. Imagine a type is too narrow, and then we have some condition based on it. Imagine the range is wrongly `[0,5]` but should be `[0,10]`. 
If we check if the value could be `9`, we think it is `false`, but it could actually be `true`. We wrongly constant fold a condition and get wrong results. Not sure if any of this applies in that example, but it might. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2680916268 From epeter at openjdk.org Tue Feb 25 07:26:08 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 07:26:08 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24] In-Reply-To: References: Message-ID: On Fri, 14 Feb 2025 07:18:52 GMT, Matthias Ernst wrote: >> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization. >> >> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`: >> >> >> (base + (index + 1) << 8) & 255 >> => MulNode >> (base + (index << 8 + 256)) & 255 >> => AddNode >> ((base + index << 8) + 256) & 255 >> >> >> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction: >> >> >> ((base + index << 8) + 256) & 255 >> => MulNode (this PR) >> (base + index << 8) & 255 >> => MulNode (PR #6697) >> base & 255 (loop invariant) >> >> >> Implementation notes: >> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR. >> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~ >> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). 
Let me know if would like to see separate cases for these.~ > > Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: > > incorporate @eme64's comment suggestions About severity: As long as we find and integrate a fix during `JDK25` it is fine (the issue does not break the CI that badly at the moment). If we get close to rampdown, then we have to consider if we want to backout 8346664 or if we defer the bug to `JDK26`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2680920491 From epeter at openjdk.org Tue Feb 25 07:53:02 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 07:53:02 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v33] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 07:44:55 GMT, Emanuel Peter wrote: >> Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: >> >> update tests > > test/hotspot/jtreg/compiler/c2/irTests/XorINodeIdealizationTests.java line 334: > >> 332: var xor = (x & 0b111) ^ (y & 0b100); >> 333: return xor < 0b1000; >> 334: } > > These are nice simple examples, and we should keep them. But I'm missing these cases with randomization. > > `calc_xor_upper_bound_of_non_neg` basically has two input ranges `[0, hi_0]` and `[0, hi_1]`, and gives us a new range `[0,max]`. > > You could produce random input ranges like this: > > public void test(int x) { > int lo_x = con1; > int hi_x = con2; > x = x < lo_x ? lo_x : (x > hi_x ? hi_x : x); > // x clamped to [lo_x, hi_x] > int lo_y = con3; > int hi_y = con4; > y = y < lo_y ? lo_y : (y > hi_y ? hi_y : y); > // y clamped to [lo_y, hi_y] > int z = x ^ y; > // This should now have a new range, possibly some [0, max] > // Now let's test the range with some random if branches. 
> int sum = 0; > if (z > somecon1) { sum += 1; } > if (z > somecon2) { sum += 2; } > if (z > somecon3) { sum += 4; } > // maybe add a few more... > if (z > someconi) { sum += pow(2,i); } > return sum; > } > > The `sum` at the end gives you a summary over all the checks. If one wrongly constant folds, you'll be missing one of the power of 2 contributions to it, or have it wrongly added. > Now you do this with an `x` and a `y` > > Maybe that's a little over-engineered, but it would target the `calc_xor_upper_bound_of_non_neg` logic really well. > What do you think? FYI I'm working on fuzzing the compiler currently, so I'm thinking about how to write such tests more generally, that's why I took the time to come up with this. If you have better ideas, I'm more than happy to see them ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1969164548 From epeter at openjdk.org Tue Feb 25 07:53:01 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 07:53:01 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v33] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 16:25:48 GMT, Johannes Graham wrote: >> An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. >> >> This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it. 
>> >> In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items: >> - Bounds optimization of xor >> - A check for `x ^ x = 0` >> - Explicit testing of xor over booleans. >> >> Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization. It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types. >> >> --------- >> ### Progress >> - [x] Change must not contain extraneous whitespace >> - [x] Commit message must refer to an issue >> - [ ] Change must be properly reviewed (2 reviews required, with at least 2 [Reviewers](https://openjdk.org/bylaws#reviewer)) >> >> >> >> ### Reviewers >> * [Quan Anh Mai](https://openjdk.org/census#qamai) (@merykitty - Committer) ? Re-review required (review applies to [cf779497](https://git.openjdk.org/jdk/pull/23089/files/cf77949776f7a4601268c7291a5743c2eb164186)) >> >> ### Reviewing >>
Using git >> >> Checkout this PR locally: \ >> `$ git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089` \ >> `$ git checkout pull/23089` >> >> Update a local copy of the PR: \ >> `$ git checkout pull/23089` \ >> `$ git pull https://git.openjdk.org/jdk.git pull/23089/head` >> >>
>>
Using Skara CLI tools >> >> Checkout this PR locally: \ >> `$ git pr checkout 23089` >> >> View PR using the GUI difftool: \ >> `$ git pr show -t 23089` >> >>
>>
Using diff file >> >> Download this PR as a diff file: \ >> https://git.openjdk.org/jdk/pull/23089.diff >> >>
>>
Using Webrev >> >> [Link to Webrev Comment](https://git.openjdk.org/jdk/pull/23089#issuecomment-25939... > > Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: > > update tests Thanks for the ping! Nice work, thanks for the improvements. I have a few more suggestions. src/hotspot/share/opto/addnode.cpp line 1079: > 1077: julong max = calc_xor_upper_bound_of_non_neg(r0->_hi, r1->_hi); > 1078: return TypeLong::make(0, max, MAX2(r0->_widen, r1->_widen)); > 1079: } Suggestion: if (r0->_lo >= 0 && r1->_lo >= 0) { // Combine [0, lo_1] ^ [0, hi_1] -> [0, max] julong max = calc_xor_upper_bound_of_non_neg(r0->_hi, r1->_hi); return TypeLong::make(0, max, MAX2(r0->_widen, r1->_widen)); } Some comment like this would be nice, it matches the `Constant fold` comment above. test/hotspot/jtreg/compiler/c2/irTests/XorINodeIdealizationTests.java line 236: > 234: public int testConstXor() { > 235: return CONST_1 ^ CONST_2; > 236: } Nice, that's a test for constant folding with random constants. test/hotspot/jtreg/compiler/c2/irTests/XorINodeIdealizationTests.java line 334: > 332: var xor = (x & 0b111) ^ (y & 0b100); > 333: return xor < 0b1000; > 334: } These are nice simple examples, and we should keep them. But I'm missing these cases with randomization. `calc_xor_upper_bound_of_non_neg` basically has two input ranges `[0, hi_0]` and `[0, hi_1]`, and gives us a new range `[0,max]`. You could produce random input ranges like this: public void test(int x) { int lo_x = con1; int hi_x = con2; x = x < lo_x ? lo_x : (x > hi_x ? hi_x : x); // x clamped to [lo_x, hi_x] int lo_y = con3; int hi_y = con4; y = y < lo_y ? lo_y : (y > hi_y ? hi_y : y); // y clamped to [lo_y, hi_y] int z = x ^ y; // This should now have a new range, possibly some [0, max] // Now let's test the range with some random if branches. 
int sum = 0; if (z > somecon1) { sum += 1; } if (z > somecon2) { sum += 2; } if (z > somecon3) { sum += 4; } // maybe add a few more... if (z > someconi) { sum += pow(2,i); } return sum; } The `sum` at the end gives you a summary over all the checks. If one wrongly constant folds, you'll be missing one of the power of 2 contributions to it, or have it wrongly added. Now you do this with an `x` and a `y` Maybe that's a little over-engineered, but it would target the `calc_xor_upper_bound_of_non_neg` logic really well. What do you think? ------------- PR Review: https://git.openjdk.org/jdk/pull/23089#pullrequestreview-2639832127 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1969162329 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1969127608 PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1969157206 From duke at openjdk.org Tue Feb 25 07:56:32 2025 From: duke at openjdk.org (Marc Chevalier) Date: Tue, 25 Feb 2025 07:56:32 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v3] In-Reply-To: References: Message-ID: > Remove frem and drem macros nodes when the result is not used. These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs. 
> > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Improve deletion of [fr]rem after parsing ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23694/files - new: https://git.openjdk.org/jdk/pull/23694/files/40af6f13..afd6f633 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23694&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23694&range=01-02 Stats: 94 lines in 6 files changed: 91 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23694.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23694/head:pull/23694 PR: https://git.openjdk.org/jdk/pull/23694 From duke at openjdk.org Tue Feb 25 08:00:10 2025 From: duke at openjdk.org (Marc Chevalier) Date: Tue, 25 Feb 2025 08:00:10 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v4] In-Reply-To: References: Message-ID: > Remove frem and drem macros nodes when the result is not used. These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs. > > Thanks, > Marc Marc Chevalier has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. 
The pull request contains one new commit since the last revision: Improve deletion of [fr]rem after parsing ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23694/files - new: https://git.openjdk.org/jdk/pull/23694/files/afd6f633..8208f4af Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23694&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23694&range=02-03 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23694.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23694/head:pull/23694 PR: https://git.openjdk.org/jdk/pull/23694 From epeter at openjdk.org Tue Feb 25 08:14:05 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 08:14:05 GMT Subject: RFR: 8349139: C2: Div looses dependency on condition that guarantees divisor not null in counted loop [v2] In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 16:57:30 GMT, Roland Westrelin wrote: >> The test crashes because of a division by zero. The `Div` node for >> that one is initially part of a counted loop. The control input of the >> node is cleared because the divisor is non zero. This is because the >> divisor depends on the loop phi and the type of the loop phi is >> narrowed down when the counted loop is created. pre/main/post loops >> are created, unrolling happens, the main loop looses its backedge. The >> `Div` node can then float above the zero trip guard for the main >> loop. When the zero trip guard is not taken, there's no guarantee the >> divisor is non zero so the `Div` node should be pinned below it. >> >> I propose we revert the change I made with 8334724 which removed >> `PhaseIdealLoop::cast_incr_before_loop()`. The `CastII` that this >> method inserted was there to handle exactly this problem. It was added >> initially for a similar issue but with array loads. That problem with >> loads is handled some other way now and that's why I thought it was >> safe to proceed with the removal. 
>> >> The code in this patch is somewhat different from the one we had >> before for a couple reasons: >> >> 1- assert predicate code evolved and so previous logic can't be >> resurrected as it was. >> >> 2- the previous logic has a bug. >> >> Regarding 1-: during pre/main/post loop creation, we used to add the >> `CastII` and then to add assertion predicates (so assertion predicates >> depended on the `CastII`). Then when unrolling, when assertion >> predicates are updated, we would skip over the `CastII`. What I >> propose here is to add the `CastII` after assertion predicates are >> added. As a result, they don't depend on the `CastII` and there's no >> need for any extra logic when unrolling happens. This, however, >> doesn't work when the assertion predicates are added by RCE. In that >> case, I had to add logic to skip over the `CastII` (similar to what >> existed before I removed it). >> >> Regarding 2-: previous implementation for >> `PhaseIdealLoop::cast_incr_before_loop()` would add the `CastII` at >> the first loop `Phi` it encounters that's a use of the loop increment: >> it's usually the iv but not always. I tweaked the test case to show, >> this bug can actually cause a crash and changed the logic for >> `PhaseIdealLoop::cast_incr_before_loop()` accordingly. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - Merge branch 'master' into JDK-8349139 > - fix & test Looks reasonable. I have a few questions / suggestions. I'm running some testing, please ping me again in a day or two ;) src/hotspot/share/opto/loopTransform.cpp line 1703: > 1701: } > 1702: // CastII for the new post loop: > 1703: cast_incr_before_loop(zer_opaq->in(1), zer_taken, post_head); I see it is added for the main and post loop. Why not for the pre loop? 
src/hotspot/share/opto/loopnode.cpp line 6091: > 6089: if (uncast && init->is_CastII()) { > 6090: // skip over the cast added by PhaseIdealLoop::cast_incr_before_loop() when pre/post/main loops are created because > 6091: // it can get in the way of type propagation I think it would be nice if you said more about how it can get in the way of type propagation. Why would we sometimes have `uncast` on and sometimes off? You may even have a quick comment about it at the use-site. test/hotspot/jtreg/compiler/controldependency/TestDivDependentOnMainLoopGuard.java line 50: > 48: i1 <<= i3; > 49: } while (++i2 < 68); > 50: for (i23 = 68; i23 > 2; otherPhi=i23-1, i23--) { This is essencially the test from https://github.com/openjdk/jdk/pull/3190, we only changed the `3` to a `2`. This indicates that we need to probably slightly generalize the test. Can we maybe randomize the constant, just to get a little better coverage? ------------- PR Review: https://git.openjdk.org/jdk/pull/23617#pullrequestreview-2639946128 PR Review Comment: https://git.openjdk.org/jdk/pull/23617#discussion_r1969207426 PR Review Comment: https://git.openjdk.org/jdk/pull/23617#discussion_r1969206376 PR Review Comment: https://git.openjdk.org/jdk/pull/23617#discussion_r1969183742 From epeter at openjdk.org Tue Feb 25 08:14:05 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 08:14:05 GMT Subject: RFR: 8349139: C2: Div looses dependency on condition that guarantees divisor not null in counted loop [v2] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 07:56:36 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains two commits: >> >> - Merge branch 'master' into JDK-8349139 >> - fix & test > > test/hotspot/jtreg/compiler/controldependency/TestDivDependentOnMainLoopGuard.java line 50: > >> 48: i1 <<= i3; >> 49: } while (++i2 < 68); >> 50: for (i23 = 68; i23 > 2; otherPhi=i23-1, i23--) { > > This is essencially the test from https://github.com/openjdk/jdk/pull/3190, we only changed the `3` to a `2`. This indicates that we need to probably slightly generalize the test. Can we maybe randomize the constant, just to get a little better coverage? Because the other test became ineffective, and it would be a shame if the same happened to this one ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23617#discussion_r1969185979 From epeter at openjdk.org Tue Feb 25 08:22:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 08:22:57 GMT Subject: RFR: 8349139: C2: Div looses dependency on condition that guarantees divisor not null in counted loop [v2] In-Reply-To: <52OYoC5__FdcN8OLwVgdNlb6Fz_IFo8UyKy3GUp5DiM=.708f1ee8-dbbb-4abf-8de0-d94b3b1e2ef6@github.com> References: <52OYoC5__FdcN8OLwVgdNlb6Fz_IFo8UyKy3GUp5DiM=.708f1ee8-dbbb-4abf-8de0-d94b3b1e2ef6@github.com> Message-ID: On Fri, 14 Feb 2025 18:24:25 GMT, Quan Anh Mai wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: >> >> - Merge branch 'master' into JDK-8349139 >> - fix & test > > Hmmm, may be you are right. I think adding a comment at `PhiNode` saying that people must not rely on it being pinned at the `Region` for dependencies would be a wise move, I can't think of any reason for that besides value narrowing right now but being pinned is a property of `Phi` regardless and we should tell people not to rely on this behaviour. > > For this bug, I think a more general fix is to try to compare the type of the `Phi` with that of the input it is going to be replaced with. 
If the former is not wider than the latter then we add a `CastNode`, since the cast is only about value range, not strict dependency, we can use `CarryDependency` instead of `UnconditionalDependency`. Am I right? Ah, I only just now read the comments from @merykitty and you. Oops. Hmm. Yes it seems that the `CountedLoop` trip `phi` is special. That's maybe not great to have such implicit assumptions laying around. But not sure what would have been the better alternative. @rwestrel > It reproduces this issue and is actually a better test case because it doesn't even need StressGCM: Are you saying this is another test? I'd really be happy if we had more tests for this case, because the current version seems fragile, since it is an almost perfect copy of a previous test that became ineffective this makes me even more nervous ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23617#issuecomment-2681081313 From duke at openjdk.org Tue Feb 25 09:03:07 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Tue, 25 Feb 2025 09:03:07 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v2] In-Reply-To: References: Message-ID: On Thu, 25 Jan 2024 14:47:47 GMT, Yuri Gaevsky wrote: >> The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. > > Yuri Gaevsky has updated the pull request incrementally with two additional commits since the last revision: > > - num_8b_elems_in_vec --> nof_vec_elems > - Removed checks for (MaxVectorSize >= 16) per @RealFYang suggestion. . 
------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-2681199754 From epeter at openjdk.org Tue Feb 25 09:27:13 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 09:27:13 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: Message-ID: > Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. > > **Background** > > With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. > > **Problem** > > So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. > > > MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); > MemorySegment nativeUnaligned = nativeAligned.asSlice(1); > test3(nativeUnaligned); > > > When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! > > static void test3(MemorySegment ms) { > for (int i = 0; i < RANGE; i++) { > long adr = i * 4L; > int v = ms.get(ELEMENT_LAYOUT, adr); > ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); > } > } > > > **Solution: Runtime Checks - Predicate and Multiversioning** > > Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. 
We cannot statically determine if the `base` is aligned, we need a runtime check. > > I came up with 2 options where to place the runtime checks: > - A new "auto vectorization" Parse Predicate: > - This only works when predicates are available. > - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. > - Multiversion the loop: > - Create 2 copies of the loop (fast and slow loops). > - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take > - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code. > - We "stall" the `... Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 66 commits: - Merge branch 'master' into JDK-8323582-SW-native-alignment - stall -> delay, plus some more comments - adjust selector if probability - Merge branch 'master' into JDK-8323582-SW-native-alignment - remove multiversion mark if we break the structure - register opaque with igvn - copyright and rm CFG check - IR rules for all cases - 3 test versions - test changed to unaligned ints - ... 
and 56 more: https://git.openjdk.org/jdk/compare/d551daca...8eb52292 ------------- Changes: https://git.openjdk.org/jdk/pull/22016/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22016&range=03 Stats: 1089 lines in 27 files changed: 966 ins; 28 del; 95 mod Patch: https://git.openjdk.org/jdk/pull/22016.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22016/head:pull/22016 PR: https://git.openjdk.org/jdk/pull/22016 From epeter at openjdk.org Tue Feb 25 09:36:58 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 09:36:58 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> References: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> Message-ID: On Tue, 25 Feb 2025 00:34:14 GMT, Vladimir Kozlov wrote: >> @vnkozlov I mean the issue this: once I implement aliasing-analysis runtime-checks with this multiversion approach, then we'd get regressions if we do not optimize the slow path loop. Currently, we would not vectorize (because we have to be ready for aliasing cases), but we at least unroll, and whatever else we can except vectorization. But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. I think we need to avoid that - would you agree? > >> But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. > > Okay, we are back to our previous conversation - we will wait your aliasing-analysis runtime-checks implementation and do performance runs to see if "slow" path affects performance. > > Okay. > > PS: "slow" path implies that it is not taking frequently and it should not affect general performance of application. 
@vnkozlov @rwestrel - I did the `stall` -> `delay` renaming, and added some more comments in places you asked for it. Let me know if that looks better. - Filed: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if - I added a comment to [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751) C2 SuperWord: Aliasing Analysis runtime check, to check performance around slow_loop. Let me know what more I can do ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2681315131 From duke at openjdk.org Tue Feb 25 10:16:03 2025 From: duke at openjdk.org (Marc Chevalier) Date: Tue, 25 Feb 2025 10:16:03 GMT Subject: RFR: 8347426: Invalid value used for enum Cell in iTypeFlow::StateVector::meet_exception Message-ID: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com> As guess on the JBS ticket, we have a UB when `_outer->max_locals() == 0`, because then we try to do `(Cell)(-1)` which is out of range since Cell's range is [0, `INT_MAX`]. The obvious fix that is Cell limit = local(_outer->max_locals()); for (Cell c = start_cell(); c < limit; c = next_cell(c)) { since `local` asserts its argument to be in [0, `outer->max_locals()`). Of course Cell limit = (Cell)(_outer->max_locals()); would work, but it seems to break (the very light) abstraction. I've also added an assert to transform the UB into a clear failure. This fix makes the UB warning go away on Mac with arm64. 
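The out-of-range problem described above can be illustrated with a small standalone sketch. It assumes an enum shaped like `Cell` (no fixed underlying type, values in [0, `INT_MAX`]); `make_cell` and `visit_locals` are hypothetical helpers for illustration and are not the HotSpot code or the actual patch.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Stand-in for the enum under discussion: without a fixed underlying type,
// only values in [0, INT_MAX] are in range, so converting -1 is undefined.
enum Cell { Cell_0 = 0, Cell_max = INT_MAX };

// Hypothetical checked conversion: turns the silent UB into a clear failure.
inline Cell make_cell(int i) {
  assert(i >= 0 && "negative value is out of range for Cell");
  return static_cast<Cell>(i);
}

inline Cell next_cell(Cell c) {
  return make_cell(static_cast<int>(c) + 1);
}

// Visits the first max_locals cells using a one-past-the-end limit.
// An inclusive limit of (max_locals - 1) would need Cell(-1) when
// max_locals == 0; the half-open form never builds a negative Cell.
std::vector<int> visit_locals(int max_locals) {
  std::vector<int> visited;
  const Cell limit = make_cell(max_locals);
  for (Cell c = Cell_0; c < limit; c = next_cell(c)) {
    visited.push_back(static_cast<int>(c));
  }
  return visited;
}
```

With `max_locals == 0` the loop body simply never runs, and the limit stays a valid enum value.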
Thanks, Marc ------------- Commit messages: - Add a fix and an assert Changes: https://git.openjdk.org/jdk/pull/23772/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23772&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8347426 Stats: 12 lines in 2 files changed: 3 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23772.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23772/head:pull/23772 PR: https://git.openjdk.org/jdk/pull/23772 From jpai at openjdk.org Tue Feb 25 10:31:51 2025 From: jpai at openjdk.org (Jaikiran Pai) Date: Tue, 25 Feb 2025 10:31:51 GMT Subject: RFR: 8350614: [JMH] jdk.incubator.vector.VectorCommutativeOperSharingBenchmark fails In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 02:24:33 GMT, SendaoYan wrote: > Hi all, > > The newly added JMH test jdk.incubator.vector.VectorCommutativeOperSharingBenchmark fails with "java.lang.NoClassDefFoundError: jdk/incubator/vector/Vector". > > The `@Fork(jvmArgsPrepend = ..)` in microbenchmarks should be replaced with `@Fork(jvmArgs = ..)` after [JDK-8343345](https://bugs.openjdk.org/browse/JDK-8343345). The change has been verified locally; test-fix only, no risk. The micro benchmark test that this PR fixes was introduced just yesterday through https://github.com/openjdk/jdk/pull/22863. I'll remove the core-libs label from this PR and add hotspot-compiler, given the original change which introduced this test.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23761#issuecomment-2681491897 From thartmann at openjdk.org Tue Feb 25 11:02:00 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 25 Feb 2025 11:02:00 GMT Subject: RFR: 8348367: Remove hotspot_not_fast_compiler and hotspot_slow_compiler test groups [v2] In-Reply-To: References: Message-ID: <04Jcai2N6cGOekc1VTLcgI_AClLxwpdZ-o0k2iMfSoA=.f522ed25-9bdb-44cd-bf55-4dcc7894efe0@github.com> On Thu, 23 Jan 2025 06:22:20 GMT, Leonid Mesnik wrote: >> Test groups >> hotspot_not_fast_compiler and hotspot_slow_compiler test >> were used by Oracle. However, now tier2/tier3 are used instead. >> >> So fix remove groups. and add exclusion of "slow" tests from tier1 groups. The content remains the same except >> -compiler/memoryinitialization/ZeroTLABTest.java >> int tier1_compiler_3. >> >> >> The tier1/2/3 are organized similar to tiers in other groups where >> tier1 = set of groups >> - some long tests/groups >> tier2 = set of groups >> - some long tests/groups >> - tier1 >> tier3 = >> hotspot_compiler >> - tier1 >> - tier2 >> >> >> The test group >> compiler/codegen/ \ >> was in the tier1 and tier2, so should be removed it from tier2, not tests are affected. > > Leonid Mesnik has updated the pull request incrementally with one additional commit since the last revision: > > arraycopy reverted Looks reasonable to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23253#pullrequestreview-2640628758 From mli at openjdk.org Tue Feb 25 11:16:59 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 25 Feb 2025 11:16:59 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v5] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 02:47:49 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> clean 2 > > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1480: > >> 1478: void C2_MacroAssembler::string_compare_long_different_encoding(Register result, Register str1, Register str2, >> 1479: bool isLU, Register cnt1, Register cnt2, >> 1480: Register tmpL, Register tmpU, Register tmp3, > > There is a naming inconsistency for the params. The prototype in the header file says `Register tmp1, Register tmp2, Register tmp3,`. I think we can still use that naming here and create aliases like `tmpL` and `tmpU` as needed in this routine. Like: `Register tmpL = tmp1; Register tmpU = tmp2;` I'll modify the names in the header file, there is no need to introduce the renaming indirection. > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1596: > >> 1594: // Load 4 bytes once to compare for alignment before main loop. Note that this >> 1595: // is only possible for LL/UU case. We need to resort to load_long_misaligned >> 1596: // for both LU and UL cases. > > This code comment needs to be moved to the proper place. I'll just remove it, seems not necessary anymore. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1969559909 PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1969559808 From mli at openjdk.org Tue Feb 25 11:28:36 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 25 Feb 2025 11:28:36 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v6] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch? 
> > Currently, the `string_compare` code is a bit complicated. The main reasons are: > 1. It has 2 pieces of code, one each for the LU and UL cases. This is not necessary, as LU and UL behave quite similarly. > 2. It mixes the LL/UU and LU/UL cases together. It is better to separate them, as they are quite different from each other. > > This makes the code hard to read and maintain. > > > So, this patch does the following refactoring: > 1. Merge the LU and UL code into one, i.e. remove the UL code. > 2. Separate the code into 2 methods: LL/UU and LU/UL. > 3. Some other miscellaneous improvements. > > I could do the following refactoring in a follow-up PR, as in this PR I'm just moving and removing code, which is easier to do and to review. This applies in particular to the first item, as it needs to rewrite the existing code for the UL/LU case. > 1. Move the alignment code of `generate_compare_long_string_different_encoding` upwards to `string_compare_long_LU`. > 2. Make the `SHORT_STRING` case simpler. > > > > Thanks Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: clean 3 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23633/files - new: https://git.openjdk.org/jdk/pull/23633/files/d67a23e3..e9c1807e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23633&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23633&range=04-05 Stats: 45 lines in 3 files changed: 13 ins; 24 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/23633.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23633/head:pull/23633 PR: https://git.openjdk.org/jdk/pull/23633 From rcastanedalo at openjdk.org Tue Feb 25 11:39:55 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 25 Feb 2025 11:39:55 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v2] In-Reply-To: References:
<2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> Message-ID: On Wed, 19 Feb 2025 15:23:11 GMT, Daniel Lundén wrote: >> When searching for load anti-dependences in GCM, the memory state for the load is sometimes represented not only by the memory node input of the load, but also other memory nodes. Because PhaseCFG::insert_anti_dependences searches for anti-dependences only from the load's memory input, it is, therefore, possible to sometimes overlook anti-dependences. The result is that loads are potentially scheduled too late, after stores that redefine the memory states of the loads. >> >> ### Changeset >> >> It is not yet clear why multiple nodes sometimes represent the memory state of a load, nor if this is expected. We can, however, resolve all the miscompiled test cases seen in this issue by improving the idealization of Phi nodes. Specifically, there is an idealization where we split Phis through input MergeMems, that we, prior to this changeset, applied too conservatively. >> >> To illustrate the idealization and how it resolves this issue, consider the example below. >> >> ![failure-graph-1](https://github.com/user-attachments/assets/ecbd204f-bdf0-49cb-a62e-8081d08cfe0c) >> >> `64 membar_release` is a critical anti-dependence for `183 loadI`. The anti-dependence search starts at the load's direct memory input, `107 Phi`, and stops immediately at Phis. Therefore, the search ends at `106 Phi` and we never find `64 membar_release`. >> >> We can apply the split-through-MergeMem Phi idealization to `119 Phi`. This idealization pushes `119 Phi` through `120 MergeMem` and `121 MergeMem`, splitting it into the individual inputs of the MergeMems in the process. As a result, we replace `119 Phi` with two new Phis. One of these generated Phis has identical inputs to `107 Phi` (`106 Phi` and `104 Phi`), and further idealizations will merge this new Phi and `107 Phi`.
As a result, `107 Phi` then has a Phi-free path to `64 membar_release` and we correctly discover the anti-dependence. >> >> The changeset consists of the following changes. >> - Add an analysis that allows applying the split-through-MergeMem idealization in more cases than before (including in the above example) while still ensuring termination. >> - Add a missing `ResourceMark` in `PhiNode::split_out_instance`. >> - Add multiple new regression tests in `TestGCMLoadPlacement.java`. >> >> For reference, [here](https://github.com/openjdk/jdk/pull/22852) is a previous PR with an alternative fix that we decided to discard in favor of the fix in this PR. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/ac... > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Remove test that no longer reproduces the issue Looks good, thanks for thoroughly exploring alternative solutions! I agree that this solution seems preferable to extending the antidependence analysis in the back-end. As we discussed earlier, relying on graph idealizations for correctness is not ideal (no pun intended) but probably pervasive in C2 anyway. I found the fact that `merge_width` is used both as a counter and as a predicate to enable/disable the transformation slightly confusing, consider decoupling these uses for clarity. Here is a sketch of my proposed refactoring, feel free to merge (or not) and edit to your liking: https://github.com/robcasloz/jdk/commit/8fdb4ba3cfff955221a16b28b29e0726f339cea5. I have not tested it, please double-check that it preserves the behavior of your current changeset. src/hotspot/share/opto/cfgnode.cpp line 1120: > 1118: } > 1119: Compile *C = igvn->C; > 1120: ResourceMark rm; I would suggest leaving this change to a separate (possible starter) RFE.
src/hotspot/share/opto/cfgnode.cpp line 2437: > 2435: worklist.push(this); > 2436: visited.set(this->_idx); > 2437: auto add_to_worklist = [&](Node* input) { Suggestion: rename to `maybe_add_to_worklist` to stress that it is a conditional addition. ------------- Marked as reviewed by rcastanedalo (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23691#pullrequestreview-2640681084 PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1969570798 PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1969572538 From fyang at openjdk.org Tue Feb 25 12:01:53 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 25 Feb 2025 12:01:53 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v6] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 11:28:36 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch? >> >> Currently, `string_compare` code is a bit complicated, main reasons include: >> 1. it has 2 piece of code respectively for LU and UL case, this is not necessary, basically LU and UL behaviour quite similar. >> 2. it mixed LL/UU and LU/UL case together, better to separate them, as they are quite different from each other. >> >> This is not good for code reading and maintaining. >> >> >> So, this patch does following refactoring: >> 1. merge LU and UL code into one, i.e. remove UL code. >> 2. seperate the code into 2 methods: LL/UU and LU/UL. >> 3. some other misc improvement. >> >> I could do the following refactoring in another following pr, as in this pr I'm just moving code and removing code, it's easier to do it and review it. In particular the first one, as it needs to rewrite the existing code for UL/LU case. >> 1. move alignment code of `generate_compare_long_string_different_encoding` upwards to `string_compare_long_LU`. >> 2. make `SHORT_STRING` case simpler. 
>> >> >> >> Thanks > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > clean 3 src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1408: > 1406: addi(str1, str1, 4); > 1407: addi(str2, str2, 4); > 1408: subi(cnt2, cnt2, minCharsInWord / 2); Ah, I just realized that the check at [1] is discarded. Suppose this is LL case and input `cnt2` is 9, `cnt2` will be 5 after this `subi` instruction. This means the remaining number of latin chars is 5, so the two 8-byte loads at L1421 and L1422 will exceed the boundary. Maybe we should also consider moving the check at [2] to a proper place? [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1445 [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1457 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1969624989 From roland at openjdk.org Tue Feb 25 12:28:57 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 25 Feb 2025 12:28:57 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v9] In-Reply-To: References: <4RO3ysBh6pWId8Na0pTdO9X5sBvCh2F5l-KO3OdHF4k=.31a599cc-195f-4330-a4ce-0618209635de@github.com> Message-ID: <3Y2_P27vCJLrsTslSkZtoSFkuLi1dOjHP-CSMysqdFk=.d8f55e46-8eb0-4157-ae24-a52e3d71cb68@github.com> On Mon, 24 Feb 2025 15:45:58 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 15 additional commits since the last revision: >> >> - review >> - review >> - review >> - Merge branch 'master' into JDK-8349361 >> - Update src/hotspot/share/opto/mulnode.cpp >> >> Co-authored-by: Emanuel Peter >> - Update src/hotspot/share/opto/mulnode.cpp >> >> Co-authored-by: Emanuel Peter >> - review >> - Update src/hotspot/share/opto/mulnode.hpp >> >> Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> >> - Update src/hotspot/share/opto/mulnode.cpp >> >> Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> >> - Update src/hotspot/share/opto/mulnode.cpp >> >> Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> >> - ... and 5 more: https://git.openjdk.org/jdk/compare/81c4d62f...5b05d222 > > test/hotspot/jtreg/compiler/c2/irTests/RShiftLNodeIdealizationTests.java line 145: > >> 143: public long test9(long x) { >> 144: x = Integer.max(Integer.min((int)x, (int)test7Max), (int)(test7Min-1)); >> 145: return ((x << test7Shift) >> test7Shift); > > It could be nice to have some test cases where both shift values are completely randomized. The transformation only happens if the amounts we shift left and right are the same. So if they are random, the transformation won't apply most of the time and, rarely, it will (because they will turn out to be the same). I'm not sure how to write an IR test then. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1969662561 From epeter at openjdk.org Tue Feb 25 12:28:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 12:28:57 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v9] In-Reply-To: <3Y2_P27vCJLrsTslSkZtoSFkuLi1dOjHP-CSMysqdFk=.d8f55e46-8eb0-4157-ae24-a52e3d71cb68@github.com> References: <4RO3ysBh6pWId8Na0pTdO9X5sBvCh2F5l-KO3OdHF4k=.31a599cc-195f-4330-a4ce-0618209635de@github.com> <3Y2_P27vCJLrsTslSkZtoSFkuLi1dOjHP-CSMysqdFk=.d8f55e46-8eb0-4157-ae24-a52e3d71cb68@github.com> Message-ID: <7D7P522wI5bsOeRi7aSh8QXNAlhqmlCi0-RBMl4Khpo=.bfc05889-fc07-4b5d-b68a-4c8c31f9e9ae@github.com> On Tue, 25 Feb 2025 12:23:58 GMT, Roland Westrelin wrote: >> test/hotspot/jtreg/compiler/c2/irTests/RShiftLNodeIdealizationTests.java line 145: >> >>> 143: public long test9(long x) { >>> 144: x = Integer.max(Integer.min((int)x, (int)test7Max), (int)(test7Min-1)); >>> 145: return ((x << test7Shift) >> test7Shift); >> >> It could be nice to have some test cases where both shift values are completely randomized. > > The transformation only happens if the amounts we shift left and right are the same. So if they are random, the transformation won't apply most of the time and, rarely, it will (because they will turn out to be the same). I'm not sure how to write an IR test then. You would not have to assert anything about the IR, just do value verification in that case. 
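The value verification suggested here could look roughly like the following standalone sketch. It does not use the IR test framework, and the class name, method names, seed and iteration count are all made up for illustration. It checks `(x << left) >> right` with fully random shift amounts against an independent `BigInteger` reference, without asserting anything about the generated IR.

```java
import java.math.BigInteger;
import java.util.Random;

// Hypothetical standalone sketch, not the actual jtreg test.
public class RandomShiftCheck {
    // Shape under test: a left shift followed by an arithmetic right shift.
    static long shiftPair(long x, int left, int right) {
        return (x << left) >> right;
    }

    // Independent reference. Java uses only the low 6 bits of a long shift
    // count, so the counts are masked the same way before using BigInteger.
    static long reference(long x, int left, int right) {
        // longValue() keeps only the low-order 64 bits, like a long shift.
        long shifted = BigInteger.valueOf(x).shiftLeft(left & 63).longValue();
        // BigInteger.shiftRight sign-extends, like the >> operator.
        return BigInteger.valueOf(shifted).shiftRight(right & 63).longValue();
    }

    // Counts inputs for which the two computations disagree.
    static int countMismatches(long seed, int iterations) {
        Random random = new Random(seed);
        int mismatches = 0;
        for (int i = 0; i < iterations; i++) {
            long x = random.nextLong();
            int left = random.nextInt();
            int right = random.nextInt();
            if (shiftPair(x, left, right) != reference(x, left, right)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches(42L, 100_000));
    }
}
```

With random amounts the left and right shifts are rarely equal, so this mostly exercises the paths where the transformation must not apply, which is the point of pure value verification.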
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1969666205 From duke at openjdk.org Tue Feb 25 12:50:10 2025 From: duke at openjdk.org (Marc Chevalier) Date: Tue, 25 Feb 2025 12:50:10 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants [v2] In-Reply-To: References: Message-ID: > This collapses double shift lefts by constants in a single constant: (x << con1) << con2 => x << (con1 + con2). Care must be taken in the case con1 + con2 is bigger than the number of bits in the integer type. In this case, we must simplify to 0. > > Moreover, the simplification logic of the sign extension trick had to be improved. For instance, we use `(x << 16) >> 16` to convert a 32 bits into a 16 bits integer, with sign extension. When storing this into a 16-bit field, this can be simplified into simple `x`. But in the case where `x` is itself a left-shift expression, say `y << 3`, this PR makes the IR looks like `(y << 19) >> 16` instead of the old `((y << 3) << 16) >> 16`. The former logic didn't handle the case where the left and the right shift have different magnitude. In this PR, I generalize this simplification to cases where the left shift has a larger magnitude than the right shift. This improvement was needed not to miss vectorization opportunities: without the simplification, we have a left shift and a right shift instead of a single left shift, which confuses the type inference. > > This also works for multiplications by powers of 2 since they are already translated into shifts. 
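The overflow caveat and the generalized sign-extension case described above can be checked with plain Java arithmetic. This is only a sketch of the arithmetic, not C2's implementation; the class and method names are hypothetical, and the shift constants are assumed to be already masked to the range 0 to 31. Java masks an `int` shift count to its low 5 bits, so `x << 33` behaves like `x << 1`, whereas `(x << 30) << 3` shifts every bit out.

```java
// Hypothetical sketch of the arithmetic only, not the C2 code.
public class ShiftCollapseSketch {
    // Two nested left shifts by constants, as written in the source program.
    static int nested(int x, int con1, int con2) {
        return (x << con1) << con2;
    }

    // Collapsed form. Assumes con1 and con2 are already in [0, 31].
    // If the sum reaches the bit width, the result must be 0.
    static int collapsed(int x, int con1, int con2) {
        int sum = con1 + con2;
        if (sum >= Integer.SIZE) {
            return 0;
        }
        return x << sum;
    }

    // Naive collapse without the width check: wrong when con1 + con2 >= 32,
    // because Java masks the shift count to its low 5 bits.
    static int naive(int x, int con1, int con2) {
        return x << (con1 + con2);
    }

    // Storing (y << 19) >> 16 into a 16-bit field keeps the same bits as
    // storing y << 3, so the shift pair can be reduced to one left shift.
    static boolean storeMatches(int y) {
        return (short) ((y << 19) >> 16) == (short) (y << 3);
    }

    public static void main(String[] args) {
        System.out.println(nested(1, 30, 3));    // prints 0
        System.out.println(collapsed(1, 30, 3)); // prints 0
        System.out.println(naive(1, 30, 3));     // prints 2, i.e. 1 << (33 & 31)
        System.out.println(storeMatches(0x12345678));
    }
}
```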
> > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with three additional commits since the last revision: - Remove useless local, with especially helpful name - rename - Add test suggested by @dean-long exhibiting the difference between (x << 30) << 3 and x << 33 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23728/files - new: https://git.openjdk.org/jdk/pull/23728/files/bb867b67..7e331349 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23728&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23728&range=00-01 Stats: 77 lines in 3 files changed: 70 ins; 1 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/23728.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23728/head:pull/23728 PR: https://git.openjdk.org/jdk/pull/23728 From duke at openjdk.org Tue Feb 25 12:50:11 2025 From: duke at openjdk.org (Marc Chevalier) Date: Tue, 25 Feb 2025 12:50:11 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants [v2] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 18:53:28 GMT, Jasmine Karthikeyan wrote: >> Marc Chevalier has updated the pull request incrementally with three additional commits since the last revision: >> >> - Remove useless local, with especially helpful name >> - rename >> - Add test suggested by @dean-long exhibiting the difference between (x << 30) << 3 and x << 33 > > src/hotspot/share/opto/mulnode.cpp line 981: > >> 979: // con0 is assumed to be masked already (as computed by maskShiftAmount) and non-zero >> 980: // bt must be T_LONG or T_INT. >> 981: static Node* collapseDoubleShiftLeft(PhaseGVN* phase, Node* outer_shift, int con0, BasicType bt) { > > From the style guide, functions and local variables are named with `snake_case`. Maybe it could be named `collapse_left_shifts`. Done. I've renamed to `collapse_nested_shift_left`, it looks more explicit to me. Feel free to tell me if you think there is better. 
I was inspired by the functions above that are in camelCase, unfortunately. > src/hotspot/share/opto/mulnode.cpp line 986: > >> 984: Node* inner_shift = outer_shift->in(1); >> 985: int inner_shift_op = inner_shift->Opcode(); >> 986: if (inner_shift_op != Op_LShift(bt)) { > > Since the local variable is otherwise unused, it'd be simpler to do: > Suggestion: > > if (inner_shift->Opcode() != Op_LShift(bt)) { And the variable name doesn't bring much. Done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23728#discussion_r1969710736 PR Review Comment: https://git.openjdk.org/jdk/pull/23728#discussion_r1969714492 From duke at openjdk.org Tue Feb 25 12:53:07 2025 From: duke at openjdk.org (Marc Chevalier) Date: Tue, 25 Feb 2025 12:53:07 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants [v3] In-Reply-To: References: Message-ID: > This collapses double shift lefts by constants in a single constant: (x << con1) << con2 => x << (con1 + con2). Care must be taken in the case con1 + con2 is bigger than the number of bits in the integer type. In this case, we must simplify to 0. > > Moreover, the simplification logic of the sign extension trick had to be improved. For instance, we use `(x << 16) >> 16` to convert a 32 bits into a 16 bits integer, with sign extension. When storing this into a 16-bit field, this can be simplified into simple `x`. But in the case where `x` is itself a left-shift expression, say `y << 3`, this PR makes the IR looks like `(y << 19) >> 16` instead of the old `((y << 3) << 16) >> 16`. The former logic didn't handle the case where the left and the right shift have different magnitude. In this PR, I generalize this simplification to cases where the left shift has a larger magnitude than the right shift. 
This improvement was needed not to miss vectorization opportunities: without the simplification, we have a left shift and a right shift instead of a single left shift, which confuses the type inference. > > This also works for multiplications by powers of 2 since they are already translated into shifts. > > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Fix style in the few lines I haven't touched yet ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23728/files - new: https://git.openjdk.org/jdk/pull/23728/files/7e331349..84d18dfe Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23728&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23728&range=01-02 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23728.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23728/head:pull/23728 PR: https://git.openjdk.org/jdk/pull/23728 From duke at openjdk.org Tue Feb 25 12:59:21 2025 From: duke at openjdk.org (Marc Chevalier) Date: Tue, 25 Feb 2025 12:59:21 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants [v4] In-Reply-To: References: Message-ID: > This collapses double shift lefts by constants in a single constant: (x << con1) << con2 => x << (con1 + con2). Care must be taken in the case con1 + con2 is bigger than the number of bits in the integer type. In this case, we must simplify to 0. > > Moreover, the simplification logic of the sign extension trick had to be improved. For instance, we use `(x << 16) >> 16` to convert a 32 bits into a 16 bits integer, with sign extension. When storing this into a 16-bit field, this can be simplified into simple `x`. But in the case where `x` is itself a left-shift expression, say `y << 3`, this PR makes the IR looks like `(y << 19) >> 16` instead of the old `((y << 3) << 16) >> 16`. 
The former logic didn't handle the case where the left and the right shift have different magnitudes. In this PR, I generalize this simplification to cases where the left shift has a larger magnitude than the right shift. This improvement was needed to avoid missing vectorization opportunities: without the simplification, we have a left shift and a right shift instead of a single left shift, which confuses the type inference. > > This also works for multiplications by powers of 2 since they are already translated into shifts. > > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23728/files - new: https://git.openjdk.org/jdk/pull/23728/files/84d18dfe..c7920903 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23728&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23728&range=02-03 Stats: 8 lines in 1 file changed: 6 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23728.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23728/head:pull/23728 PR: https://git.openjdk.org/jdk/pull/23728 From duke at openjdk.org Tue Feb 25 12:59:21 2025 From: duke at openjdk.org (Marc Chevalier) Date: Tue, 25 Feb 2025 12:59:21 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants [v4] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 18:41:21 GMT, Jasmine Karthikeyan wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> comment > > src/hotspot/share/opto/mulnode.cpp line 1018: > >> 1016: // constant, flatten the tree: (X+con1)<<con0 ==> X<<con0 + con1<<con0 >> 1017: // >> 1018: // (X << con1) << con2 ==> X << (con1 + con2) (see collapseDoubleShiftLeft for details) > > I think it would be better to move this comment where `collapseDoubleShiftLeft` is called in the Ideal function, and same for `LShiftLNode::Ideal`.
I think I can repeat the comment, but I think it should also stay here since otherwise, the comment at the head of the method is incomplete: it says it performs `(X+con1)<<con0 ==> X<<con0 + con1<<con0` References: Message-ID: <7tc3Q6VBD2QGu2tstDrVGICIMzofeN0docMxH9bVblQ=.b6453da8-ad4b-40a4-9f72-0d48a11d5d96@github.com> On Fri, 21 Feb 2025 18:39:23 GMT, Jasmine Karthikeyan wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> comment > > src/hotspot/share/opto/mulnode.cpp line 996: > >> 994: >> 995: if (con0 + con1 >= nbits) { >> 996: return ConNode::make(TypeInteger::zero(bt)); > > It'd be clearer to do this, which is equivalent but more concise: > Suggestion: > > return phase->zerocon(bt); Actually, this is not equivalent, and it is incorrect. I made this exact mistake in an earlier version. The problem is that `zerocon` caches the nodes: https://github.com/openjdk/jdk/blob/8cfebc41dc8ec7b0d24d9c467b91de82d28b73fc/src/hotspot/share/opto/phaseX.cpp#L654-L656 So we would likely (or at least might) return an existing node, which is not legal: `Ideal` is only allowed to return `this`, `nullptr` or a new node. But yes, it's unfortunate, because it would be much lighter to read. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23728#discussion_r1969744526 From mli at openjdk.org Tue Feb 25 13:28:54 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 25 Feb 2025 13:28:54 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v6] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 11:58:35 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> clean 3 > > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1408: > >> 1406: addi(str1, str1, 4); >> 1407: addi(str2, str2, 4); >> 1408: subi(cnt2, cnt2, minCharsInWord / 2); > > Ah, I just realized that the check at [1] is discarded.
Suppose this is LL case and input `cnt2` is 9, `cnt2` will be 5 after this `subi` instruction. This means the remaining number of latin chars is 5, so the two 8-byte loads at L1421 and L1422 will exceed the boundary. Maybe we should also consider moving the check at [2] to a proper place? > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1445 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1457 Previously, we had a `TAIL` label at [1]. It seems that when execution got there, it could also exceed the boundary of the string? But it seems to me that exceeding the string boundary should be fine, as we cannot exceed the page boundary this way. [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1550 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1969782048 From mdoerr at openjdk.org Tue Feb 25 14:22:52 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 25 Feb 2025 14:22:52 GMT Subject: RFR: 8350585: InlineSecondarySupersTest must be guarded on ppc64 by COMPILER2 In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 18:31:01 GMT, Matthias Baesken wrote: > In the minimal build we run into this error (e.g. on AIX) : > > src/hotspot/cpu/ppc/stubGenerator_ppc.cpp:4894:12: error: use of undeclared identifier 'InlineSecondarySupersTest' > if (!InlineSecondarySupersTest) { > > > The reason is the missing `COMPILER2` define check in the ppc64 coding. Looks good and trivial. ------------- Marked as reviewed by mdoerr (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23753#pullrequestreview-2641240025 From mbaesken at openjdk.org Tue Feb 25 14:41:03 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 25 Feb 2025 14:41:03 GMT Subject: RFR: 8350585: InlineSecondarySupersTest must be guarded on ppc64 by COMPILER2 In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 18:31:01 GMT, Matthias Baesken wrote: > In the minimal build we run into this error (e.g. on AIX) : > > src/hotspot/cpu/ppc/stubGenerator_ppc.cpp:4894:12: error: use of undeclared identifier 'InlineSecondarySupersTest' > if (!InlineSecondarySupersTest) { > > > The reason is the missing `COMPILER2` define check in the ppc64 coding. Thanks for the reviews ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23753#issuecomment-2682212662 From mbaesken at openjdk.org Tue Feb 25 14:41:03 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 25 Feb 2025 14:41:03 GMT Subject: Integrated: 8350585: InlineSecondarySupersTest must be guarded on ppc64 by COMPILER2 In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 18:31:01 GMT, Matthias Baesken wrote: > In the minimal build we run into this error (e.g. on AIX) : > > src/hotspot/cpu/ppc/stubGenerator_ppc.cpp:4894:12: error: use of undeclared identifier 'InlineSecondarySupersTest' > if (!InlineSecondarySupersTest) { > > > The reason is the missing `COMPILER2` define check in the ppc64 coding. This pull request has now been integrated. 
Changeset: b17c0b63 Author: Matthias Baesken URL: https://git.openjdk.org/jdk/commit/b17c0b63a15246967f7cb24ba6089f2ef13e900e Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod 8350585: InlineSecondarySupersTest must be guarded on ppc64 by COMPILER2 Reviewed-by: amitkumar, mdoerr ------------- PR: https://git.openjdk.org/jdk/pull/23753 From dlunden at openjdk.org Tue Feb 25 14:54:36 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 25 Feb 2025 14:54:36 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v3] In-Reply-To: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> References: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> Message-ID: <3cT-dlKaALs4JLaFhjD3l8VOjDBYB1OLsy9tSwhPG1Q=.ff725e97-aafe-427b-b6ad-d6b7fe76648b@github.com> > When searching for load anti-dependences in GCM, the memory state for the load is sometimes represented not only by the memory node input of the load, but also other memory nodes. Because PhaseCFG::insert_anti_dependences searches for anti-dependences only from the load's memory input, it is, therefore, possible to sometimes overlook anti-dependences. The result is that loads are potentially scheduled too late, after stores that redefine the memory states of the loads. > > ### Changeset > > It is not yet clear why multiple nodes sometimes represent the memory state of a load, nor if this is expected. We can, however, resolve all the miscompiled test cases seen in this issue by improving the idealization of Phi nodes. Specifically, there is an idealization where we split Phis through input MergeMems, that we, prior to this changeset, applied too conservatively. > > To illustrate the idealization and how it resolves this issue, consider the example below. 
> > ![failure-graph-1](https://github.com/user-attachments/assets/ecbd204f-bdf0-49cb-a62e-8081d08cfe0c) > > `64 membar_release` is a critical anti-dependence for `183 loadI`. The anti-dependence search starts at the load's direct memory input, `107 Phi`, and stops immediately at Phis. Therefore, the search ends at `106 Phi` and we never find `64 membar_release`. > > We can apply the split-through-MergeMem Phi idealization to `119 Phi`. This idealization pushes `119 Phi` through `120 MergeMem` and `121 MergeMem`, splitting it into the individual inputs of the MergeMems in the process. As a result, we replace `119 Phi` with two new Phis. One of these generated Phis has identical inputs to `107 Phi` (`106 Phi` and `104 Phi`), and further idealizations will merge this new Phi and `107 Phi`. As a result, `107 Phi` then has a Phi-free path to `64 membar_release` and we correctly discover the anti-dependence. > > The changeset consists of the following changes. > - Add an analysis that allows applying the split-through-MergeMem idealization in more cases than before (including in the above example) while still ensuring termination. > - Add a missing `ResourceMark` in `PhiNode::split_out_instance`. > - Add multiple new regression tests in `TestGCMLoadPlacement.java`. > > For reference, [here](https://github.com/openjdk/jdk/pull/22852) is a previous PR with an alternative fix that we decided to discard in favor of the fix in this PR. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/13394882532) > - `tier1` to `tier4` (an... 
Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: Update after review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23691/files - new: https://git.openjdk.org/jdk/pull/23691/files/7f702a68..c9621333 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23691&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23691&range=01-02 Stats: 21 lines in 1 file changed: 3 ins; 6 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/23691.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23691/head:pull/23691 PR: https://git.openjdk.org/jdk/pull/23691 From dlunden at openjdk.org Tue Feb 25 14:54:36 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 25 Feb 2025 14:54:36 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v2] In-Reply-To: References: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> Message-ID: On Wed, 19 Feb 2025 15:23:11 GMT, Daniel Lund?n wrote: >> When searching for load anti-dependences in GCM, the memory state for the load is sometimes represented not only by the memory node input of the load, but also other memory nodes. Because PhaseCFG::insert_anti_dependences searches for anti-dependences only from the load's memory input, it is, therefore, possible to sometimes overlook anti-dependences. The result is that loads are potentially scheduled too late, after stores that redefine the memory states of the loads. >> >> ### Changeset >> >> It is not yet clear why multiple nodes sometimes represent the memory state of a load, nor if this is expected. We can, however, resolve all the miscompiled test cases seen in this issue by improving the idealization of Phi nodes. Specifically, there is an idealization where we split Phis through input MergeMems, that we, prior to this changeset, applied too conservatively. 
>> >> To illustrate the idealization and how it resolves this issue, consider the example below. >> >> ![failure-graph-1](https://github.com/user-attachments/assets/ecbd204f-bdf0-49cb-a62e-8081d08cfe0c) >> >> `64 membar_release` is a critical anti-dependence for `183 loadI`. The anti-dependence search starts at the load's direct memory input, `107 Phi`, and stops immediately at Phis. Therefore, the search ends at `106 Phi` and we never find `64 membar_release`. >> >> We can apply the split-through-MergeMem Phi idealization to `119 Phi`. This idealization pushes `119 Phi` through `120 MergeMem` and `121 MergeMem`, splitting it into the individual inputs of the MergeMems in the process. As a result, we replace `119 Phi` with two new Phis. One of these generated Phis has identical inputs to `107 Phi` (`106 Phi` and `104 Phi`), and further idealizations will merge this new Phi and `107 Phi`. As a result, `107 Phi` then has a Phi-free path to `64 membar_release` and we correctly discover the anti-dependence. >> >> The changeset consists of the following changes. >> - Add an analysis that allows applying the split-through-MergeMem idealization in more cases than before (including in the above example) while still ensuring termination. >> - Add a missing `ResourceMark` in `PhiNode::split_out_instance`. >> - Add multiple new regression tests in `TestGCMLoadPlacement.java`. >> >> For reference, [here](https://github.com/openjdk/jdk/pull/22852) is a previous PR with an alternative fix that we decided to discard in favor of the fix in this PR. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/ac... > > Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: > > Remove test that no longer reproduces the issue > I found the fact that `merge_width` is used both as a counter and as a predicate to enable/disable the transformation slightly confusing, consider decoupling these uses for clarity. 
Here is a sketch of my proposed refactoring, feel free to merge (or not) and edit to your liking: [robcasloz at 8fdb4ba](https://github.com/robcasloz/jdk/commit/8fdb4ba3cfff955221a16b28b29e0726f339cea5). I have not tested it, please double-check that it preserves the behavior of your current changeset. Looks good and clearer to me, fixed. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23691#issuecomment-2682249650 From dlunden at openjdk.org Tue Feb 25 14:54:36 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 25 Feb 2025 14:54:36 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v2] In-Reply-To: References: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> Message-ID: On Tue, 25 Feb 2025 11:23:28 GMT, Roberto Castañeda Lozano wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove test that no longer reproduces the issue > > src/hotspot/share/opto/cfgnode.cpp line 2437: > >> 2435: worklist.push(this); >> 2436: visited.set(this->_idx); >> 2437: auto add_to_worklist = [&](Node* input) { > > Suggestion: rename to `maybe_add_to_worklist` to stress that it is a conditional addition. Sure, now fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1969942203 From adinn at openjdk.org Tue Feb 25 14:55:29 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 25 Feb 2025 14:55:29 GMT Subject: RFR: 8349921: Crash in codeBuffer.cpp:1004: guarantee(sect->end() <= tend) failed: sanity Message-ID: The compiler blob base size needs increasing in case the JDK is built without ZGC. The increment when ZGC is used can be comparably decreased. The final blob size increment when ZGC is included is overly generous and can also be decreased.
------------- Commit messages: - Ensure final and compiler stub sizes correctly allow for ZGC Changes: https://git.openjdk.org/jdk/pull/23776/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23776&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349921 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23776.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23776/head:pull/23776 PR: https://git.openjdk.org/jdk/pull/23776 From dlunden at openjdk.org Tue Feb 25 15:01:57 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 25 Feb 2025 15:01:57 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v2] In-Reply-To: References: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> Message-ID: On Tue, 25 Feb 2025 11:22:15 GMT, Roberto Castañeda Lozano wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove test that no longer reproduces the issue > > src/hotspot/share/opto/cfgnode.cpp line 1120: > >> 1118: } >> 1119: Compile *C = igvn->C; >> 1120: ResourceMark rm; > > I would suggest leaving this change to a separate (possible starter) RFE. The problem with leaving this to a separate RFE is that the idealization changes in this changeset trigger a memory consumption bug related to this missing `ResourceMark`. Specifically, without the `ResourceMark` we now hit the 1GB memory limit in quite a number of tests. Therefore, it seems counterproductive not to fix it as part of this changeset?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1969958922 From duke at openjdk.org Tue Feb 25 15:22:32 2025 From: duke at openjdk.org (Johannes Graham) Date: Tue, 25 Feb 2025 15:22:32 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v34] In-Reply-To: References: Message-ID: > An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it. > > In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items: > - Bounds optimization of xor > - A check for `x ^ x = 0` > - Explicit testing of xor over booleans. > > Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization. It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types. > > --------- > ### Progress > - [x] Change must not contain extraneous whitespace > - [x] Commit message must refer to an issue > - [ ] Change must be properly reviewed (2 reviews required, with at least 2 [Reviewers](https://openjdk.org/bylaws#reviewer)) > > > > ### Reviewers > * [Quan Anh Mai](https://openjdk.org/census#qamai) (@merykitty - Committer) ? Re-review required (review applies to [cf779497](https://git.openjdk.org/jdk/pull/23089/files/cf77949776f7a4601268c7291a5743c2eb164186)) > > ### Reviewing >
Using git > > Checkout this PR locally: \ > `$ git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089` \ > `$ git checkout pull/23089` > > Update a local copy of the PR: \ > `$ git checkout pull/23089` \ > `$ git pull https://git.openjdk.org/jdk.git pull/23089/head` > >
>
Using Skara CLI tools > > Checkout this PR locally: \ > `$ git pr checkout 23089` > > View PR using the GUI difftool: \ > `$ git pr show -t 23089` > >
>
Using diff file > > Download this PR as a diff file: \ > https://git.openjdk.org/jdk/pull/23089.diff > >
>
Using Webrev > > [Link to Webrev Comment](https://git.openjdk.org/jdk/pull/23089#issuecomment-2593992282) >
Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: add comments Co-authored-by: Emanuel Peter ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/40b1f9c4..bf8ba1b1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=33 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=32-33 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From duke at openjdk.org Tue Feb 25 16:21:59 2025 From: duke at openjdk.org (Johannes Graham) Date: Tue, 25 Feb 2025 16:21:59 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v33] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 07:49:27 GMT, Emanuel Peter wrote: >> test/hotspot/jtreg/compiler/c2/irTests/XorINodeIdealizationTests.java line 334: >> >>> 332: var xor = (x & 0b111) ^ (y & 0b100); >>> 333: return xor < 0b1000; >>> 334: } >> >> These are nice simple examples, and we should keep them. But I'm missing these cases with randomization. >> >> `calc_xor_upper_bound_of_non_neg` basically has two input ranges `[0, hi_0]` and `[0, hi_1]`, and gives us a new range `[0,max]`. >> >> You could produce random input ranges like this: >> >> public void test(int x) { >> int lo_x = con1; >> int hi_x = con2; >> x = x < lo_x ? lo_x : (x > hi_x ? hi_x : x); >> // x clamped to [lo_x, hi_x] >> int lo_y = con3; >> int hi_y = con4; >> y = y < lo_y ? lo_y : (y > hi_y ? hi_y : y); >> // y clamped to [lo_y, hi_y] >> int z = x ^ y; >> // This should now have a new range, possibly some [0, max] >> // Now let's test the range with some random if branches. 
>> int sum = 0; >> if (z > somecon1) { sum += 1; } >> if (z > somecon2) { sum += 2; } >> if (z > somecon3) { sum += 4; } >> // maybe add a few more... >> if (z > someconi) { sum += pow(2,i); } >> return sum; >> } >> >> The `sum` at the end gives you a summary over all the checks. If one wrongly constant folds, you'll be missing one of the power of 2 contributions to it, or have it wrongly added. >> Now you do this with an `x` and a `y` >> >> Maybe that's a little over-engineered, but it would target the `calc_xor_upper_bound_of_non_neg` logic really well. >> What do you think? > > FYI I'm working on fuzzing the compiler currently, so I'm thinking about how to write such tests more generally, that's why I took the time to come up with this. If you have better ideas, I'm more than happy to see them ;) I started down the path of more elaborate randomization (see code removed in https://github.com/openjdk/jdk/pull/23089/commits/4a2912021103cefbe30fb3cc9e7d303b63ea454d). Instead of that approach. I went with doing the more detailed coverage with gtest, and having just a few specific checks in the IR tests. For example the "pow2" tests are there because that was a scenario that caused some trouble. I think there is a relatively small set of "interesting" cases that are nice to cover with something deterministic. (I could add a few more tests with hard-coded interesting values). It feels disproportionate to do something more complicated for this PR. That said, if there was more tooling in place to make more broad random tests easier to write, it would be very attractive. Looking at PR https://github.com/openjdk/jdk/pull/23418, it looks like that's what you've got in mind. Specifically, these kinds of tests with constants are begging for a way to spew out a whole bunch of related test methods. 
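The clamp-then-xor test shape discussed above can be sketched in plain Java. This is a minimal illustration, not the code from the PR: the class and method names are hypothetical, and the bound formula (all bits up to the highest set bit of `hiX | hiY`) is an assumption about what a safe upper bound for non-negative inputs looks like, not a copy of `calc_xor_upper_bound_of_non_neg`.

```java
// Hypothetical sketch; names and the bound formula are assumptions, not PR code.
final class XorBoundSketch {
    // For 0 <= x <= hiX and 0 <= y <= hiY, x ^ y cannot set any bit above
    // the highest set bit of (hiX | hiY), so that all-ones mask is a safe bound.
    static int xorUpperBound(int hiX, int hiY) {
        int h = hiX | hiY;
        return h == 0 ? 0 : (-1 >>> Integer.numberOfLeadingZeros(h));
    }

    static int clamp(int v, int lo, int hi) {
        return v < lo ? lo : (v > hi ? hi : v);
    }

    // Same shape as the suggested test: clamp both inputs, xor them, then
    // fold a series of threshold comparisons into one bit mask.
    static int summarize(int x, int y, int hiX, int hiY, int[] thresholds) {
        int z = clamp(x, 0, hiX) ^ clamp(y, 0, hiY);
        int sum = 0;
        for (int i = 0; i < thresholds.length; i++) {
            if (z > thresholds[i]) {
                sum += 1 << i; // each threshold contributes its own power of two
            }
        }
        return sum;
    }

    // Exhaustively checks the bound for all ranges [0, hiX] x [0, hiY] of 4-bit values.
    static boolean boundHoldsForAll4BitRanges() {
        for (int hiX = 0; hiX < 16; hiX++) {
            for (int hiY = 0; hiY < 16; hiY++) {
                int bound = xorUpperBound(hiX, hiY);
                for (int x = 0; x <= hiX; x++) {
                    for (int y = 0; y <= hiY; y++) {
                        if ((x ^ y) > bound) {
                            return false;
                        }
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(xorUpperBound(0b111, 0b100));
        System.out.println(boundHoldsForAll4BitRanges());
    }
}
```

If a compiler wrongly folded one of the comparisons in `summarize`, the returned mask would differ from the interpreter's result in exactly that bit, which is what makes the summed form convenient for comparing compiled and interpreted runs.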
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1970116155 From duke at openjdk.org Tue Feb 25 16:24:59 2025 From: duke at openjdk.org (Johannes Graham) Date: Tue, 25 Feb 2025 16:24:59 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v33] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 16:19:39 GMT, Johannes Graham wrote: >> FYI I'm working on fuzzing the compiler currently, so I'm thinking about how to write such tests more generally, that's why I took the time to come up with this. If you have better ideas, I'm more than happy to see them ;) > > I started down the path of more elaborate randomization (see code removed in https://github.com/openjdk/jdk/pull/23089/commits/4a2912021103cefbe30fb3cc9e7d303b63ea454d). Instead of that approach. I went with doing the more detailed coverage with gtest, and having just a few specific checks in the IR tests. For example the "pow2" tests are there because that was a scenario that caused some trouble. I think there is a relatively small set of "interesting" cases that are nice to cover with something deterministic. (I could add a few more tests with hard-coded interesting values). It feels disproportionate to do something more complicated for this PR. > > That said, if there was more tooling in place to make more broad random tests easier to write, it would be very attractive. Looking at PR https://github.com/openjdk/jdk/pull/23418, it looks like that's what you've got in mind. Specifically, these kinds of tests with constants are begging for a way to spew out a whole bunch of related test methods. Also, as you commented on somewhere above, having a way to target Types from a stand-alone gtest, would also be a really nice way of making things more testable. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1970120588 From chagedorn at openjdk.org Tue Feb 25 16:45:00 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 25 Feb 2025 16:45:00 GMT Subject: RFR: 8348367: Remove hotspot_not_fast_compiler and hotspot_slow_compiler test groups [v2] In-Reply-To: References: Message-ID: On Thu, 23 Jan 2025 06:22:20 GMT, Leonid Mesnik wrote: >> The test groups >> hotspot_not_fast_compiler and hotspot_slow_compiler >> were used by Oracle. However, tier2/tier3 are now used instead. >> >> So the fix removes these groups and adds an exclusion of "slow" tests from the tier1 groups. The content remains the same except >> -compiler/memoryinitialization/ZeroTLABTest.java >> in tier1_compiler_3. >> >> >> The tier1/2/3 groups are organized similarly to tiers in other groups, where >> tier1 = set of groups >> - some long tests/groups >> tier2 = set of groups >> - some long tests/groups >> - tier1 >> tier3 = >> hotspot_compiler >> - tier1 >> - tier2 >> >> >> The test group >> compiler/codegen/ \ >> was in both tier1 and tier2, so it should be removed from tier2; no tests are affected. > > Leonid Mesnik has updated the pull request incrementally with one additional commit since the last revision: > > arraycopy reverted Looks reasonable to me, too! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23253#pullrequestreview-2641773535 From duke at openjdk.org Tue Feb 25 17:05:46 2025 From: duke at openjdk.org (Johannes Graham) Date: Tue, 25 Feb 2025 17:05:46 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v35] In-Reply-To: References: Message-ID: <8PodSUC9-OCYrybF4WWcPcZew9hpV_xT64Yw7-PvHsk=.5562c096-e2a5-4ac0-8eac-69509fdbcc63@github.com> > An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized.
This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it. > > In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items: > - Bounds optimization of xor > - A check for `x ^ x = 0` > - Explicit testing of xor over booleans. > > Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization. It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types. > > --------- > ### Progress > - [x] Change must not contain extraneous whitespace > - [x] Commit message must refer to an issue > - [ ] Change must be properly reviewed (2 reviews required, with at least 2 [Reviewers](https://openjdk.org/bylaws#reviewer)) > > > > ### Reviewers > * [Quan Anh Mai](https://openjdk.org/census#qamai) (@merykitty - Committer) ? Re-review required (review applies to [cf779497](https://git.openjdk.org/jdk/pull/23089/files/cf77949776f7a4601268c7291a5743c2eb164186)) > > ### Reviewing >
Johannes Graham has updated the pull request incrementally with two additional commits since the last revision: - widen range of test values; add missing comment - a few more tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/bf8ba1b1..b1e79dcd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=34 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=33-34 Stats: 51 lines in 3 files changed: 43 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From kvn at openjdk.org Tue Feb 25 17:32:02 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 25 Feb 2025 17:32:02 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 09:27:13 GMT, Emanuel Peter wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. 
>> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! >> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 66 commits: > > - Merge branch 'master' into JDK-8323582-SW-native-alignment > - stall -> delay, plus some more comments > - adjust selector if probability > - Merge branch 'master' into JDK-8323582-SW-native-alignment > - remove multiversion mark if we break the structure > - register opaque with igvn > - copyright and rm CFG check > - IR rules for all cases > - 3 test versions > - test changed to unaligned ints > - ... and 56 more: https://git.openjdk.org/jdk/compare/d551daca...8eb52292 This looks good to me. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22016#pullrequestreview-2641927937 From kvn at openjdk.org Tue Feb 25 17:32:02 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 25 Feb 2025 17:32:02 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> Message-ID: <_pnjKfnS2e4hYWJ5_y8CudFAOmKB7FrD8cad8wCfZus=.16ac819a-2a99-4a8b-9640-3fa3bde53970@github.com> On Tue, 25 Feb 2025 07:09:24 GMT, Emanuel Peter wrote: > > PS: "slow" path implies that it is not taken frequently and it should not affect general performance of application. > > For me "slow" just means less optimized, because some assumption does not hold. The "fast" path is faster, because it has more assumptions and can optimize more (i.e. vectorize in this case, or vectorize more instructions). Do you have a better name than "fast/slow"? I think I nit-picked here. I see your good comments in `loopTransform.cpp` and `loopnode.hpp` explaining multiversioning fast_loop/slow_loop. I think it is fine to keep "slow/fast". We can use "uncommon" to indicate an infrequent path.
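The fast_loop/slow_loop split discussed above can be illustrated at source level with a hand-written analogue. This is only a sketch under stated assumptions: C2 performs multiversioning on its internal loop representation, not in Java source; all names here are hypothetical; and the base address is passed in as a plain `long` because ordinary Java arrays do not expose their addresses. Both loop copies compute the same result; only the "fast" copy is allowed to rely on the alignment assumption.

```java
// Hypothetical source-level analogue of a multiversioned loop; not C2 code.
final class MultiversionSketch {
    static final int VECTOR_BYTES = 16; // assumed vector width for the sketch

    // Runtime check corresponding to the speculative alignment assumption.
    // The alignment must be a power of two.
    static boolean isAligned(long address, int alignment) {
        return (address & (alignment - 1)) == 0;
    }

    // "Fast" copy: stands in for the vectorized loop (here: unrolled by 4).
    static void fastLoop(int[] data) {
        int i = 0;
        for (; i + 4 <= data.length; i += 4) {
            data[i]++;
            data[i + 1]++;
            data[i + 2]++;
            data[i + 3]++;
        }
        for (; i < data.length; i++) {
            data[i]++; // scalar tail
        }
    }

    // "Slow" copy: makes no assumption, so it stays scalar but is still compiled.
    static void slowLoop(int[] data) {
        for (int i = 0; i < data.length; i++) {
            data[i]++;
        }
    }

    // The selector: one check up front decides which loop copy runs.
    static String increment(int[] data, long baseAddress) {
        if (isAligned(baseAddress, VECTOR_BYTES)) {
            fastLoop(data);
            return "fast";
        }
        slowLoop(data);
        return "slow";
    }

    public static void main(String[] args) {
        System.out.println(increment(new int[] {1, 2, 3, 4, 5}, 64));
        System.out.println(increment(new int[] {1, 2, 3, 4, 5}, 65));
    }
}
```

In this reading, "slow" only means "compiled with fewer assumptions", which matches the naming argument in the thread: an unaligned base still runs compiled code, just not the vectorized copy.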
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2682745643 From rcastanedalo at openjdk.org Tue Feb 25 17:32:59 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 25 Feb 2025 17:32:59 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v2] In-Reply-To: References: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> Message-ID: <4zYXuVyDQ0ipaQlOWIb0mdXy9VgmqyinCVDBbGTZ4ug=.ad01cd5f-4be2-451a-bcc6-966c8097a660@github.com> On Tue, 25 Feb 2025 14:59:12 GMT, Daniel Lundén wrote: >> src/hotspot/share/opto/cfgnode.cpp line 1120: >> >>> 1118: } >>> 1119: Compile *C = igvn->C; >>> 1120: ResourceMark rm; >> >> I would suggest leaving this change to a separate (possible starter) RFE. > > The problem with leaving this to a separate RFE is that the idealization changes in this changeset trigger a memory consumption bug related to this missing `ResourceMark`. Specifically, without the `ResourceMark` we now hit the 1GB memory limit in quite a number of tests. Therefore, it seems counterproductive to not fix as part of this changeset? Fair enough, I was not aware of the interaction. Please leave this conversation unresolved as additional insight for future reviewers. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1970243106 From kvn at openjdk.org Tue Feb 25 17:42:59 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 25 Feb 2025 17:42:59 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v4] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 08:00:10 GMT, Marc Chevalier wrote: >> Remove frem and drem macro nodes when the result is not used. These nodes have other outputs (like memory), which are not meaningful but prevent them from being dropped so easily.
This patch removes the useless frem/drem nodes by rewiring the inputs to the outputs. >> >> Thanks, >> Marc > > Marc Chevalier has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > Improve deletion of [fr]rem after parsing The same new condition is used in 3 places; consider folding it into a separate function. ------------- Changes requested by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23694#pullrequestreview-2641953804 From epeter at openjdk.org Tue Feb 25 17:45:54 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 17:45:54 GMT Subject: RFR: 8349503: Consolidate multi-byte access into ByteArray In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 23:41:19 GMT, Chen Liang wrote: > `MethodHandles.byteArrayViewVarHandle` exposes checked multi-byte access to byte arrays via VarHandle. This larger access speeds up many operations, yet it cannot be used in early bootstrap, and as a result, people tend to use `Unsafe` which can threaten memory safety of the Java Platform. > > To promote the safe use of multi-byte access, I propose to move the checked implementations from VarHandle to ByteArray to allow earlier use and reduce maintenance costs. In addition, ByteArrayLittleEndian is consolidated, and now the access methods are distinguished by BO (byte order) / BE (big endian) / LE (little endian) suffixes to indicate their access features. test/micro/org/openjdk/bench/vm/compiler/MergeStores.java line 175: > 173: public byte[] store_B2_con_offs_allocate_bale() { > 174: byte[] aB = new byte[RANGE]; > 175: ByteArray.setShortLE(aB, offset, (short)0x0201); Did you run this benchmark to see if there is any impact?
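For readers without access to the internal `ByteArray` class, the effect of a little-endian short store can be shown with the public `MethodHandles.byteArrayViewVarHandle` API that the quoted description mentions. This is a minimal sketch with hypothetical class and method names; it does not call `ByteArray.setShortLE` itself and makes no claim about how that method is implemented or how it performs.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical sketch using only public APIs; not the internal ByteArray class.
final class ShortStoreSketch {
    // View of a byte[] as little-endian shorts, indexed by byte offset.
    private static final VarHandle SHORT_LE =
            MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);

    // Two single-byte stores: low byte first, then high byte.
    static void setShortLEBytewise(byte[] a, int offset, short v) {
        a[offset] = (byte) v;
        a[offset + 1] = (byte) (v >> 8);
    }

    // One checked multi-byte store through the VarHandle view.
    static void setShortLEView(byte[] a, int offset, short v) {
        SHORT_LE.set(a, offset, v);
    }

    public static void main(String[] args) {
        byte[] a = new byte[4];
        setShortLEView(a, 1, (short) 0x0201);
        System.out.println(java.util.Arrays.toString(a));
    }
}
```

Both variants leave the same bytes in the array; the difference a benchmark such as `MergeStores` would be looking at is whether the compiler ends up emitting one wider store in each case.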
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23478#discussion_r1970261194 From epeter at openjdk.org Tue Feb 25 17:52:03 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 17:52:03 GMT Subject: RFR: 8342676: Unsigned Vector Min / Max transforms [v2] In-Reply-To: References: <21riF_Q0FMyzOh_sakTclKfYa-nJm4klfkyHEYi4ctI=.76933a14-fb5e-447e-873a-59a2b870b842@github.com> Message-ID: On Tue, 7 Jan 2025 08:58:12 GMT, Jatin Bhateja wrote: >> Adding following IR transforms for unsigned vector Min / Max nodes. >> >> => UMinV (UMinV(a, b), UMaxV(a, b)) => UMinV(a, b) >> => UMinV (UMinV(a, b), UMaxV(b, a)) => UMinV(a, b) >> => UMaxV (UMinV(a, b), UMaxV(a, b)) => UMaxV(a, b) >> => UMaxV (UMinV(a, b), UMaxV(b, a)) => UMaxV(a, b) >> => UMaxV (a, a) => a >> => UMinV (a, a) => a >> >> New IR validation test accompanies the patch. >> >> This is a follow-up PR for https://github.com/openjdk/jdk/pull/20507 >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
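The identities listed in the quoted description can be checked on scalars. This is a hedged sketch: the helper names `umin` and `umax` are hypothetical scalar stand-ins for the vector `UMinV` and `UMaxV` nodes, and the check says nothing about how C2 matches or applies the transforms.

```java
// Hypothetical scalar stand-ins for UMinV / UMaxV; not C2 or Vector API code.
final class UnsignedMinMaxSketch {
    static int umin(int a, int b) {
        return Integer.compareUnsigned(a, b) <= 0 ? a : b;
    }

    static int umax(int a, int b) {
        return Integer.compareUnsigned(a, b) >= 0 ? a : b;
    }

    // Evaluates each listed transform as an equality between its two sides.
    static boolean identitiesHold(int a, int b) {
        return umin(umin(a, b), umax(a, b)) == umin(a, b)
            && umin(umin(a, b), umax(b, a)) == umin(a, b)
            && umax(umin(a, b), umax(a, b)) == umax(a, b)
            && umax(umin(a, b), umax(b, a)) == umax(a, b)
            && umax(a, a) == a
            && umin(a, a) == a;
    }

    public static void main(String[] args) {
        // -1 is the largest value when interpreted as unsigned.
        System.out.println(identitiesHold(-1, 1));
    }
}
```

The identities hold because `umin(a, b)` and `umax(a, b)` are always the smaller and larger of the same two values, so taking the minimum or maximum of that pair again returns one of them unchanged.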
The pull request contains five additional commits since the last revision: > > - Updating copyright year of modified files > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342676 > - Update IR transforms and tests > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342676 > - 8342676: Unsigned Vector Min / Max transforms @jatin-bhateja Just ping me here if this is ready for another review ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/21604#issuecomment-2682843503 From kvn at openjdk.org Tue Feb 25 18:52:57 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 25 Feb 2025 18:52:57 GMT Subject: RFR: 8349921: Crash in codeBuffer.cpp:1004: guarantee(sect->end() <= tend) failed: sanity In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 14:48:26 GMT, Andrew Dinn wrote: > The compiler blob base size needs increasing in case the JDK is built without ZGC. The increment when ZGC is used can be comparably decreased. The final blob size increment when ZGC is included is over generous and can also be decreased. Good. 
src/hotspot/cpu/aarch64/stubDeclarations_aarch64.hpp line 47: > 45: do_arch_entry, \ > 46: do_arch_entry_init) \ > 47: do_arch_blob(compiler, 35000 ZGC_ONLY(+5000)) \ Alignment of `` ------------- PR Review: https://git.openjdk.org/jdk/pull/23776#pullrequestreview-2642114927 PR Review Comment: https://git.openjdk.org/jdk/pull/23776#discussion_r1970350305 From dlunden at openjdk.org Tue Feb 25 19:11:15 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 25 Feb 2025 19:11:15 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v4] In-Reply-To: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> References: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> Message-ID: > When searching for load anti-dependences in GCM, the memory state for the load is sometimes represented not only by the memory node input of the load, but also other memory nodes. Because PhaseCFG::insert_anti_dependences searches for anti-dependences only from the load's memory input, it is, therefore, possible to sometimes overlook anti-dependences. The result is that loads are potentially scheduled too late, after stores that redefine the memory states of the loads. > > ### Changeset > > It is not yet clear why multiple nodes sometimes represent the memory state of a load, nor if this is expected. We can, however, resolve all the miscompiled test cases seen in this issue by improving the idealization of Phi nodes. Specifically, there is an idealization where we split Phis through input MergeMems, that we, prior to this changeset, applied too conservatively. > > To illustrate the idealization and how it resolves this issue, consider the example below. 
> > ![failure-graph-1](https://github.com/user-attachments/assets/ecbd204f-bdf0-49cb-a62e-8081d08cfe0c) > > `64 membar_release` is a critical anti-dependence for `183 loadI`. The anti-dependence search starts at the load's direct memory input, `107 Phi`, and stops immediately at Phis. Therefore, the search ends at `106 Phi` and we never find `64 membar_release`. > > We can apply the split-through-MergeMem Phi idealization to `119 Phi`. This idealization pushes `119 Phi` through `120 MergeMem` and `121 MergeMem`, splitting it into the individual inputs of the MergeMems in the process. As a result, we replace `119 Phi` with two new Phis. One of these generated Phis has identical inputs to `107 Phi` (`106 Phi` and `104 Phi`), and further idealizations will merge this new Phi and `107 Phi`. As a result, `107 Phi` then has a Phi-free path to `64 membar_release` and we correctly discover the anti-dependence. > > The changeset consists of the following changes. > - Add an analysis that allows applying the split-through-MergeMem idealization in more cases than before (including in the above example) while still ensuring termination. > - Add a missing `ResourceMark` in `PhiNode::split_out_instance`. > - Add multiple new regression tests in `TestGCMLoadPlacement.java`. > > For reference, [here](https://github.com/openjdk/jdk/pull/22852) is a previous PR with an alternative fix that we decided to discard in favor of the fix in this PR. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/13394882532) > - `tier1` to `tier4` (an... 
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Fix subtle bug introduced in previous update ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23691/files - new: https://git.openjdk.org/jdk/pull/23691/files/c9621333..8e009abe Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23691&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23691&range=02-03 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23691.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23691/head:pull/23691 PR: https://git.openjdk.org/jdk/pull/23691 From lmesnik at openjdk.org Tue Feb 25 19:23:06 2025 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Tue, 25 Feb 2025 19:23:06 GMT Subject: Integrated: 8348367: Remove hotspot_not_fast_compiler and hotspot_slow_compiler test groups In-Reply-To: References: Message-ID: On Thu, 23 Jan 2025 06:03:48 GMT, Leonid Mesnik wrote: > The test groups > hotspot_not_fast_compiler and hotspot_slow_compiler > were used by Oracle. However, tier2/tier3 are now used instead. > > So the fix removes these groups and adds an exclusion of "slow" tests from the tier1 groups. The content remains the same except > -compiler/memoryinitialization/ZeroTLABTest.java > in tier1_compiler_3. > > > The tier1/2/3 groups are organized similarly to the tiers in other groups, where > tier1 = set of groups > - some long tests/groups > tier2 = set of groups > - some long tests/groups > - tier1 > tier3 = > hotspot_compiler > - tier1 > - tier2 > > > The test group > compiler/codegen/ \ > was in both tier1 and tier2, so it should be removed from tier2; no tests are affected. This pull request has now been integrated.
Changeset: 0151b15b Author: Leonid Mesnik URL: https://git.openjdk.org/jdk/commit/0151b15b7cc077a30b00f2af4a5e3f831d1d92cb Stats: 24 lines in 1 file changed: 5 ins; 16 del; 3 mod 8348367: Remove hotspot_not_fast_compiler and hotspot_slow_compiler test groups Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/23253 From lmesnik at openjdk.org Tue Feb 25 19:23:06 2025 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Tue, 25 Feb 2025 19:23:06 GMT Subject: Integrated: 8339889: Several compiler tests ignore vm flags and not marked as flagless In-Reply-To: <5acZ_FmW23VeDgOFMiEuUa60TLxaOcC3wWZVwHFh8EU=.95188fc9-2f54-47af-a91c-4855db76f399@github.com> References: <5acZ_FmW23VeDgOFMiEuUa60TLxaOcC3wWZVwHFh8EU=.95188fc9-2f54-47af-a91c-4855db76f399@github.com> Message-ID: On Wed, 22 Jan 2025 01:10:44 GMT, Leonid Mesnik wrote: > The tests > compiler/c2/TestReduceAllocationAndHeapDump.java > compiler/calls/NativeCalls.java > compiler/debug/TestStress.java > compiler/inlining/TestDuplicatedLateInliningOutput.java > ignore VM flags by using the limited process builder and are not marked as flagless. > > Please note that the test > compiler/inlining/TestDuplicatedLateInliningOutput.java > fails with some VM flags. See > https://bugs.openjdk.org/browse/JDK-8348214 > > I haven't excluded the test, since it fails only with certain uncommon flags. This pull request has now been integrated.
Changeset: 829d7a84 Author: Leonid Mesnik URL: https://git.openjdk.org/jdk/commit/829d7a845e18ec483379abf3a3fccb596d899f25 Stats: 8 lines in 4 files changed: 4 ins; 0 del; 4 mod 8339889: Several compiler tests ignore vm flags and not marked as flagless Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/23224 From jbhateja at openjdk.org Tue Feb 25 20:13:38 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 25 Feb 2025 20:13:38 GMT Subject: RFR: 8342676: Unsigned Vector Min / Max transforms [v3] In-Reply-To: <21riF_Q0FMyzOh_sakTclKfYa-nJm4klfkyHEYi4ctI=.76933a14-fb5e-447e-873a-59a2b870b842@github.com> References: <21riF_Q0FMyzOh_sakTclKfYa-nJm4klfkyHEYi4ctI=.76933a14-fb5e-447e-873a-59a2b870b842@github.com> Message-ID: > Adding following IR transforms for unsigned vector Min / Max nodes. > > => UMinV (UMinV(a, b), UMaxV(a, b)) => UMinV(a, b) > => UMinV (UMinV(a, b), UMaxV(b, a)) => UMinV(a, b) > => UMaxV (UMinV(a, b), UMaxV(a, b)) => UMaxV(a, b) > => UMaxV (UMinV(a, b), UMaxV(b, a)) => UMaxV(a, b) > => UMaxV (a, a) => a > => UMinV (a, a) => a > > New IR validation test accompanies the patch. > > This is a follow-up PR for https://github.com/openjdk/jdk/pull/20507 > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: - Review suggestions incorporated. 
- Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342676 - Updating copyright year of modified files - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342676 - Update IR transforms and tests - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8342676 - 8342676: Unsigned Vector Min / Max transforms ------------- Changes: - all: https://git.openjdk.org/jdk/pull/21604/files - new: https://git.openjdk.org/jdk/pull/21604/files/cc39220a..e9e09a5b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=21604&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21604&range=01-02 Stats: 217192 lines in 5167 files changed: 105274 ins; 89254 del; 22664 mod Patch: https://git.openjdk.org/jdk/pull/21604.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21604/head:pull/21604 PR: https://git.openjdk.org/jdk/pull/21604 From duke at openjdk.org Tue Feb 25 21:13:36 2025 From: duke at openjdk.org (Marc Chevalier) Date: Tue, 25 Feb 2025 21:13:36 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v5] In-Reply-To: References: Message-ID: > Remove frem and drem macro nodes when the result is not used. These nodes have other outputs (like memory), which are not meaningful but prevent them from being dropped easily. This patch removes the useless frem/drem nodes by rewiring the inputs to the outputs.
> > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Factor testing whether a node is a data proj of a pure function ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23694/files - new: https://git.openjdk.org/jdk/pull/23694/files/8208f4af..2582890a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23694&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23694&range=03-04 Stats: 14 lines in 4 files changed: 11 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23694.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23694/head:pull/23694 PR: https://git.openjdk.org/jdk/pull/23694 From fyang at openjdk.org Wed Feb 26 00:30:58 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 26 Feb 2025 00:30:58 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v6] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 13:26:20 GMT, Hamlin Li wrote: > Previously, we have a `TAIL` label at [1], seems when it get here, it could also exceed the boundary of the string? No, it won't. The sub instruction at [1][2][3] will ensure that. Let me know if you have a case that will. [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1478 [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1492 [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1506 > But seems to me exceeding the boundary should be fine, as we can not exceed the page boundary in this way. [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1550 But the loaded 64-bit values are compared later at L1435. So I think it will make a difference here. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1970724146 From xgong at openjdk.org Wed Feb 26 01:23:50 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 26 Feb 2025 01:23:50 GMT Subject: RFR: 8350463: AArch64: Add vector rearrange support for small lane count vectors Message-ID: The AArch64 vector rearrange implementation currently lacks support for vector types with lane counts < 4 (see [1]). This limitation results in significant performance gaps when running Long/Double vector benchmarks on NVIDIA Grace (SVE2 architecture with 128-bit vectors) compared to other SVE and x86 platforms. Vector rearrange operations depend on vector shuffle inputs, which used byte array as payload previously. The minimum vector lane count of 4 for byte type on AArch64 imposed this limitation on rearrange operations. However, vector shuffle payload has been updated to use vector-specific data types (e.g., `int` for `IntVector`) (see [2]). This change enables us to remove the lane count restriction for vector rearrange operations. This patch added the rearrange support for vector types with small lane count. Here are the main changes: - Added AArch64 match rule support for `VectorRearrange` with smaller lane counts (e.g., `2D/2S`) - Relocated NEON implementation from ad file to c2 macro assembler file for better handling of complex implementation - Optimized temporary register usage in NEON implementation for short/int/float types from two registers to one Following is the performance improvement data of several Vector API JMH benchmarks, on a NVIDIA Grace CPU with NEON and SVE. Performance of the same JMH with other vector types remains unchanged. 
1) NEON JMH on panama-vector:vectorIntrinsics: Benchmark (size) Mode Cnt Units Before After Gain Double128Vector.rearrange 1024 thrpt 30 ops/ms 78.060 578.859 7.42x Double128Vector.sliceUnary 1024 thrpt 30 ops/ms 72.332 1811.664 25.05x Double128Vector.unsliceUnary 1024 thrpt 30 ops/ms 72.256 1812.344 25.08x Float64Vector.rearrange 1024 thrpt 30 ops/ms 77.879 558.797 7.18x Float64Vector.sliceUnary 1024 thrpt 30 ops/ms 70.528 1981.304 28.09x Float64Vector.unsliceUnary 1024 thrpt 30 ops/ms 71.735 1994.168 27.79x Int64Vector.rearrange 1024 thrpt 30 ops/ms 76.374 562.106 7.36x Int64Vector.sliceUnary 1024 thrpt 30 ops/ms 71.680 1190.127 16.60x Int64Vector.unsliceUnary 1024 thrpt 30 ops/ms 71.895 1185.094 16.48x Long128Vector.rearrange 1024 thrpt 30 ops/ms 78.902 579.250 7.34x Long128Vector.sliceUnary 1024 thrpt 30 ops/ms 72.389 747.794 10.33x Long128Vector.unsliceUnary 1024 thrpt 30 ops/ms 71.999 747.848 10.38x JMH on jdk mainline: Benchmark (SIZE) Mode Cnt Units Before After Gain SelectFromBenchmark.rearrangeFromDoubleVector 1024 thrpt 30 ops/ms 44.593 1319.977 29.63x SelectFromBenchmark.rearrangeFromDoubleVector 2048 thrpt 30 ops/ms 22.318 660.061 29.58x SelectFromBenchmark.rearrangeFromLongVector 1024 thrpt 30 ops/ms 45.823 1458.144 31.82x SelectFromBenchmark.rearrangeFromLongVector 2048 thrpt 30 ops/ms 23.050 729.881 31.67x VectorXXH3HashingBenchmark.hashingKernel 1024 thrpt 30 ops/ms 97.210 1082.884 11.14x VectorXXH3HashingBenchmark.hashingKernel 2048 thrpt 30 ops/ms 48.642 541.341 11.13x VectorXXH3HashingBenchmark.hashingKernel 4096 thrpt 30 ops/ms 24.285 270.419 11.14x VectorXXH3HashingBenchmark.hashingKernel 8192 thrpt 30 ops/ms 12.421 135.115 10.88x 2) SVE JMH on panama-vector:vectorIntrinsics: Benchmark (size) Mode Cnt Units Before After Gain Double128Vector.rearrange 1024 thrpt 30 ops/ms 78.396 577.744 7.37x Double128Vector.sliceUnary 1024 thrpt 30 ops/ms 72.119 2538.261 35.19x Double128Vector.unsliceUnary 1024 thrpt 30 ops/ms 72.992 2536.972 34.75x 
Float64Vector.rearrange 1024 thrpt 30 ops/ms 77.400 561.934 7.26x Float64Vector.sliceUnary 1024 thrpt 30 ops/ms 70.858 2949.076 41.61x Float64Vector.unsliceUnary 1024 thrpt 30 ops/ms 70.654 2954.273 41.81x Int64Vector.rearrange 1024 thrpt 30 ops/ms 77.851 563.969 7.24x Int64Vector.sliceUnary 1024 thrpt 30 ops/ms 67.433 1510.484 22.39x Int64Vector.unsliceUnary 1024 thrpt 30 ops/ms 66.614 1511.617 22.69x Long128Vector.rearrange 1024 thrpt 30 ops/ms 77.637 579.021 7.46x Long128Vector.sliceUnary 1024 thrpt 30 ops/ms 69.886 1274.331 18.23x Long128Vector.unsliceUnary 1024 thrpt 30 ops/ms 70.069 1273.787 18.17x JMH on jdk mainline: Benchmark (SIZE) Mode Cnt Units Before After Gain SelectFromBenchmark.rearrangeFromDoubleVector 1024 thrpt 30 ops/ms 44.612 1351.850 30.30x SelectFromBenchmark.rearrangeFromDoubleVector 2048 thrpt 30 ops/ms 22.315 676.314 30.31x SelectFromBenchmark.rearrangeFromLongVector 1024 thrpt 30 ops/ms 46.372 1502.036 32.39x SelectFromBenchmark.rearrangeFromLongVector 2048 thrpt 30 ops/ms 23.361 749.133 32.07x VectorXXH3HashingBenchmark.hashingKernel 1024 thrpt 30 ops/ms 97.780 1759.061 17.99x VectorXXH3HashingBenchmark.hashingKernel 2048 thrpt 30 ops/ms 48.923 879.584 17.98x VectorXXH3HashingBenchmark.hashingKernel 4096 thrpt 30 ops/ms 24.219 439.588 18.15x VectorXXH3HashingBenchmark.hashingKernel 8192 thrpt 30 ops/ms 12.416 219.603 17.69x [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L209 [2] https://bugs.openjdk.org/browse/JDK-8310691 ------------- Commit messages: - 8350463: AArch64: Add vector rearrange support for small lane count vectors Changes: https://git.openjdk.org/jdk/pull/23790/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23790&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350463 Stats: 169 lines in 4 files changed: 60 ins; 86 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/23790.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23790/head:pull/23790 PR: 
https://git.openjdk.org/jdk/pull/23790 From dlong at openjdk.org Wed Feb 26 01:32:55 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 26 Feb 2025 01:32:55 GMT Subject: RFR: 8347426: Invalid value used for enum Cell in iTypeFlow::StateVector::meet_exception In-Reply-To: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com> References: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com> Message-ID: On Tue, 25 Feb 2025 10:11:54 GMT, Marc Chevalier wrote: > As guessed on the JBS ticket, we have a UB when `_outer->max_locals() == 0`, because then we try to do `(Cell)(-1)` which is out of range since Cell's range is [0, `INT_MAX`]. > > The obvious fix that is > > Cell limit = local(_outer->max_locals()); > for (Cell c = start_cell(); c < limit; c = next_cell(c)) { > > since `local` asserts its argument to be in [0, `outer->max_locals()`). Of course > > Cell limit = (Cell)(_outer->max_locals()); > > would work, but it seems to break (the very light) abstraction. > > I've also added an assert to transform the UB into a clear failure. > > This fix makes the UB warning go away on Mac with arm64. > > Thanks, > Marc This seems fine. One alternative would be to introduce a new helper: Cell local_limit_cell() const { return (Cell)(outer()->max_locals()); } similar to the existing `limit_cell`. A 2nd alternative would be something like: for (int i = 0; i < _outer->max_locals(); ++i) { Cell c = local(i); [....] ------------- Marked as reviewed by dlong (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23772#pullrequestreview-2642815317 From duke at openjdk.org Wed Feb 26 02:51:12 2025 From: duke at openjdk.org (Johannes Graham) Date: Wed, 26 Feb 2025 02:51:12 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v36] In-Reply-To: References: Message-ID: <6j6VGTWSXJdlOJOWkf80u2d9j27AsHETHhSY6685kOY=.a0db0f78-39a4-489a-ba42-813da344e026@github.com> > An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it. > > In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items: > - Bounds optimization of xor > - A check for `x ^ x = 0` > - Explicit testing of xor over booleans. > > Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization. It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types. > > --------- > ### Progress > - [x] Change must not contain extraneous whitespace > - [x] Commit message must refer to an issue > - [ ] Change must be properly reviewed (2 reviews required, with at least 2 [Reviewers](https://openjdk.org/bylaws#reviewer)) > > > > ### Reviewers > * [Quan Anh Mai](https://openjdk.org/census#qamai) (@merykitty - Committer) ? 
Re-review required (review applies to [cf779497](https://git.openjdk.org/jdk/pull/23089/files/cf77949776f7a4601268c7291a5743c2eb164186))
Johannes Graham has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 51 commits: - Merge branch 'openjdk:master' into xor_const - widen range of test values; add missing comment - a few more tests - add comments Co-authored-by: Emanuel Peter - update tests - Fix formatting Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> - Merge branch 'openjdk:master' into xor_const - fix variable names in comments - update test - address review comments - ... and 41 more: https://git.openjdk.org/jdk/compare/037e4711...6d60ae2a ------------- Changes: https://git.openjdk.org/jdk/pull/23089/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=35 Stats: 430 lines in 5 files changed: 385 ins; 25 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From epeter at openjdk.org Wed Feb 26 06:31:58 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 06:31:58 GMT Subject: RFR: 8347405: MergeStores with reverse bytes order value [v11] In-Reply-To: References: Message-ID: On Tue, 28 Jan 2025 03:09:28 GMT, kuaiwei wrote: >> This patch enhance MergeStores optimization to support merge value with reverse byte order. 
>> >> Below is benchmark result before and after the patch: >> >> On aliyun g8y (aarch64) >> |name | before | score2 | ratio | >> |---|---|---|---| >> |MergeStoreBench.setCharBS |5669.655000 |5669.566000 | 0.00 %| >> |MergeStoreBench.setCharBV |5516.911000 |5516.273000 | 0.01 %| >> |MergeStoreBench.setCharC |5578.644000 |5552.809000 | 0.47 %| >> |MergeStoreBench.setCharLS |5782.140000 |5779.264000 | 0.05 %| >> |MergeStoreBench.setCharLV |5496.403000 |5499.195000 | -0.05 %| >> |MergeStoreBench.setIntB |6087.703000 |2768.385000 | 119.90 %| >> |MergeStoreBench.setIntBU |6733.813000 |2950.240000 | 128.25 %| >> |MergeStoreBench.setIntBV |1362.233000 |1361.821000 | 0.03 %| >> |MergeStoreBench.setIntL |2834.785000 |2833.042000 | 0.06 %| >> |MergeStoreBench.setIntLU |2947.145000 |2946.874000 | 0.01 %| >> |MergeStoreBench.setIntLV |5506.791000 |5506.229000 | 0.01 %| >> |MergeStoreBench.setIntRB |7634.279000 |5611.058000 | 36.06 %| >> |MergeStoreBench.setIntRBU |7766.737000 |5551.281000 | 39.91 %| >> |MergeStoreBench.setIntRL |5689.793000 |5689.385000 | 0.01 %| >> |MergeStoreBench.setIntRLU |5628.287000 |5628.789000 | -0.01 %| >> |MergeStoreBench.setIntRU |5536.039000 |5534.910000 | 0.02 %| >> |MergeStoreBench.setIntU |5595.363000 |5567.810000 | 0.49 %| >> |MergeStoreBench.setLongB |13722.671000 |6811.098000 | 101.48 %| >> |MergeStoreBench.setLongBU |13728.844000 |4280.240000 | 220.75 %| >> |MergeStoreBench.setLongBV |2785.255000 |2785.949000 | -0.02 %| >> |MergeStoreBench.setLongL |5714.615000 |5710.402000 | 0.07 %| >> |MergeStoreBench.setLongLU |4128.746000 |4129.324000 | -0.01 %| >> |MergeStoreBench.setLongLV |2793.125000 |2794.438000 | -0.05 %| >> |MergeStoreBench.setLongRB |14465.223000 |7015.050000 | 106.20 %| >> |MergeStoreBench.setLongRBU |14546.954000 |6173.210000 | 135.65 %| >> |MergeStoreBench.setLongRL |6816.145000 |6813.348000 | 0.04 %| >> |MergeStoreBench.setLongRLU |4289.445000 |4284.239000 | 0.12 %| >> |MergeStoreBench.setLongRU |3132.471000 |3133.093000 | 
-0.02 %| >> |MergeStoreBench.setLongU |3086.779000 |3087.298000 | -0.02 %| >> >> AMD EPYC 9T24 >> ... > > kuaiwei has updated the pull request incrementally with three additional commits since the last revision: > > - Allow ValueOrder::Reverse on big-endian platforms > - Revert "Merge more stores" > > This reverts commit 1e1113ed02ec5a9fe181f215d5667e8de487fe47. > - Revert "Fix test502aBE" > > This reverts commit f773fa368577c4f67957c4d40968c5c45e3ae205. @kuaiwei sorry for the long delay. Thanks for your patience, and more importantly your good work here! I have 2 nit-picks, but they are not very important. I'll run some internal testing now. src/hotspot/share/opto/memnode.cpp line 3020: > 3018: ValueOrder input_value_order = find_adjacent_input_value_order(n1, n2, memory_size); > 3019: > 3020: if (input_value_order == ValueOrder::NotAdjacent) { You check `input_value_order` against various cases here. Maybe a `switch` could be a good alternative. src/hotspot/share/opto/memnode.cpp line 3028: > 3026: !Matcher::match_rule_supported(Op_ReverseBytesI) || > 3027: !Matcher::match_rule_supported(Op_ReverseBytesL) || > 3028: !Matcher::match_rule_supported(Op_ReverseBytesS) Nit-pick: please order it from small to large: S -> I -> L Suggestion: !Matcher::match_rule_supported(Op_ReverseBytesS) || !Matcher::match_rule_supported(Op_ReverseBytesI) || !Matcher::match_rule_supported(Op_ReverseBytesL) test/hotspot/jtreg/compiler/c2/TestMergeStores.java line 727: > 725: applyIf = {"UseUnalignedAccesses", "true"}, > 726: applyIfPlatform = {"little-endian", "true"}) > 727: @IR(counts = {IRNode.STORE_B_OF_CLASS, "byte\\\\[int:>=0] \\\\(java/lang/Cloneable,java/io/Serializable\\\\)", "8", You may want to add the bug number here at the top of the file. 
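As a rough illustration of what "merge stores with reverse byte order" means at the Java level, here is a minimal sketch. It is not code from the patch, and the class and method names are made up for the example. Four single-byte stores that write an `int` most-significant byte first leave the same bytes in memory as one little-endian 4-byte store of `Integer.reverseBytes(v)`, which is presumably why the review above checks `Op_ReverseBytesS/I/L` support.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

class ReverseStoreSketch {
    // Four separate byte stores, most-significant byte first.
    static void storeBytesBigEndian(byte[] a, int off, int v) {
        a[off + 0] = (byte) (v >> 24);
        a[off + 1] = (byte) (v >> 16);
        a[off + 2] = (byte) (v >> 8);
        a[off + 3] = (byte) v;
    }

    // One 4-byte little-endian store of the byte-reversed value.
    // ByteBuffer is used here only to model a single wide store.
    static void storeMerged(byte[] a, int off, int v) {
        ByteBuffer.wrap(a)
                  .order(ByteOrder.LITTLE_ENDIAN)
                  .putInt(off, Integer.reverseBytes(v));
    }

    public static void main(String[] args) {
        byte[] x = new byte[8];
        byte[] y = new byte[8];
        storeBytesBigEndian(x, 2, 0x11223344);
        storeMerged(y, 2, 0x11223344);
        // Both arrays hold 11 22 33 44 at offsets 2..5.
        System.out.println(Arrays.equals(x, y)); // prints "true"
    }
}
```

On a big-endian platform the roles are swapped: the byte-reversed form corresponds to the least-significant-byte-first store sequence instead.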
------------- PR Review: https://git.openjdk.org/jdk/pull/23030#pullrequestreview-2643204863 PR Review Comment: https://git.openjdk.org/jdk/pull/23030#discussion_r1970974802 PR Review Comment: https://git.openjdk.org/jdk/pull/23030#discussion_r1970973927 PR Review Comment: https://git.openjdk.org/jdk/pull/23030#discussion_r1970984644 From epeter at openjdk.org Wed Feb 26 06:39:06 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 06:39:06 GMT Subject: RFR: 8347405: MergeStores with reverse bytes order value [v11] In-Reply-To: References: Message-ID: <9kE5I8C-E2qz54gq5_Ia4UdbLYCbjjd2xqDTXSCFg88=.fda5eacf-1285-4276-b983-b586fb4b8613@github.com> On Tue, 28 Jan 2025 03:09:28 GMT, kuaiwei wrote: >> This patch enhance MergeStores optimization to support merge value with reverse byte order. >> >> Below is benchmark result before and after the patch: >> >> On aliyun g8y (aarch64) >> |name | before | score2 | ratio | >> |---|---|---|---| >> |MergeStoreBench.setCharBS |5669.655000 |5669.566000 | 0.00 %| >> |MergeStoreBench.setCharBV |5516.911000 |5516.273000 | 0.01 %| >> |MergeStoreBench.setCharC |5578.644000 |5552.809000 | 0.47 %| >> |MergeStoreBench.setCharLS |5782.140000 |5779.264000 | 0.05 %| >> |MergeStoreBench.setCharLV |5496.403000 |5499.195000 | -0.05 %| >> |MergeStoreBench.setIntB |6087.703000 |2768.385000 | 119.90 %| >> |MergeStoreBench.setIntBU |6733.813000 |2950.240000 | 128.25 %| >> |MergeStoreBench.setIntBV |1362.233000 |1361.821000 | 0.03 %| >> |MergeStoreBench.setIntL |2834.785000 |2833.042000 | 0.06 %| >> |MergeStoreBench.setIntLU |2947.145000 |2946.874000 | 0.01 %| >> |MergeStoreBench.setIntLV |5506.791000 |5506.229000 | 0.01 %| >> |MergeStoreBench.setIntRB |7634.279000 |5611.058000 | 36.06 %| >> |MergeStoreBench.setIntRBU |7766.737000 |5551.281000 | 39.91 %| >> |MergeStoreBench.setIntRL |5689.793000 |5689.385000 | 0.01 %| >> |MergeStoreBench.setIntRLU |5628.287000 |5628.789000 | -0.01 %| >> |MergeStoreBench.setIntRU |5536.039000 
|5534.910000 | 0.02 %| >> |MergeStoreBench.setIntU |5595.363000 |5567.810000 | 0.49 %| >> |MergeStoreBench.setLongB |13722.671000 |6811.098000 | 101.48 %| >> |MergeStoreBench.setLongBU |13728.844000 |4280.240000 | 220.75 %| >> |MergeStoreBench.setLongBV |2785.255000 |2785.949000 | -0.02 %| >> |MergeStoreBench.setLongL |5714.615000 |5710.402000 | 0.07 %| >> |MergeStoreBench.setLongLU |4128.746000 |4129.324000 | -0.01 %| >> |MergeStoreBench.setLongLV |2793.125000 |2794.438000 | -0.05 %| >> |MergeStoreBench.setLongRB |14465.223000 |7015.050000 | 106.20 %| >> |MergeStoreBench.setLongRBU |14546.954000 |6173.210000 | 135.65 %| >> |MergeStoreBench.setLongRL |6816.145000 |6813.348000 | 0.04 %| >> |MergeStoreBench.setLongRLU |4289.445000 |4284.239000 | 0.12 %| >> |MergeStoreBench.setLongRU |3132.471000 |3133.093000 | -0.02 %| >> |MergeStoreBench.setLongU |3086.779000 |3087.298000 | -0.02 %| >> >> AMD EPYC 9T24 >> ... > > kuaiwei has updated the pull request incrementally with three additional commits since the last revision: > > - Allow ValueOrder::Reverse on big-endian platforms > - Revert "Merge more stores" > > This reverts commit 1e1113ed02ec5a9fe181f215d5667e8de487fe47. > - Revert "Fix test502aBE" > > This reverts commit f773fa368577c4f67957c4d40968c5c45e3ae205. To me the 3 nitpicks are optional, you don't have to apply them. Also I saw we already tested commit 16 / v10. I did launch testing again, just because there were a few weeks since the last time and sometimes bugs sneak via bad merges. Approved. But it would be good to have a second reviewer give an approval too. ------------- Marked as reviewed by epeter (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23030#pullrequestreview-2643245477 From epeter at openjdk.org Wed Feb 26 06:47:58 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 06:47:58 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v36] In-Reply-To: <6j6VGTWSXJdlOJOWkf80u2d9j27AsHETHhSY6685kOY=.a0db0f78-39a4-489a-ba42-813da344e026@github.com> References: <6j6VGTWSXJdlOJOWkf80u2d9j27AsHETHhSY6685kOY=.a0db0f78-39a4-489a-ba42-813da344e026@github.com> Message-ID: <8VSIkxy_hN1Tvfh8N_BEch7UIMSnRMWg6HfAdbu_rZY=.038276e8-6ff4-4512-8d73-04a29d225c6d@github.com> On Wed, 26 Feb 2025 02:51:12 GMT, Johannes Graham wrote: >> An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. >> >> This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it. >> >> In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items: >> - Bounds optimization of xor >> - A check for `x ^ x = 0` >> - Explicit testing of xor over booleans. >> >> Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization. It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types. 
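To make the "exhaustively tests ranges of 4-bit numbers" idea concrete, here is a small Java sketch. It is not the patch's gtest, and the bound formula and all names are assumptions chosen for illustration. For non-negative `a <= hiA` and `b <= hiB`, `a ^ b` cannot set a bit above the highest set bit of `hiA | hiB`, so one valid upper bound is "all ones up to that bit". The loop checks this for every pair of 4-bit ranges starting at zero.

```java
class XorBoundSketch {
    // Assumed bound for illustration: all ones up to the highest set bit
    // of (hiA | hiB). Only meaningful for non-negative hiA and hiB.
    static int xorUpperBound(int hiA, int hiB) {
        int bits = 32 - Integer.numberOfLeadingZeros(hiA | hiB);
        return (int) ((1L << bits) - 1);
    }

    // Returns true if a ^ b <= xorUpperBound(hiA, hiB) for every
    // 0 <= a <= hiA and 0 <= b <= hiB, with hiA and hiB below limit.
    static boolean holdsForAllRangesBelow(int limit) {
        for (int hiA = 0; hiA < limit; hiA++) {
            for (int hiB = 0; hiB < limit; hiB++) {
                int bound = xorUpperBound(hiA, hiB);
                for (int a = 0; a <= hiA; a++) {
                    for (int b = 0; b <= hiB; b++) {
                        if ((a ^ b) > bound) {
                            return false;
                        }
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(xorUpperBound(5, 2));        // prints "7"
        System.out.println(holdsForAllRangesBelow(16)); // prints "true"
    }
}
```

When both inputs are constants the range collapses to a single value, which is the case where folding to the exact result should take priority over any such bound.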
>> >> --------- >> ### Progress >> - [x] Change must not contain extraneous whitespace >> - [x] Commit message must refer to an issue >> - [ ] Change must be properly reviewed (2 reviews required, with at least 2 [Reviewers](https://openjdk.org/bylaws#reviewer)) >> >> >> >> ### Reviewers >> * [Quan Anh Mai](https://openjdk.org/census#qamai) (@merykitty - Committer) Re-review required (review applies to [cf779497](https://git.openjdk.org/jdk/pull/23089/files/cf77949776f7a4601268c7291a5743c2eb164186))
> > Johannes Graham has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 51 commits: > > - Merge branch 'openjdk:master' into xor_const > - widen range of test values; add missing comment > - a few more tests > - add comments > > Co-authored-by: Emanuel Peter > - update tests > - Fix formatting > > Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> > - Merge branch 'openjdk:master' into xor_const > - fix variable names in comments > - update test > - address review comments > - ... and 41 more: https://git.openjdk.org/jdk/compare/037e4711...6d60ae2a test/hotspot/jtreg/compiler/c2/irTests/XorINodeIdealizationTests.java line 62: > 60: > 61: int min = Integer.MIN_VALUE; > 62: int max = MAX_VALUE; Let's make it consistent, and either use `Integer.` everywhere or without it everywhere ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1971010016 From epeter at openjdk.org Wed Feb 26 06:53:56 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 06:53:56 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v33] In-Reply-To: References: Message-ID: <2lzeygtsSsYdvG3kVtwe70o3y6lbUDmJjNCHvg4mE80=.f15e0732-7ef2-48f7-9ae9-5ceae645a3fe@github.com> On Tue, 25 Feb 2025 16:22:11 GMT, Johannes Graham wrote: >> I started down the path of more elaborate randomization (see code removed in https://github.com/openjdk/jdk/pull/23089/commits/4a2912021103cefbe30fb3cc9e7d303b63ea454d). Instead of that approach. I went with doing the more detailed coverage with gtest, and having just a few specific checks in the IR tests. For example the "pow2" tests are there because that was a scenario that caused some trouble.
I think there is a relatively small set of "interesting" cases that are nice to cover with something deterministic. (I could add a few more tests with hard-coded interesting values). It feels disproportionate to do something more complicated for this PR. >> >> That said, if there was more tooling in place to make more broad random tests easier to write, it would be very attractive. Looking at PR https://github.com/openjdk/jdk/pull/23418, it looks like that's what you've got in mind. Specifically, these kinds of tests with constants are begging for a way to spew out a whole bunch of related test methods. > > Also, as you commented on somewhere above, having a way to target Types from a stand-alone gtest, would also be a really nice way of making things more testable. > For example the "pow2" tests are there because that was a scenario that caused some trouble. I think there is a relatively small set of "interesting" cases that are nice to cover with something deterministic. (I could add a few more tests with hard-coded interesting values). I know that there are only a few "interesting" cases. That's why I came up with `Generators` so we would be more likely to hit the interesting cases. Of course we want to have a good number of deterministic cases. But extending that with some randomized tests that would find bugs over a longer time span is still valuable. The issue is often that there are edge cases, and humans are not very good at finding them. Thus with randomization we would at least eventually find those cases - hopefully. And in my experience I did find bugs with randomized tests. Sometimes the tests found bugs that were not related to the issue they were originally created for - and that is valuable too. > It feels disproportionate to do something more complicated for this PR. It's really not that complicated.
All I'd be asking for is the test in the form that I provided above ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1971030846 From duke at openjdk.org Wed Feb 26 07:04:58 2025 From: duke at openjdk.org (Nicole Xu) Date: Wed, 26 Feb 2025 07:04:58 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException [v2] In-Reply-To: References: Message-ID: <3zzpTrqxv5KaBP-FKCAWjfffVonoWr9fKE6S8lO-cTY=.48f4cb20-f9e6-473f-8156-18d1694e7496@github.com> > Suite MaskedLogicOpts.maskedLogicOperationsLong512() failed on both x86 and AArch64 with the following error: > > > java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249 > > > The variable `long256_arr_idx` is misused when indexing 'LongVector l2, l3, l4, l5' in function `maskedLogicOperationsLongKernel()`. 'long256_arr_idx' increases by 4 every time the benchmark runs and ensures the incremented value remains within the bounds of the array. However, for `LongVector.SPECIES_512`, it loads 8 numbers from the array each time the benchmark runs, resulting in an out-of-range indexing issue. > > Hence, we revised the index variables from `long256_arr_idx` to `long512_arr_idx`, which has a stride of 8, to ensure that the loaded vector is inside of the array boundary for all vector species. This is also consistent with other kernel functions. > > Additionally, some defined but unused variables have been removed. 
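To illustrate the stride mismatch described above, here is a minimal plain-Java sketch that does not use the Vector API. The class `StrideIndexSketch`, its helper names, the array length of 256 and the wrap-around index update are all assumptions made for illustration; the benchmark's actual code may differ. `checkLoad` models a vector load as requiring the whole window of `lanes` elements to fit inside the array, which would be consistent with the reported message (length 249 = 256 - 8 + 1).

```java
import java.util.Objects;

class StrideIndexSketch {
    // Hypothetical array length; the benchmark's real ARRAYLEN is not shown in this thread.
    static final int ARRAYLEN = 256;

    // Models the bounds check of a vector load of `lanes` elements at `offset`:
    // the whole window [offset, offset + lanes) must fit inside the array.
    static void checkLoad(int offset, int lanes, int arrayLength) {
        Objects.checkIndex(offset, arrayLength - lanes + 1);
    }

    // Advances the index by `stride` and wraps it below ARRAYLEN
    // (assumed update rule; relies on ARRAYLEN being a power of two).
    static int nextIndex(int idx, int stride) {
        return (idx + stride) & (ARRAYLEN - 1);
    }

    // Walks the index through one full wrap and returns the first offset at which
    // a load of `lanes` elements is rejected, or -1 if every load is in bounds.
    static int firstFailingOffset(int stride, int lanes) {
        int idx = 0;
        for (int i = 0; i < ARRAYLEN; i++) {
            try {
                checkLoad(idx, lanes, ARRAYLEN);
            } catch (IndexOutOfBoundsException e) {
                return idx;
            }
            idx = nextIndex(idx, stride);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stride 4 with 4 lanes (256-bit long species): every load fits.
        System.out.println(firstFailingOffset(4, 4));
        // Stride 4 with 8 lanes (512-bit long species): the last window overruns.
        System.out.println(firstFailingOffset(4, 8));
        // Stride 8 with 8 lanes: the stride matches the lane count again.
        System.out.println(firstFailingOffset(8, 8));
    }
}
```

Under these assumptions, only the combination of a stride of 4 with 8 lanes is rejected, which is the shape of the failure reported for `LongVector.SPECIES_512`.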
Nicole Xu has updated the pull request incrementally with two additional commits since the last revision: - 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException Suite MaskedLogicOpts.maskedLogicOperationsLong512() failed on both x86 and AArch64 with the following error: ``` java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249 ``` The variable `long256_arr_idx` is misused when indexing `LongVector l2`, `l3`, `l4`, `l5` in function `maskedLogicOperationsLongKernel()` resulting in the IndexOutOfBoundsException error. On the other hand, the unified index for 128-bit, 256-bit and 512-bit species might not be proper since it leaves gaps in between when accessing the data for 128-bit and 256-bit species. This will unnecessarily include the noise due to cache misses or (on some targets) prefetching additional cache lines which are not usable, thereby impacting the crispness of microbenchmark. Hence, we improved the benchmark from several aspects, 1. Used sufficient number of predicated operations within the vector loop while minimizing the noise due to memory operations. 2. Modified the index computation logic which can now withstand any ARRAYLEN without resulting in an IOOBE. 3. Removed redundant vector read/writes to instance fields, thus eliminating significant boxing penalty which translates into throughput gains. Change-Id: Ie8a9d495b1ca5e36f1eae069ff70a815a2de00c0 - Revert "8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException" This reverts commit 083bedec04d5ab78a420e156e74c1257ce30aee8. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/22963/files - new: https://git.openjdk.org/jdk/pull/22963/files/083bedec..896c27ea Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22963&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22963&range=00-01 Stats: 147 lines in 1 file changed: 14 ins; 29 del; 104 mod Patch: https://git.openjdk.org/jdk/pull/22963.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22963/head:pull/22963 PR: https://git.openjdk.org/jdk/pull/22963 From duke at openjdk.org Wed Feb 26 07:14:55 2025 From: duke at openjdk.org (Nicole Xu) Date: Wed, 26 Feb 2025 07:14:55 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 09:31:45 GMT, Jatin Bhateja wrote: > > Sure. Since I am very new to openJDK, I asked my teammate for help to file the follow-up RFE. > > Here is the https://bugs.openjdk.org/browse/JDK-8350215 with description of the discussed issues. > > Hi @xyyNicole , > > I have modified the benchmark keeping its essence intact, i.e. to use sufficient number of predicated operations within the vector loop while minimizing the noise due to memory operations. Modified the index computation logic which can now withstand any ARRAYLEN without resulting in an IOOBE. Removed redundant vector read/writes to instance fields, thus eliminating significant boxing penalty which translates into throughput gains. > > Please feel free to include it along with this patch. [MaskedLogicOpts.txt](https://github.com/user-attachments/files/18884093/MaskedLogicOpts.txt) Hi @jatin-bhateja , Thank you for your contributions. I've incorporated your modifications into this patch and made some minor formatting adjustments.
------------- PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2684128574 From jbhateja at openjdk.org Wed Feb 26 07:14:54 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 26 Feb 2025 07:14:54 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException [v2] In-Reply-To: <3zzpTrqxv5KaBP-FKCAWjfffVonoWr9fKE6S8lO-cTY=.48f4cb20-f9e6-473f-8156-18d1694e7496@github.com> References: <3zzpTrqxv5KaBP-FKCAWjfffVonoWr9fKE6S8lO-cTY=.48f4cb20-f9e6-473f-8156-18d1694e7496@github.com> Message-ID: <-XxRbrNVtFkEbaYMAihzsj9yfcyFC5cyW6VIh2rG_aU=.42dd122e-1222-49ac-b94d-e996c5c531f4@github.com> On Wed, 26 Feb 2025 07:04:58 GMT, Nicole Xu wrote: >> Suite `MaskedLogicOpts.maskedLogicOperationsLong512()` failed on both x86 and AArch64 with the following error: >> >> >> java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249 >> >> >> The variable `long256_arr_idx` is misused when indexing `LongVector l2`, `l3`, `l4`, `l5` in function `maskedLogicOperationsLongKernel()` resulting in the IndexOutOfBoundsException error. On the other hand, the unified index for 128-bit, 256-bit and 512-bit species might not be proper since it leaves gaps in between when accessing the data for 128-bit and 256-bit species. This will unnecessarily include the noise due to cache misses or (on some targets) prefetching additional cache lines which are not usable, thereby impacting the crispness of microbenchmark. >> >> Hence, we improved the benchmark from several aspects, >> 1. Used sufficient number of predicated operations within the vector loop while minimizing the noise due to memory operations. >> 2. Modified the index computation logic which can now withstand any ARRAYLEN without resulting in an IOOBE. >> 3. Removed redundant vector read/writes to instance fields, thus eliminating significant boxing penalty which translates into throughput gains. 
> > Nicole Xu has updated the pull request incrementally with two additional commits since the last revision: > > - 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException > > Suite MaskedLogicOpts.maskedLogicOperationsLong512() failed on both x86 > and AArch64 with the following error: > > ``` > java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249 > ``` > > The variable `long256_arr_idx` is misused when indexing `LongVector l2`, > `l3`, `l4`, `l5` in function `maskedLogicOperationsLongKernel()` > resulting in the IndexOutOfBoundsException error. On the other hand, the > unified index for 128-bit, 256-bit and 512-bit species might not be > proper since it leaves gaps in between when accessing the data > for 128-bit and 256-bit species. This will unnecessarily include the > noise due to cache misses or (on some targets) prefetching additional > cache lines which are not usable, thereby impacting the crispness of > microbenchmark. > > Hence, we improved the benchmark from several aspects, > 1. Used sufficient number of predicated operations within the vector > loop while minimizing the noise due to memory operations. > 2. Modified the index computation logic which can now withstand any > ARRAYLEN without resulting in an IOOBE. > 3. Removed redundant vector read/writes to instance fields, thus > eliminating significant boxing penalty which translates into throughput > gains. > > Change-Id: Ie8a9d495b1ca5e36f1eae069ff70a815a2de00c0 > - Revert "8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException" > > This reverts commit 083bedec04d5ab78a420e156e74c1257ce30aee8. LGTM ------------- Marked as reviewed by jbhateja (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/22963#pullrequestreview-2643351624 From duke at openjdk.org Wed Feb 26 07:38:57 2025 From: duke at openjdk.org (Marc Chevalier) Date: Wed, 26 Feb 2025 07:38:57 GMT Subject: RFR: 8347426: Invalid value used for enum Cell in iTypeFlow::StateVector::meet_exception In-Reply-To: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com> References: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com> Message-ID: <3ScDNtCM55HHPvNcJtBinowPrsizN0ikJGbjZJIWMM8=.c0cf5cfe-a79f-4663-9c67-d6929a61469d@github.com> On Tue, 25 Feb 2025 10:11:54 GMT, Marc Chevalier wrote: > As guess on the JBS ticket, we have a UB when `_outer->max_locals() == 0`, because then we try to do `(Cell)(-1)` which is out of range since Cell's range is [0, `INT_MAX`]. > > The obvious fix that is > > Cell limit = local(_outer->max_locals()); > for (Cell c = start_cell(); c < limit; c = next_cell(c)) { > > since `local` asserts its argument to be in [0, `outer->max_locals()`). Of course > > Cell limit = (Cell)(_outer->max_locals()); > > would work, but it seems to break (the very light) abstraction. > > I've also added an assert to transform the UB into a clear failure. > > This fix makes the UB warning go away on Mac with arm64. > > Thanks, > Marc Those are good options. I don't have a strong opinion. Happy to change if anybody has. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23772#issuecomment-2684168582 From epeter at openjdk.org Wed Feb 26 08:04:53 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 08:04:53 GMT Subject: RFR: 8350614: [JMH] jdk.incubator.vector.VectorCommutativeOperSharingBenchmark fails In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 02:24:33 GMT, SendaoYan wrote: > Hi all, > > The newly added JMH test jdk.incubator.vector.VectorCommutativeOperSharingBenchmark fails with "java.lang.NoClassDefFoundError: jdk/incubator/vector/Vector". > > The `@Fork(jvmArgsPrepend = ..)` in microbenchmarks should be replaced with `@Fork(jvmArgs = ..)` after [JDK-8343345](https://bugs.openjdk.org/browse/JDK-8343345). Change has been verified locally, test-fix only, no risk. @cl4es made the change in [JDK-8343345](https://bugs.openjdk.org/browse/JDK-8343345), so he should probably have a quick look at this. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23761#issuecomment-2684216145 From adinn at openjdk.org Wed Feb 26 08:36:27 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 26 Feb 2025 08:36:27 GMT Subject: RFR: 8349921: Crash in codeBuffer.cpp:1004: guarantee(sect->end() <= tend) failed: sanity [v2] In-Reply-To: References: Message-ID: > The compiler blob base size needs increasing in case the JDK is built without ZGC. The increment when ZGC is used can be comparably decreased. The final blob size increment when ZGC is included is over generous and can also be decreased.
Andrew Dinn has updated the pull request incrementally with two additional commits since the last revision: - another format fix - format fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23776/files - new: https://git.openjdk.org/jdk/pull/23776/files/7cae054c..159d5081 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23776&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23776&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23776.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23776/head:pull/23776 PR: https://git.openjdk.org/jdk/pull/23776 From adinn at openjdk.org Wed Feb 26 08:36:27 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 26 Feb 2025 08:36:27 GMT Subject: RFR: 8349921: Crash in codeBuffer.cpp:1004: guarantee(sect->end() <= tend) failed: sanity [v2] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 18:50:30 GMT, Vladimir Kozlov wrote: >> Andrew Dinn has updated the pull request incrementally with two additional commits since the last revision: >> >> - another format fix >> - format fix > > Good. 
@vnkozlov Thanks for the review > src/hotspot/cpu/aarch64/stubDeclarations_aarch64.hpp line 47: > >> 45: do_arch_entry, \ >> 46: do_arch_entry_init) \ >> 47: do_arch_blob(compiler, 35000 ZGC_ONLY(+5000)) \ > > Alignment of `` Fixed here and in final blob declaration ------------- PR Comment: https://git.openjdk.org/jdk/pull/23776#issuecomment-2684277391 PR Review Comment: https://git.openjdk.org/jdk/pull/23776#discussion_r1971150389 From rcastanedalo at openjdk.org Wed Feb 26 08:57:54 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 26 Feb 2025 08:57:54 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v4] In-Reply-To: References: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> Message-ID: On Tue, 25 Feb 2025 19:11:15 GMT, Daniel Lundén wrote: >> When searching for load anti-dependences in GCM, the memory state for the load is sometimes represented not only by the memory node input of the load, but also other memory nodes. Because PhaseCFG::insert_anti_dependences searches for anti-dependences only from the load's memory input, it is, therefore, possible to sometimes overlook anti-dependences. The result is that loads are potentially scheduled too late, after stores that redefine the memory states of the loads. >> >> ### Changeset >> >> It is not yet clear why multiple nodes sometimes represent the memory state of a load, nor if this is expected. We can, however, resolve all the miscompiled test cases seen in this issue by improving the idealization of Phi nodes. Specifically, there is an idealization where we split Phis through input MergeMems, that we, prior to this changeset, applied too conservatively. >> >> To illustrate the idealization and how it resolves this issue, consider the example below.
>> >> ![failure-graph-1](https://github.com/user-attachments/assets/ecbd204f-bdf0-49cb-a62e-8081d08cfe0c) >> >> `64 membar_release` is a critical anti-dependence for `183 loadI`. The anti-dependence search starts at the load's direct memory input, `107 Phi`, and stops immediately at Phis. Therefore, the search ends at `106 Phi` and we never find `64 membar_release`. >> >> We can apply the split-through-MergeMem Phi idealization to `119 Phi`. This idealization pushes `119 Phi` through `120 MergeMem` and `121 MergeMem`, splitting it into the individual inputs of the MergeMems in the process. As a result, we replace `119 Phi` with two new Phis. One of these generated Phis has identical inputs to `107 Phi` (`106 Phi` and `104 Phi`), and further idealizations will merge this new Phi and `107 Phi`. As a result, `107 Phi` then has a Phi-free path to `64 membar_release` and we correctly discover the anti-dependence. >> >> The changeset consists of the following changes. >> - Add an analysis that allows applying the split-through-MergeMem idealization in more cases than before (including in the above example) while still ensuring termination. >> - Add a missing `ResourceMark` in `PhiNode::split_out_instance`. >> - Add multiple new regression tests in `TestGCMLoadPlacement.java`. >> >> For reference, [here](https://github.com/openjdk/jdk/pull/22852) is a previous PR with an alternative fix that we decided to discard in favor of the fix in this PR. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/ac... > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Fix subtle bug introduced in previous update Looks good, thanks for addressing my suggestions! ------------- Marked as reviewed by rcastanedalo (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23691#pullrequestreview-2643649439 From roland at openjdk.org Wed Feb 26 09:16:03 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 26 Feb 2025 09:16:03 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: Message-ID: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> On Tue, 25 Feb 2025 09:27:13 GMT, Emanuel Peter wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! 
>> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 66 commits: > > - Merge branch 'master' into JDK-8323582-SW-native-alignment > - stall -> delay, plus some more comments > - adjust selector if probability > - Merge branch 'master' into JDK-8323582-SW-native-alignment > - remove multiversion mark if we break the structure > - register opaque with igvn > - copyright and rm CFG check > - IR rules for all cases > - 3 test versions > - test changed to unaligned ints > - ... 
and 56 more: https://git.openjdk.org/jdk/compare/d551daca...8eb52292 Would it be possible and make sense to remove useless slow path loops the way it's done for predicates or zero trip guards? In `PhaseIdealLoop::build_loop_late_post_work()`, collect all `OpaqueMultiversioningNode` in a list. Then iterate over all loops the way it's done in `PhaseIdealLoop::eliminate_useless_zero_trip_guard()`, find loops marked as multi version, check we can get from the loop to the `OpaqueMultiversioningNode` and mark that one as useful. Eliminate all `OpaqueMultiversioningNode` not marked as useful. That way if some transformation such as peeling makes the loop non multi version or if the expected shape breaks for some reason, the slow loop is eliminated on next loop opts pass. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2684365921 From adinn at openjdk.org Wed Feb 26 09:27:53 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 26 Feb 2025 09:27:53 GMT Subject: RFR: 8349921: Crash in codeBuffer.cpp:1004: guarantee(sect->end() <= tend) failed: sanity [v2] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 18:50:30 GMT, Vladimir Kozlov wrote: >> Andrew Dinn has updated the pull request incrementally with two additional commits since the last revision: >> >> - another format fix >> - format fix > > Good. @vnkozlov Andrew Leonard has confirmed that the build problem originally encountered on the Adoptium build system is resolved by this patch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23776#issuecomment-2684396319 From mbaesken at openjdk.org Wed Feb 26 09:30:08 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Wed, 26 Feb 2025 09:30:08 GMT Subject: RFR: 8350683: Non-C2 / minimal JVM crashes in the build on ppc64 platforms Message-ID: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> When building a JVM without C2 (e.g. 
minimal) on ppc64 platforms , it crashes in the build because of unwanted dependencies to C2. AIX crash is (linux ppc64le crash is similar) : # Internal Error (compiledIC_ppc.cpp:141), pid=17695018, tid=258 # Error: ShouldNotReachHere() # iar: 0x0900000008800c60 libjvm.so::AixNativeCallstack::print_callstack_for_context(outputStream*, ucontext_t const*, bool, char*, unsigned long)+0x4bc (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:18 fixedparms:5 parmsonstk:1) lr: 0x09000000087ea9b8 libjvm.so::fdStream::write(char const*, unsigned long)+0x44 (C++ uses_alloca saves_lr stores_bc gpr_saved:4 fixedparms:3 parmsonstk:1) sp: 0x000000011023aab0 (base - 0x2DD8) rtoc: 0x08001000a0088ff0 |---stackaddr----| |----lrsave------|: 0x000000011023aea0 - 0x0900000008800730 libjvm.so::os::Aix::platform_print_native_stack(outputStream*, void const*, char*, int, unsigned char*&)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:5 parmsonstk:1) 0x000000011023af20 - 0x0900000008800644 libjvm.so::NativeStackPrinter::print_stack(outputStream*, char*, int, unsigned char*&, bool, int)+0x60 (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:6 fixedparms:7 parmsonstk:1) 0x000000011023afc0 - 0x09000000087f6ff8 libjvm.so::VMError::report(outputStream*, bool)+0x11f0 (C++ fp_present uses_alloca saves_cr saves_lr stores_bc gpr_saved:13 fixedparms:2 parmsonstk:1) 0x000000011023b830 - 0x09000000087e9fdc libjvm.so::VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x6a0 (C++ uses_alloca saves_lr stores_bc gpr_saved:18 fixedparms:8 parmsonstk:1) 0x000000011023ba10 - 0x09000000087e96a0 libjvm.so::report_vm_error(char const*, int, char const*, char const*, ...)+0xa0 (C++ uses_alloca saves_lr stores_bc gpr_saved:5 fixedparms:4 parmsonstk:1) 0x000000011023bad0 - 0x09000000087e95cc libjvm.so::report_vm_error(char const*, int, char const*)+0x24 (C++ uses_alloca saves_lr stores_bc 
gpr_saved:1 fixedparms:3 parmsonstk:1) 0x000000011023bb50 - 0x09000000087e956c libjvm.so::report_should_not_reach_here(char const*, int)+0x20 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) 0x000000011023bbd0 - 0x0900000008906e5c libjvm.so::CompiledDirectCall::emit_to_interp_stub(MacroAssembler*, unsigned char*)+0x28 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) 0x000000011023bc50 - 0x0900000008902848 libjvm.so::SharedRuntime::generate_native_wrapper(MacroAssembler*, methodHandle const&, int, BasicType*, VMRegPair*, BasicType)+0x380 (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:18 fixedparms:6 parmsonstk:1) 0x000000011023c000 - 0x0900000008901f7c libjvm.so::AdapterHandlerLibrary::create_native_wrapper(methodHandle const&)+0x464 (C++ fp_present uses_alloca saves_cr saves_lr stores_bc gpr_saved:18 fixedparms:1 parmsonstk:1) 0x000000011023c5d0 - 0x09000000088f0d78 libjvm.so::Method::link_method(methodHandle const&, JavaThread*)+0x17c (C++ uses_alloca saves_lr stores_bc gpr_saved:5 fixedparms:3 parmsonstk:1) 0x000000011023c670 - 0x09000000088f0ab8 libjvm.so::InstanceKlass::link_methods(JavaThread*)+0xcc (C++ uses_alloca saves_lr stores_bc gpr_saved:10 fixedparms:2 parmsonstk:1) 0x000000011023c760 - 0x09000000088d5e48 libjvm.so::InstanceKlass::link_class_impl(JavaThread*)+0x38c (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:10 fixedparms:2 parmsonstk:1) 0x000000011023c8f0 - 0x09000000088d4de8 libjvm.so::InstanceKlass::initialize_impl(JavaThread*)+0x94 (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:14 fixedparms:2 parmsonstk:1) 0x000000011023cae0 - 0x09000000088d4cf4 libjvm.so::InstanceKlass::initialize(JavaThread*)+0x60 (C++ uses_alloca saves_lr stores_bc gpr_saved:3 fixedparms:2 parmsonstk:1) 0x000000011023cb70 - 0x0900000008c40ea0 libjvm.so::initialize_class(Symbol*, JavaThread*)+0x64 (C++ uses_alloca saves_lr stores_bc gpr_saved:2 fixedparms:2 parmsonstk:1) 0x000000011023cc00 - 
0x0900000008c35384 libjvm.so::Threads::create_vm(JavaVMInitArgs*, bool*)+0x6ec (C++ uses_alloca saves_lr stores_bc gpr_saved:8 fixedparms:2 parmsonstk:1) 0x000000011023d540 - 0x0900000008c5a91c libjvm.so::JNI_CreateJavaVM+0xa4 (C++ uses_alloca saves_lr stores_bc gpr_saved:8 fixedparms:3 parmsonstk:1) 0x000000011023d640 - 0x00000001000100f4 javac::JavaMain+0x148 (C++ saves_lr stores_bc gpr_saved:11 fixedparms:1 parmsonstk:1) 0x000000011023d730 - 0x000000010000ff74 javac::ThreadJavaMain+0x10 (C++ saves_lr stores_bc fixedparms:1 parmsonstk:1) 0x000000011023d7a0 - 0x090000000056204c libpthreads.a::_pthread_body+0xec (C saves_lr stores_bc gpr_saved:1 fixedparms:1 ) 0x000000011023d820 - 0x0000000000000000 ------------- Commit messages: - JDK-8350683 Changes: https://git.openjdk.org/jdk/pull/23794/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23794&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350683 Stats: 11 lines in 1 file changed: 0 ins; 8 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23794.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23794/head:pull/23794 PR: https://git.openjdk.org/jdk/pull/23794 From mli at openjdk.org Wed Feb 26 09:58:59 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 26 Feb 2025 09:58:59 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v6] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 00:27:23 GMT, Fei Yang wrote: >> Previously, we have a `TAIL` label at [1], seems when it get here, it could also exceed the boundary of the string? >> But seems to me exceeding the boundary should be fine, as we can not exceed the page boundary in this way. >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1550 > >> Previously, we have a `TAIL` label at [1], seems when it get here, it could also exceed the boundary of the string? > > No, it won't. The sub instruction at [1][2][3] will ensure that. Let me know if you have a case that will. 
> > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1478 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1492 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1506 > >> But seems to me exceeding the boundary should be fine, as we can not exceed the page boundary in this way. [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1550 > > But the loaded 64-bit values are compared later at L1435. So I think it will make a difference here. Consider the case where the code reaches [bltz(cnt2, NEXT_WORD);](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1549) and falls through to the next line, [bind(TAIL);](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1550): the remaining chars/bytes in the string could be 1~7, but we will load 8 or 16 bytes anyway. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1971285704 From epeter at openjdk.org Wed Feb 26 10:02:09 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 10:02:09 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> References: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> Message-ID: On Wed, 26 Feb 2025 09:12:46 GMT, Roland Westrelin wrote: > Would it be possible and make sense to remove useless slow path loops the way it's done for predicates or zero trip guards? In `PhaseIdealLoop::build_loop_late_post_work()`, collect all `OpaqueMultiversioningNode` in a list.
Then iterate over all loops the way it's done in `PhaseIdealLoop::eliminate_useless_zero_trip_guard()`, find loops marked as multi version, check we can get from the loop to the `OpaqueMultiversioningNode` and mark that one as useful. Eliminate all `OpaqueMultiversioningNode` not marked as useful. That way if some transformation such as peeling makes the loop non multi version or if the expected shape breaks for some reason, the slow loop is eliminated on next loop opts pass. I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. @rwestrel What do you think? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2684482233 From roland at openjdk.org Wed Feb 26 10:18:12 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 26 Feb 2025 10:18:12 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> Message-ID: On Wed, 26 Feb 2025 09:59:36 GMT, Emanuel Peter wrote: > I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? Ok > And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? 
What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if > > I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. I don't think that's a problem. When that code runs the graph is in a stable shape. There's no dead condition that needs to go through igvn to be cleaned up. We've just run igvn and haven't made any change to the graph yet. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2684523673 From epeter at openjdk.org Wed Feb 26 10:30:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 10:30:15 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> Message-ID: On Wed, 26 Feb 2025 10:15:48 GMT, Roland Westrelin wrote: > > And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if > > I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. > > I don't think that's a problem. When that code runs the graph is in a stable shape. 
There's no dead condition that needs to go through igvn to be cleaned up. We've just run igvn and haven't made any change to the graph yet. Ah ok, I'll have to look into it myself then. But if we know that it happens at the beginning of a loop-opts phase just after igvn, and no predicates were hacked yet, then that should work fine. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2684550571 From epeter at openjdk.org Wed Feb 26 10:36:06 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 10:36:06 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> Message-ID: On Wed, 26 Feb 2025 10:15:48 GMT, Roland Westrelin wrote: >>> Would it be possible and make sense to remove useless slow path loops the way it's done for predicates or zero trip guards? In `PhaseIdealLoop::build_loop_late_post_work()`, collect all `OpaqueMultiversioningNode` in a list. Then iterate over all loops the way it's done in `PhaseIdealLoop::eliminate_useless_zero_trip_guard()`, find loops marked as multi version, check we can get from the loop to the `OpaqueMultiversioningNode` and mark that one as useful. Eliminate all `OpaqueMultiversioningNode` not marked as useful. That way if some transformation such as peeling makes the loop non multi version or if the expected shape breaks for some reason, the slow loop is eliminated on next loop opts pass. >> >> I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? >> >> I don't see it as super critical personally, as the slow_path is `delayed`, so no loop-opts are performed on it. The overhead is minimal if we keep it until after loop-opts, I think. But I'm not against trying. 
It would take a bit of effort to construct test cases where we have the loop fold away after multiversion_if is added, but that is probably possible. >> >> And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: >> [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if >> >> I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. >> >> @rwestrel What do you think? > >> I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? > > Ok > >> And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if >> >> I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. > > I don't think that's a problem. When that code runs the graph is in a stable shape. There's no dead condition that needs to go through igvn to be cleaned up. We've just run igvn and haven't made any change to the graph yet. 
@rwestrel I filed this follow-up RFE: [JDK-8350756](https://bugs.openjdk.org/browse/JDK-8350756): C2 SuperWord Multiversioning: remove useless slow loop when the fast loop disappears We'll have to be careful to only fold the `slow_loop` away if it is not used, i.e. if we did not in the meantime use the `multiversion_if`, and maybe the `fast_loop` structure is only disintegrating because of some speculative assumption, maybe because of more unrolling that only happens with vectorization. It would be good to have a test-case for that. I'm writing that here so I will remember it later ;) @rwestrel Do you have any other ideas / suggestions? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2684567780 From duke at openjdk.org Wed Feb 26 10:54:00 2025 From: duke at openjdk.org (Nicole Xu) Date: Wed, 26 Feb 2025 10:54:00 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException [v2] In-Reply-To: References: <52HO_iL9asn1huCdJj82R1AwF1w8ON9HZetrdc9rQyQ=.28e137e0-a7f7-4839-a3e7-eda4f8a6c4f5@github.com> Message-ID: On Tue, 25 Feb 2025 00:00:23 GMT, Paul Sandoz wrote: >> Thanks for pointing that out. Typically, ARRAYLEN is almost always a power-of-two (POT) value, which is also assumed by many other benchmarks. Are we realistically going to test with an ARRAYLEN of 30? >> >> I think the POT assumption is reasonable for our purposes. > > It's a reasonable assumption. Since `ARRAYLEN` is a parameter of the benchmark we should enforce that constraint in the benchmark initialization method, checking if the value is POT and failing otherwise. Hi, @PaulSandoz, thanks for your suggestions. In @jatin-bhateja's latest updates, non-POT `ARRAYLEN` values are now supported by processing only the elements up to the nearest multiple of the vector length. In that case, we can use a wider range of `ARRAYLEN` values while focusing on the main loop for vector functionality.
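To illustrate the approach described above, here is a minimal sketch in plain Java of clamping the loop to a multiple of the vector length. It is not the benchmark's actual code: the class `LoopBoundSketch`, the constant `LANES`, and both helper methods are hypothetical names chosen for illustration, and the inner scalar loop merely stands in for one vector operation.

```java
import java.util.Arrays;

class LoopBoundSketch {
    // Hypothetical lane count; a real benchmark would take this from its vector species.
    static final int LANES = 8;

    // Largest multiple of `lanes` that does not exceed `length` (works for any lanes > 0).
    static int loopBound(int length, int lanes) {
        return length - (length % lanes);
    }

    // Processes only full blocks of LANES elements; the tail is deliberately left untouched.
    // Returns the number of elements processed.
    static int andFullBlocks(int[] a, int[] b, int[] r) {
        int bound = loopBound(a.length, LANES);
        for (int i = 0; i < bound; i += LANES) {
            // Stand-in for a single vector AND over one block.
            for (int j = 0; j < LANES; j++) {
                r[i + j] = a[i + j] & b[i + j];
            }
        }
        return bound;
    }

    public static void main(String[] args) {
        int[] a = new int[30];
        int[] b = new int[30];
        int[] r = new int[30];
        Arrays.fill(a, 0b1100);
        Arrays.fill(b, 0b1010);
        int processed = andFullBlocks(a, b, r);
        System.out.println("processed " + processed + " of " + a.length + " elements");
    }
}
```

With a length of 30 and 8 lanes the loop covers indices 0 to 23 and never touches the last six elements, so no access can run past the end of the array.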
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22963#discussion_r1971372152 From duke at openjdk.org Wed Feb 26 10:58:32 2025 From: duke at openjdk.org (kuaiwei) Date: Wed, 26 Feb 2025 10:58:32 GMT Subject: RFR: 8347405: MergeStores with reverse bytes order value [v12] In-Reply-To: References: Message-ID: > This patch enhance MergeStores optimization to support merge value with reverse byte order. > > Below is benchmark result before and after the patch: > > On aliyun g8y (aarch64) > |name | before | score2 | ratio | > |---|---|---|---| > |MergeStoreBench.setCharBS |5669.655000 |5669.566000 | 0.00 %| > |MergeStoreBench.setCharBV |5516.911000 |5516.273000 | 0.01 %| > |MergeStoreBench.setCharC |5578.644000 |5552.809000 | 0.47 %| > |MergeStoreBench.setCharLS |5782.140000 |5779.264000 | 0.05 %| > |MergeStoreBench.setCharLV |5496.403000 |5499.195000 | -0.05 %| > |MergeStoreBench.setIntB |6087.703000 |2768.385000 | 119.90 %| > |MergeStoreBench.setIntBU |6733.813000 |2950.240000 | 128.25 %| > |MergeStoreBench.setIntBV |1362.233000 |1361.821000 | 0.03 %| > |MergeStoreBench.setIntL |2834.785000 |2833.042000 | 0.06 %| > |MergeStoreBench.setIntLU |2947.145000 |2946.874000 | 0.01 %| > |MergeStoreBench.setIntLV |5506.791000 |5506.229000 | 0.01 %| > |MergeStoreBench.setIntRB |7634.279000 |5611.058000 | 36.06 %| > |MergeStoreBench.setIntRBU |7766.737000 |5551.281000 | 39.91 %| > |MergeStoreBench.setIntRL |5689.793000 |5689.385000 | 0.01 %| > |MergeStoreBench.setIntRLU |5628.287000 |5628.789000 | -0.01 %| > |MergeStoreBench.setIntRU |5536.039000 |5534.910000 | 0.02 %| > |MergeStoreBench.setIntU |5595.363000 |5567.810000 | 0.49 %| > |MergeStoreBench.setLongB |13722.671000 |6811.098000 | 101.48 %| > |MergeStoreBench.setLongBU |13728.844000 |4280.240000 | 220.75 %| > |MergeStoreBench.setLongBV |2785.255000 |2785.949000 | -0.02 %| > |MergeStoreBench.setLongL |5714.615000 |5710.402000 | 0.07 %| > |MergeStoreBench.setLongLU |4128.746000 |4129.324000 | -0.01 %| > 
|MergeStoreBench.setLongLV |2793.125000 |2794.438000 | -0.05 %| > |MergeStoreBench.setLongRB |14465.223000 |7015.050000 | 106.20 %| > |MergeStoreBench.setLongRBU |14546.954000 |6173.210000 | 135.65 %| > |MergeStoreBench.setLongRL |6816.145000 |6813.348000 | 0.04 %| > |MergeStoreBench.setLongRLU |4289.445000 |4284.239000 | 0.12 %| > |MergeStoreBench.setLongRU |3132.471000 |3133.093000 | -0.02 %| > |MergeStoreBench.setLongU |3086.779000 |3087.298000 | -0.02 %| > > AMD EPYC 9T24 > |name | before | after | ratio | > |---|---|---|---| > |MergeStoreBench.setChar... kuaiwei has updated the pull request incrementally with one additional commit since the last revision: Fix for review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23030/files - new: https://git.openjdk.org/jdk/pull/23030/files/d910d0f6..d4a70196 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23030&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23030&range=10-11 Stats: 27 lines in 2 files changed: 5 ins; 2 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/23030.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23030/head:pull/23030 PR: https://git.openjdk.org/jdk/pull/23030 From duke at openjdk.org Wed Feb 26 10:58:32 2025 From: duke at openjdk.org (kuaiwei) Date: Wed, 26 Feb 2025 10:58:32 GMT Subject: RFR: 8347405: MergeStores with reverse bytes order value [v11] In-Reply-To: <9kE5I8C-E2qz54gq5_Ia4UdbLYCbjjd2xqDTXSCFg88=.fda5eacf-1285-4276-b983-b586fb4b8613@github.com> References: <9kE5I8C-E2qz54gq5_Ia4UdbLYCbjjd2xqDTXSCFg88=.fda5eacf-1285-4276-b983-b586fb4b8613@github.com> Message-ID: On Wed, 26 Feb 2025 06:36:30 GMT, Emanuel Peter wrote: > To me the 3 nitpicks are optional, you don't have to apply them. Also I saw we already tested commit 16 / v10. I did launch testing again, just because there were a few weeks since the last time and sometimes bugs sneak via bad merges. > > Approved. 
But it would be good to have a second reviewer give an approval too. > > I'm excited to see what you do with this one ? [JDK-8345485](https://bugs.openjdk.org/browse/JDK-8345485) C2 MergeLoads: merge adjacent array/native memory loads into larger load Thanks for your approval. I've fixed as comments. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23030#issuecomment-2684606802 From epeter at openjdk.org Wed Feb 26 10:58:32 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 10:58:32 GMT Subject: RFR: 8347405: MergeStores with reverse bytes order value [v12] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 10:55:25 GMT, kuaiwei wrote: >> This patch enhance MergeStores optimization to support merge value with reverse byte order. >> >> Below is benchmark result before and after the patch: >> >> On aliyun g8y (aarch64) >> |name | before | score2 | ratio | >> |---|---|---|---| >> |MergeStoreBench.setCharBS |5669.655000 |5669.566000 | 0.00 %| >> |MergeStoreBench.setCharBV |5516.911000 |5516.273000 | 0.01 %| >> |MergeStoreBench.setCharC |5578.644000 |5552.809000 | 0.47 %| >> |MergeStoreBench.setCharLS |5782.140000 |5779.264000 | 0.05 %| >> |MergeStoreBench.setCharLV |5496.403000 |5499.195000 | -0.05 %| >> |MergeStoreBench.setIntB |6087.703000 |2768.385000 | 119.90 %| >> |MergeStoreBench.setIntBU |6733.813000 |2950.240000 | 128.25 %| >> |MergeStoreBench.setIntBV |1362.233000 |1361.821000 | 0.03 %| >> |MergeStoreBench.setIntL |2834.785000 |2833.042000 | 0.06 %| >> |MergeStoreBench.setIntLU |2947.145000 |2946.874000 | 0.01 %| >> |MergeStoreBench.setIntLV |5506.791000 |5506.229000 | 0.01 %| >> |MergeStoreBench.setIntRB |7634.279000 |5611.058000 | 36.06 %| >> |MergeStoreBench.setIntRBU |7766.737000 |5551.281000 | 39.91 %| >> |MergeStoreBench.setIntRL |5689.793000 |5689.385000 | 0.01 %| >> |MergeStoreBench.setIntRLU |5628.287000 |5628.789000 | -0.01 %| >> |MergeStoreBench.setIntRU |5536.039000 |5534.910000 | 0.02 %| >> 
|MergeStoreBench.setIntU |5595.363000 |5567.810000 | 0.49 %| >> |MergeStoreBench.setLongB |13722.671000 |6811.098000 | 101.48 %| >> |MergeStoreBench.setLongBU |13728.844000 |4280.240000 | 220.75 %| >> |MergeStoreBench.setLongBV |2785.255000 |2785.949000 | -0.02 %| >> |MergeStoreBench.setLongL |5714.615000 |5710.402000 | 0.07 %| >> |MergeStoreBench.setLongLU |4128.746000 |4129.324000 | -0.01 %| >> |MergeStoreBench.setLongLV |2793.125000 |2794.438000 | -0.05 %| >> |MergeStoreBench.setLongRB |14465.223000 |7015.050000 | 106.20 %| >> |MergeStoreBench.setLongRBU |14546.954000 |6173.210000 | 135.65 %| >> |MergeStoreBench.setLongRL |6816.145000 |6813.348000 | 0.04 %| >> |MergeStoreBench.setLongRLU |4289.445000 |4284.239000 | 0.12 %| >> |MergeStoreBench.setLongRU |3132.471000 |3133.093000 | -0.02 %| >> |MergeStoreBench.setLongU |3086.779000 |3087.298000 | -0.02 %| >> >> AMD EPYC 9T24 >> ... > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > Fix for review comments Changes requested by epeter (Reviewer). src/hotspot/share/opto/memnode.cpp line 3030: > 3028: // ReverseBytes are not supported by platform > 3029: return false; > 3030: } Suggestion: } // fall-through. Just to make it more explicit to the reader ;) ------------- PR Review: https://git.openjdk.org/jdk/pull/23030#pullrequestreview-2644038791 PR Review Comment: https://git.openjdk.org/jdk/pull/23030#discussion_r1971374887 From epeter at openjdk.org Wed Feb 26 10:58:32 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 10:58:32 GMT Subject: RFR: 8347405: MergeStores with reverse bytes order value [v11] In-Reply-To: References: <9kE5I8C-E2qz54gq5_Ia4UdbLYCbjjd2xqDTXSCFg88=.fda5eacf-1285-4276-b983-b586fb4b8613@github.com> Message-ID: On Wed, 26 Feb 2025 10:50:23 GMT, kuaiwei wrote: >> To me the 3 nitpicks are optional, you don't have to apply them. >> Also I saw we already tested commit 16 / v10. 
I did launch testing again, just because there were a few weeks since the last time and sometimes bugs sneak via bad merges. >> >> Approved. But it would be good to have a second reviewer give an approval too. >> >> I'm excited to see what you do with this one ? >> [JDK-8345485](https://bugs.openjdk.org/browse/JDK-8345485) C2 MergeLoads: merge adjacent array/native memory loads into larger load > >> To me the 3 nitpicks are optional, you don't have to apply them. Also I saw we already tested commit 16 / v10. I did launch testing again, just because there were a few weeks since the last time and sometimes bugs sneak via bad merges. >> >> Approved. But it would be good to have a second reviewer give an approval too. >> >> I'm excited to see what you do with this one ? [JDK-8345485](https://bugs.openjdk.org/browse/JDK-8345485) C2 MergeLoads: merge adjacent array/native memory loads into larger load > > Thanks for your approval. I've fixed as comments. @kuaiwei Would you mind merging with master? Then we can run testing with the most up-to-date version and have a lower risk of bad merges on integration ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23030#issuecomment-2684611208 From epeter at openjdk.org Wed Feb 26 11:18:08 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 11:18:08 GMT Subject: RFR: 8315066: Add unsigned bounds and known bits to TypeInt/Long [v44] In-Reply-To: References: Message-ID: <_sz68hEt9TeSdTUMkhFPSagjM5MuVVa1RK3pvrkjJmA=.abb9a9d0-e4fc-43b4-9d7f-e683b06441e5@github.com> On Fri, 14 Feb 2025 12:59:13 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch adds unsigned bounds and known bits constraints to TypeInt and TypeLong. This opens more transformation opportunities in an elegant manner as well as helps avoid some ad-hoc rules in Hotspot. >> >> In general, a `TypeInt/Long` represents a set of values `x` that satisfies: `x s>= lo && x s<= hi && x u>= ulo && x u<= uhi && (x & zeros) == 0 && (x & ones) == ones`. 
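The constraint set above can be written out as a membership predicate. The following is only a sketch for 32-bit values, assuming `s>=`/`s<=` mean signed and `u>=`/`u<=` mean unsigned comparison; the class and method names are hypothetical and do not correspond to the HotSpot implementation.

```java
class IntConstraintSketch {
    // True if x satisfies the signed bounds [lo, hi], the unsigned bounds [ulo, uhi],
    // has no bit set that `zeros` marks as known-zero, and has every bit set that
    // `ones` marks as known-one.
    static boolean contains(int x, int lo, int hi, int ulo, int uhi, int zeros, int ones) {
        return x >= lo && x <= hi
            && Integer.compareUnsigned(x, ulo) >= 0
            && Integer.compareUnsigned(x, uhi) <= 0
            && (x & zeros) == 0
            && (x & ones) == ones;
    }

    public static void main(String[] args) {
        // Signed and unsigned range [0, 3], with all bits above the low two known to be zero.
        System.out.println(contains(2, 0, 3, 0, 3, ~3, 0));
        System.out.println(contains(4, 0, 3, 0, 3, ~3, 0));
    }
}
```

A value such as -1 passes a signed bound of [-1, 3] but fails an unsigned bound of [0, 3], because its unsigned interpretation is the maximum 32-bit value.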
These constraints are not independent, e.g. an int that lies in [0, 3] in signed domain must also lie in [0, 3] in unsigned domain and have all bits but the last 2 being unset. As a result, we must canonicalize the constraints (tighten the constraints so that they are optimal) before constructing a `TypeInt/Long` instance. >> >> This is extracted from #15440 , node value transformations are left for later PRs. I have also added unit tests to verify the soundness of constraint normalization. >> >> Please kindly review, thanks a lot. >> >> Testing >> >> - [x] GHA >> - [x] Linux x64, tier 1-4 > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > refine comments Nice improvements @merykitty ! I think the "2. Formality" section is very strong now. I'll continue reviewing the rest soon. src/hotspot/share/opto/rangeinference.cpp line 158: > 156: bits at x > i have lower significance, and are thus irrelevant > 157: > 158: a.2. We have v satisfies bits, this is because: Suggestion: a.2. We know v satisfies bits, this is because: src/hotspot/share/opto/rangeinference.cpp line 161: > 159: v[x] satisfies bits for 0 <= x < i (according to 2.2 and 2.5) > 160: v[i] satisfies bits: > 161: According to 2.3 and 2.6, v[i] == 1 and zeros[i] == 0, v[i] does not violate Suggestion: According to 2.3 and 2.6, zeros[i] == 0 and v[i] == 1, v[i] does not violate Keep the same order as the references. src/hotspot/share/opto/rangeinference.cpp line 187: > 185: bits at x > j have lower significance, and are thus irrelevant > 186: > 187: Which leads to r < lo, which contradicts that r >= lo Suggestion: Which leads to r < lo, which contradicts that r >= lo (according to definition of r) It could be nice to define it more explicitly, and give the definition of `r` a title / name. ------------- Changes requested by epeter (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/17508#pullrequestreview-2644060991 PR Review Comment: https://git.openjdk.org/jdk/pull/17508#discussion_r1971386552 PR Review Comment: https://git.openjdk.org/jdk/pull/17508#discussion_r1971390112 PR Review Comment: https://git.openjdk.org/jdk/pull/17508#discussion_r1971398511 From epeter at openjdk.org Wed Feb 26 11:18:08 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 11:18:08 GMT Subject: RFR: 8315066: Add unsigned bounds and known bits to TypeInt/Long [v44] In-Reply-To: <_sz68hEt9TeSdTUMkhFPSagjM5MuVVa1RK3pvrkjJmA=.abb9a9d0-e4fc-43b4-9d7f-e683b06441e5@github.com> References: <_sz68hEt9TeSdTUMkhFPSagjM5MuVVa1RK3pvrkjJmA=.abb9a9d0-e4fc-43b4-9d7f-e683b06441e5@github.com> Message-ID: On Wed, 26 Feb 2025 11:00:32 GMT, Emanuel Peter wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> refine comments > > src/hotspot/share/opto/rangeinference.cpp line 158: > >> 156: bits at x > i have lower significance, and are thus irrelevant >> 157: >> 158: a.2. We have v satisfies bits, this is because: > > Suggestion: > > a.2. We know v satisfies bits, this is because: Or, shorter: `v satisfies bits, because:` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17508#discussion_r1971387164 From duke at openjdk.org Wed Feb 26 12:22:46 2025 From: duke at openjdk.org (kuaiwei) Date: Wed, 26 Feb 2025 12:22:46 GMT Subject: RFR: 8347405: MergeStores with reverse bytes order value [v13] In-Reply-To: References: Message-ID: <6XEmUiapz_UElQlM-x5g61YOp2DSqSh3b0Vdq1jsWx8=.c7cd0945-47e4-49cb-bbd4-e6b7bc06c743@github.com> > This patch enhances the MergeStores optimization to support merging values with reverse byte order.
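As a rough illustration of the kind of store pattern involved, the sketch below writes an `int` into a byte array most-significant byte first, and then shows that a single little-endian 4-byte store of `Integer.reverseBytes(v)` leaves the same bytes in memory. This is an assumption about the pattern the optimization targets, not code from the patch or from `MergeStoreBench`; the class and method names are hypothetical, and the `ByteBuffer` call only models what one wide store would write.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

class ReverseStoreSketch {
    // Four adjacent byte stores, most-significant byte at the lowest address.
    static void storeBytewise(byte[] a, int off, int v) {
        a[off]     = (byte) (v >> 24);
        a[off + 1] = (byte) (v >> 16);
        a[off + 2] = (byte) (v >> 8);
        a[off + 3] = (byte) v;
    }

    // Models one 4-byte little-endian store of the byte-reversed value.
    static void storeMerged(byte[] a, int off, int v) {
        ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN).putInt(off, Integer.reverseBytes(v));
    }

    public static void main(String[] args) {
        byte[] x = new byte[8];
        byte[] y = new byte[8];
        storeBytewise(x, 2, 0x11223344);
        storeMerged(y, 2, 0x11223344);
        System.out.println(Arrays.equals(x, y));
    }
}
```

On a little-endian machine, the four byte stores in `storeBytewise` are therefore equivalent to one reversed-value store, which is the situation where merging can reduce the number of memory operations.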
> > Below is benchmark result before and after the patch: > > On aliyun g8y (aarch64) > |name | before | score2 | ratio | > |---|---|---|---| > |MergeStoreBench.setCharBS |5669.655000 |5669.566000 | 0.00 %| > |MergeStoreBench.setCharBV |5516.911000 |5516.273000 | 0.01 %| > |MergeStoreBench.setCharC |5578.644000 |5552.809000 | 0.47 %| > |MergeStoreBench.setCharLS |5782.140000 |5779.264000 | 0.05 %| > |MergeStoreBench.setCharLV |5496.403000 |5499.195000 | -0.05 %| > |MergeStoreBench.setIntB |6087.703000 |2768.385000 | 119.90 %| > |MergeStoreBench.setIntBU |6733.813000 |2950.240000 | 128.25 %| > |MergeStoreBench.setIntBV |1362.233000 |1361.821000 | 0.03 %| > |MergeStoreBench.setIntL |2834.785000 |2833.042000 | 0.06 %| > |MergeStoreBench.setIntLU |2947.145000 |2946.874000 | 0.01 %| > |MergeStoreBench.setIntLV |5506.791000 |5506.229000 | 0.01 %| > |MergeStoreBench.setIntRB |7634.279000 |5611.058000 | 36.06 %| > |MergeStoreBench.setIntRBU |7766.737000 |5551.281000 | 39.91 %| > |MergeStoreBench.setIntRL |5689.793000 |5689.385000 | 0.01 %| > |MergeStoreBench.setIntRLU |5628.287000 |5628.789000 | -0.01 %| > |MergeStoreBench.setIntRU |5536.039000 |5534.910000 | 0.02 %| > |MergeStoreBench.setIntU |5595.363000 |5567.810000 | 0.49 %| > |MergeStoreBench.setLongB |13722.671000 |6811.098000 | 101.48 %| > |MergeStoreBench.setLongBU |13728.844000 |4280.240000 | 220.75 %| > |MergeStoreBench.setLongBV |2785.255000 |2785.949000 | -0.02 %| > |MergeStoreBench.setLongL |5714.615000 |5710.402000 | 0.07 %| > |MergeStoreBench.setLongLU |4128.746000 |4129.324000 | -0.01 %| > |MergeStoreBench.setLongLV |2793.125000 |2794.438000 | -0.05 %| > |MergeStoreBench.setLongRB |14465.223000 |7015.050000 | 106.20 %| > |MergeStoreBench.setLongRBU |14546.954000 |6173.210000 | 135.65 %| > |MergeStoreBench.setLongRL |6816.145000 |6813.348000 | 0.04 %| > |MergeStoreBench.setLongRLU |4289.445000 |4284.239000 | 0.12 %| > |MergeStoreBench.setLongRU |3132.471000 |3133.093000 | -0.02 %| > |MergeStoreBench.setLongU 
|3086.779000 |3087.298000 | -0.02 %| > > AMD EPYC 9T24 > |name | before | after | ratio | > |---|---|---|---| > |MergeStoreBench.setChar... kuaiwei has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 19 commits: - Merge remote-tracking branch 'origin/master' into pr/merge_stores_reverse - Add readable comment - Fix for review comments - Allow ValueOrder::Reverse on big-endian platforms - Revert "Merge more stores" This reverts commit 1e1113ed02ec5a9fe181f215d5667e8de487fe47. - Revert "Fix test502aBE" This reverts commit f773fa368577c4f67957c4d40968c5c45e3ae205. - Fix test502aBE - Merge more stores - Remove an useless assertion - Remove tailing white space - ... and 9 more: https://git.openjdk.org/jdk/compare/aac9cb45...b3243a56 ------------- Changes: https://git.openjdk.org/jdk/pull/23030/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23030&range=12 Stats: 226 lines in 3 files changed: 142 ins; 14 del; 70 mod Patch: https://git.openjdk.org/jdk/pull/23030.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23030/head:pull/23030 PR: https://git.openjdk.org/jdk/pull/23030 From duke at openjdk.org Wed Feb 26 12:24:56 2025 From: duke at openjdk.org (kuaiwei) Date: Wed, 26 Feb 2025 12:24:56 GMT Subject: RFR: 8347405: MergeStores with reverse bytes order value [v12] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 10:58:32 GMT, kuaiwei wrote: >> This patch enhance MergeStores optimization to support merge value with reverse byte order. 
>> >> Below is benchmark result before and after the patch: >> >> On aliyun g8y (aarch64) >> |name | before | score2 | ratio | >> |---|---|---|---| >> |MergeStoreBench.setCharBS |5669.655000 |5669.566000 | 0.00 %| >> |MergeStoreBench.setCharBV |5516.911000 |5516.273000 | 0.01 %| >> |MergeStoreBench.setCharC |5578.644000 |5552.809000 | 0.47 %| >> |MergeStoreBench.setCharLS |5782.140000 |5779.264000 | 0.05 %| >> |MergeStoreBench.setCharLV |5496.403000 |5499.195000 | -0.05 %| >> |MergeStoreBench.setIntB |6087.703000 |2768.385000 | 119.90 %| >> |MergeStoreBench.setIntBU |6733.813000 |2950.240000 | 128.25 %| >> |MergeStoreBench.setIntBV |1362.233000 |1361.821000 | 0.03 %| >> |MergeStoreBench.setIntL |2834.785000 |2833.042000 | 0.06 %| >> |MergeStoreBench.setIntLU |2947.145000 |2946.874000 | 0.01 %| >> |MergeStoreBench.setIntLV |5506.791000 |5506.229000 | 0.01 %| >> |MergeStoreBench.setIntRB |7634.279000 |5611.058000 | 36.06 %| >> |MergeStoreBench.setIntRBU |7766.737000 |5551.281000 | 39.91 %| >> |MergeStoreBench.setIntRL |5689.793000 |5689.385000 | 0.01 %| >> |MergeStoreBench.setIntRLU |5628.287000 |5628.789000 | -0.01 %| >> |MergeStoreBench.setIntRU |5536.039000 |5534.910000 | 0.02 %| >> |MergeStoreBench.setIntU |5595.363000 |5567.810000 | 0.49 %| >> |MergeStoreBench.setLongB |13722.671000 |6811.098000 | 101.48 %| >> |MergeStoreBench.setLongBU |13728.844000 |4280.240000 | 220.75 %| >> |MergeStoreBench.setLongBV |2785.255000 |2785.949000 | -0.02 %| >> |MergeStoreBench.setLongL |5714.615000 |5710.402000 | 0.07 %| >> |MergeStoreBench.setLongLU |4128.746000 |4129.324000 | -0.01 %| >> |MergeStoreBench.setLongLV |2793.125000 |2794.438000 | -0.05 %| >> |MergeStoreBench.setLongRB |14465.223000 |7015.050000 | 106.20 %| >> |MergeStoreBench.setLongRBU |14546.954000 |6173.210000 | 135.65 %| >> |MergeStoreBench.setLongRL |6816.145000 |6813.348000 | 0.04 %| >> |MergeStoreBench.setLongRLU |4289.445000 |4284.239000 | 0.12 %| >> |MergeStoreBench.setLongRU |3132.471000 |3133.093000 | 
-0.02 %| >> |MergeStoreBench.setLongU |3086.779000 |3087.298000 | -0.02 %| >> >> AMD EPYC 9T24 >> ... > > kuaiwei has updated the pull request incrementally with one additional commit since the last revision: > > Fix for review comments Merged with master ------------- PR Comment: https://git.openjdk.org/jdk/pull/23030#issuecomment-2684807329 From dlunden at openjdk.org Wed Feb 26 12:34:49 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Wed, 26 Feb 2025 12:34:49 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v9] In-Reply-To: References: Message-ID: > If a method has a large number of parameters, we currently bail out from C2 compilation. > > ### Changeset > > Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. > > Changes: > - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. > - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. > - Remove all `can_represent` checks and bailouts. > - Performance tuning. 
A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. > - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) > - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. > - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). > > ![c2-regression](https:/... 
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Update ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20404/files - new: https://git.openjdk.org/jdk/pull/20404/files/76d182e4..0a5f5c84 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=07-08 Stats: 73 lines in 8 files changed: 29 ins; 6 del; 38 mod Patch: https://git.openjdk.org/jdk/pull/20404.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404 PR: https://git.openjdk.org/jdk/pull/20404 From mbaesken at openjdk.org Wed Feb 26 13:10:52 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Wed, 26 Feb 2025 13:10:52 GMT Subject: RFR: 8350683: Non-C2 / minimal JVM crashes in the build on ppc64 platforms In-Reply-To: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> References: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> Message-ID: On Wed, 26 Feb 2025 09:23:25 GMT, Matthias Baesken wrote: > When building a JVM without C2 (e.g. minimal) on ppc64 platforms, it crashes in the build because of unwanted dependencies to C2.
> AIX crash is (linux ppc64le crash is similar) : > > > # Internal Error (compiledIC_ppc.cpp:141), pid=17695018, tid=258 > # Error: ShouldNotReachHere() > # > > iar: 0x0900000008800c60 libjvm.so::AixNativeCallstack::print_callstack_for_context(outputStream*, ucontext_t const*, bool, char*, unsigned long)+0x4bc (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:18 fixedparms:5 parmsonstk:1) > lr: 0x09000000087ea9b8 libjvm.so::fdStream::write(char const*, unsigned long)+0x44 (C++ uses_alloca saves_lr stores_bc gpr_saved:4 fixedparms:3 parmsonstk:1) > sp: 0x000000011023aab0 (base - 0x2DD8) > rtoc: 0x08001000a0088ff0 > |---stackaddr----| |----lrsave------|: > 0x000000011023aea0 - 0x0900000008800730 libjvm.so::os::Aix::platform_print_native_stack(outputStream*, void const*, char*, int, unsigned char*&)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:5 parmsonstk:1) > 0x000000011023af20 - 0x0900000008800644 libjvm.so::NativeStackPrinter::print_stack(outputStream*, char*, int, unsigned char*&, bool, int)+0x60 (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:6 fixedparms:7 parmsonstk:1) > 0x000000011023afc0 - 0x09000000087f6ff8 libjvm.so::VMError::report(outputStream*, bool)+0x11f0 (C++ fp_present uses_alloca saves_cr saves_lr stores_bc gpr_saved:13 fixedparms:2 parmsonstk:1) > 0x000000011023b830 - 0x09000000087e9fdc libjvm.so::VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x6a0 (C++ uses_alloca saves_lr stores_bc gpr_saved:18 fixedparms:8 parmsonstk:1) > 0x000000011023ba10 - 0x09000000087e96a0 libjvm.so::report_vm_error(char const*, int, char const*, char const*, ...)+0xa0 (C++ uses_alloca saves_lr stores_bc gpr_saved:5 fixedparms:4 parmsonstk:1) > 0x000000011023bad0 - 0x09000000087e95cc libjvm.so::report_vm_error(char const*, int, char const*)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:3 parmsonstk:1) > 0x000000011023bb50 - 
0x09000000087e956c libjvm.so::report_should_not_reach_here(char const*, int)+0x20 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) > 0x000000011023bbd0 - 0x0900000008906e5c libjvm.so::CompiledDirectCall::emit_to_interp_stub(MacroAssembler*, unsigned char*)+0x28 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) > 0x000000011023bc50 - 0x090000... Hi @offamitkumar , s390x might have the same problem we face on ppc64, see here https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/s390/compiledIC_s390.cpp#L58 Maybe you want to check this. On ppc64, the previously C2-only coding in `CompiledDirectCall::emit_to_interp_stub` was needed for non-C2 JVMs like minimal JVM (where only C1 is present). ------------- PR Comment: https://git.openjdk.org/jdk/pull/23794#issuecomment-2684904878 From dlunden at openjdk.org Wed Feb 26 13:15:53 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Wed, 26 Feb 2025 13:15:53 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v10] In-Reply-To: References: Message-ID: > If a method has a large number of parameters, we currently bail out from C2 compilation. > > ### Changeset > > Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. > > Changes: > - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. 
To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. > - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. > - Remove all `can_represent` checks and bailouts. > - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. > - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) > - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. > - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). > > ![c2-regression](https:/... 
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Remove accidental leftover #endif ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20404/files - new: https://git.openjdk.org/jdk/pull/20404/files/0a5f5c84..e370f61f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=08-09 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20404.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404 PR: https://git.openjdk.org/jdk/pull/20404 From mablakatov at openjdk.org Wed Feb 26 14:54:45 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Wed, 26 Feb 2025 14:54:45 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: > Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. > > Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. > > The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
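The halving scheme described in the quoted summary can be modelled in scalar Java. This is only a sketch under stated assumptions: the class and method names are hypothetical, an `int[]` stands in for the vector lanes, and `narrowLanes` stands in for the lane count of a 128-bit register. It is not the SVE/ASIMD code from the PR.

```java
import java.util.Arrays;

// Hypothetical scalar model of a multiply reduction that folds halves together.
final class MulReductionSketch {
    // Reference result: multiply the lanes one after another.
    static int mulLanesSequential(int[] lanes) {
        int acc = 1;
        for (int lane : lanes) {
            acc *= lane;
        }
        return acc;
    }

    // Multiply the lower half by the upper half until at most narrowLanes lanes
    // remain, then finish sequentially. Assumes lanes.length is a power of two
    // and narrowLanes >= 1.
    static int mulLanesByHalving(int[] lanes, int narrowLanes) {
        int[] work = Arrays.copyOf(lanes, lanes.length); // the source lanes stay untouched
        int width = work.length;
        while (width > narrowLanes) {
            int half = width / 2;
            for (int i = 0; i < half; i++) {
                work[i] *= work[i + half];
            }
            width = half;
        }
        return mulLanesSequential(Arrays.copyOf(work, width));
    }

    public static void main(String[] args) {
        int[] lanes = {3, -7, 11, 2, 5, 13, -1, 9, 4, 6, 17, 8, 19, 10, 23, 12};
        // Wrapping int multiplication is associative and commutative,
        // so both orders give the same result.
        System.out.println(mulLanesSequential(lanes) == mulLanesByHalving(lanes, 4));
    }
}
```

The equality holds for integral lanes because wrapping multiplication does not depend on evaluation order. Floating-point multiplication is not associative, so the same reordering could change a float or double result.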
>
> Benchmarks results:
>
> Neoverse-V1 (SVE 256-bit)
>
> Benchmark                 (size)  Mode      master         PR  Units
> ByteMaxVector.MULLanes      1024  thrpt   5447.643  11455.535  ops/ms
> ShortMaxVector.MULLanes     1024  thrpt   3388.183   7144.301  ops/ms
> IntMaxVector.MULLanes       1024  thrpt   3010.974   4911.485  ops/ms
> LongMaxVector.MULLanes      1024  thrpt   1539.137   2562.835  ops/ms
> FloatMaxVector.MULLanes     1024  thrpt   1355.551   4158.128  ops/ms
> DoubleMaxVector.MULLanes    1024  thrpt   1715.854   3284.189  ops/ms
>
>
> Fujitsu A64FX (SVE 512-bit):
>
> Benchmark                 (size)  Mode      master         PR  Units
> ByteMaxVector.MULLanes      1024  thrpt   1091.692   2887.798  ops/ms
> ShortMaxVector.MULLanes     1024  thrpt    597.008   1863.338  ops/ms
> IntMaxVector.MULLanes       1024  thrpt    510.642   1348.651  ops/ms
> LongMaxVector.MULLanes      1024  thrpt    468.878    878.620  ops/ms
> FloatMaxVector.MULLanes     1024  thrpt    376.284   2237.564  ops/ms
> DoubleMaxVector.MULLanes    1024  thrpt    431.343   1646.792  ops/ms

Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision:

- fixup: don't modify the value in vsrc

  Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this change, the result of recursive folding is held in vtmp1. To be able to pass this intermediate result to reduce_mul_integral_le128b(), we would have to use another temporary FloatRegister, as vtmp1 would essentially act as vsrc. It's possible to get around this however: reduce_mul_integral_le128b() is modified so it's possible to pass matching vsrc and vtmp2 arguments. By doing this, we save ourselves a temporary register in rules that match to reduce_mul_integral_gt128b().
- cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23181/files - new: https://git.openjdk.org/jdk/pull/23181/files/c9dcc45f..3fc989bd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=01-02 Stats: 67 lines in 1 file changed: 35 ins; 17 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/23181.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181 PR: https://git.openjdk.org/jdk/pull/23181 From kbarrett at openjdk.org Wed Feb 26 15:16:49 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 26 Feb 2025 15:16:49 GMT Subject: RFR: 8345492: Fix -Wzero-as-null-pointer-constant warnings in adlc code Message-ID: <3SzBZUBz0SaRp9F7y0BX7WMqm_MkuHodLh8erPRYuWk=.a47cfc08-1c25-4b61-b7d2-3dd840e3b488@github.com> Please review this trivial change to adlc, to use nullptr instead of literal 0 as a null pointer constant. Testing: mach5 tier1 Locally tested (linux-x64) with -Wzero-as-null-pointer-constant enabled to verify the warnings associated with this code were removed.
------------- Commit messages: - fix warnings in adlc Changes: https://git.openjdk.org/jdk/pull/23804/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23804&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8345492 Stats: 9 lines in 2 files changed: 0 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23804.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23804/head:pull/23804 PR: https://git.openjdk.org/jdk/pull/23804 From jkarthikeyan at openjdk.org Wed Feb 26 15:31:12 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 26 Feb 2025 15:31:12 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants [v4] In-Reply-To: <7tc3Q6VBD2QGu2tstDrVGICIMzofeN0docMxH9bVblQ=.b6453da8-ad4b-40a4-9f72-0d48a11d5d96@github.com> References: <7tc3Q6VBD2QGu2tstDrVGICIMzofeN0docMxH9bVblQ=.b6453da8-ad4b-40a4-9f72-0d48a11d5d96@github.com> Message-ID: On Tue, 25 Feb 2025 13:05:13 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/mulnode.cpp line 996: >> >>> 994: >>> 995: if (con0 + con1 >= nbits) { >>> 996: return ConNode::make(TypeInteger::zero(bt)); >> >> It'd be clearer to do this, which is more equivalent but more concise: >> Suggestion: >> >> return phase->zerocon(bt); > > Actually, this is not equivalent and incorrect. I've did this exact mistake in an earlier version. The problem is that `zerocon` caches the nodes: > https://github.com/openjdk/jdk/blob/8cfebc41dc8ec7b0d24d9c467b91de82d28b73fc/src/hotspot/share/opto/phaseX.cpp#L654-L656 > So, then, we likely (or at least may) return an old node, which is not legal: `Ideal` is only allowed to return `this`, `nullptr` or a new node. But yes, it's unfortunate because it'd be much lighter to read. Ah, that's tricky! But that makes a lot of sense to me. Maybe it could be worth adding a comment there to mention the fact. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23728#discussion_r1971822108 From jkarthikeyan at openjdk.org Wed Feb 26 15:31:57 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 26 Feb 2025 15:31:57 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v6] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 02:25:34 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculate the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to `TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Add Vector API Test Sounds good! I think I need a re-review from a Reviewer for the latest commit before integrating.
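The rounding problem from the quoted summary can be reproduced in scalar Java. This is a minimal sketch, not the AVX2 code: the class and method names are hypothetical, and the mask used here (clearing the low 8 bits once the input reaches 2^24) is an assumption chosen for illustration, not necessarily the mask the patch applies.

```java
// Hypothetical scalar model of "leading zeros via the float exponent".
final class LzcntFloatRounding {
    // Leading-zero count derived from the float exponent; valid for x > 0 only.
    static int lzcntViaFloat(int x) {
        // For x > 0 the sign bit is 0, so the top bits are the biased exponent.
        int biasedExponent = Float.floatToRawIntBits((float) x) >>> 23;
        return 31 - (biasedExponent - 127);
    }

    // Assumed mask: above 2^24, drop the low 8 bits so the value has at most
    // 23 significant bits and converts to float without rounding.
    static int lzcntViaFloatMasked(int x) {
        int masked = x >= (1 << 24) ? (x & ~0xFF) : x;
        return lzcntViaFloat(masked);
    }

    public static void main(String[] args) {
        System.out.println((float) 0x01FFFFFF == (float) 0x02000000); // true
        System.out.println(Integer.numberOfLeadingZeros(0x01FFFFFF)); // 7
        System.out.println(lzcntViaFloat(0x01FFFFFF));                // 6
        System.out.println(lzcntViaFloatMasked(0x01FFFFFF));          // 7
    }
}
```

Clearing low bits never moves the highest set bit, so the exponent of the masked value still identifies the correct leading-zero count.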
------------- PR Comment: https://git.openjdk.org/jdk/pull/23579#issuecomment-2685413514 From duke at openjdk.org Wed Feb 26 15:33:53 2025 From: duke at openjdk.org (Marc Chevalier) Date: Wed, 26 Feb 2025 15:33:53 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants [v4] In-Reply-To: References: <7tc3Q6VBD2QGu2tstDrVGICIMzofeN0docMxH9bVblQ=.b6453da8-ad4b-40a4-9f72-0d48a11d5d96@github.com> Message-ID: On Wed, 26 Feb 2025 15:28:03 GMT, Jasmine Karthikeyan wrote: >> Actually, this is not equivalent and incorrect. I've did this exact mistake in an earlier version. The problem is that `zerocon` caches the nodes: >> https://github.com/openjdk/jdk/blob/8cfebc41dc8ec7b0d24d9c467b91de82d28b73fc/src/hotspot/share/opto/phaseX.cpp#L654-L656 >> So, then, we likely (or at least may) return an old node, which is not legal: `Ideal` is only allowed to return `this`, `nullptr` or a new node. But yes, it's unfortunate because it'd be much lighter to read. > > Ah, that's tricky! But that makes a lot of sense to me. Maybe it could be worth adding a comment there to mention the fact. You're right, it's tricky enough to be worth it. I will add a comment, good idea. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23728#discussion_r1971829030 From kvn at openjdk.org Wed Feb 26 15:53:00 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 26 Feb 2025 15:53:00 GMT Subject: RFR: 8349921: Crash in codeBuffer.cpp:1004: guarantee(sect->end() <= tend) failed: sanity [v2] In-Reply-To: References: Message-ID: <2STTWeawrumlBe31AyHrOt5NauW3vV4fIm4RS6tJNuA=.f503e389-adf1-4939-a188-c4d51b611389@github.com> On Wed, 26 Feb 2025 08:36:27 GMT, Andrew Dinn wrote: >> The compiler blob base size needs increasing in case the JDK is built without ZGC. The increment when ZGC is used can be comparably decreased. The final blob size increment when ZGC is included is over generous and can also be decreased. 
> > Andrew Dinn has updated the pull request incrementally with two additional commits since the last revision: > > - another format fix > - format fix Good. I will run this through our builds testing. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23776#pullrequestreview-2644996183 PR Comment: https://git.openjdk.org/jdk/pull/23776#issuecomment-2685473797 From dfenacci at openjdk.org Wed Feb 26 16:36:58 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Wed, 26 Feb 2025 16:36:58 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants [v4] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 12:59:21 GMT, Marc Chevalier wrote: >> This collapses double shift lefts by constants in a single constant: (x << con1) << con2 => x << (con1 + con2). Care must be taken in the case con1 + con2 is bigger than the number of bits in the integer type. In this case, we must simplify to 0. >> >> Moreover, the simplification logic of the sign extension trick had to be improved. For instance, we use `(x << 16) >> 16` to convert a 32 bits into a 16 bits integer, with sign extension. When storing this into a 16-bit field, this can be simplified into simple `x`. But in the case where `x` is itself a left-shift expression, say `y << 3`, this PR makes the IR looks like `(y << 19) >> 16` instead of the old `((y << 3) << 16) >> 16`. The former logic didn't handle the case where the left and the right shift have different magnitude. In this PR, I generalize this simplification to cases where the left shift has a larger magnitude than the right shift. This improvement was needed not to miss vectorization opportunities: without the simplification, we have a left shift and a right shift instead of a single left shift, which confuses the type inference. >> >> This also works for multiplications by powers of 2 since they are already translated into shifts. 
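The overflow caveat in the quoted description can be checked with plain Java ints. This is a sketch with hypothetical names and does not reflect the C2 node code; it only models the arithmetic. Java masks an int shift count to its low 5 bits, so a folded count of 32 or more must be special-cased to 0 rather than passed to a single shift.

```java
// Hypothetical model of folding (x << con1) << con2 into one shift.
final class ShiftChainFolding {
    // What the source program computes: two separate shifts.
    static int shiftTwice(int x, int con1, int con2) {
        return (x << con1) << con2;
    }

    // Folded form for con1 and con2 in [0, 31]: every bit is shifted out
    // once the combined count reaches the width of int.
    static int shiftFolded(int x, int con1, int con2) {
        int total = con1 + con2;
        return total >= Integer.SIZE ? 0 : x << total;
    }

    public static void main(String[] args) {
        int x = 0x12345678;
        System.out.println(shiftTwice(x, 3, 16) == shiftFolded(x, 3, 16)); // true
        System.out.println(shiftTwice(x, 20, 20));                          // 0
        // A naive x << (20 + 20) is not 0: the count 40 is masked to 8.
        System.out.println(Integer.toHexString(x << 40));                   // 34567800
        // Sign-extension trick with a larger left shift, stored as a short.
        int y = 0x1234;
        System.out.println((short) ((y << 19) >> 16) == (short) (y << 3)); // true
    }
}
```

The last line illustrates the generalized store simplification: the low 16 bits of `(y << 19) >> 16` equal the low 16 bits of `y << 3`, so a 16-bit store sees the same value.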
>> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > comment Cool improvement @marc-chevalier! Thanks a lot! test/hotspot/jtreg/compiler/c2/irTests/LShiftINodeIdealizationTests.java line 255: > 253: res[0] = (short) (a[0] << 3); > 254: return res; > 255: } In the comment of method `StoreNode::Ideal_sign_extended_input` you mention shifting left by more than 24. Do you think it would be possible to have a test for that case too (shifting left by more than 16 in this case I guess)? ------------- PR Review: https://git.openjdk.org/jdk/pull/23728#pullrequestreview-2645128722 PR Review Comment: https://git.openjdk.org/jdk/pull/23728#discussion_r1971942835 From kvn at openjdk.org Wed Feb 26 16:38:03 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 26 Feb 2025 16:38:03 GMT Subject: RFR: 8347426: Invalid value used for enum Cell in iTypeFlow::StateVector::meet_exception In-Reply-To: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com> References: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com> Message-ID: On Tue, 25 Feb 2025 10:11:54 GMT, Marc Chevalier wrote: > As guess on the JBS ticket, we have a UB when `_outer->max_locals() == 0`, because then we try to do `(Cell)(-1)` which is out of range since Cell's range is [0, `INT_MAX`]. > > The obvious fix that is > > Cell limit = local(_outer->max_locals()); > for (Cell c = start_cell(); c < limit; c = next_cell(c)) { > > since `local` asserts its argument to be in [0, `outer->max_locals()`). Of course > > Cell limit = (Cell)(_outer->max_locals()); > > would work, but it seems to break (the very light) abstraction. > > I've also added an assert to transform the UB into a clear failure. > > This fix makes the UB warning go away on Mac with arm64. > > Thanks, > Marc I vote for first suggestion (and change `<=` to `<`). 
Second suggestion may prevent some C++ range elimination, constant folding in the loop. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23772#issuecomment-2685586878 PR Comment: https://git.openjdk.org/jdk/pull/23772#issuecomment-2685589405 From kvn at openjdk.org Wed Feb 26 16:45:57 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 26 Feb 2025 16:45:57 GMT Subject: RFR: 8345492: Fix -Wzero-as-null-pointer-constant warnings in adlc code In-Reply-To: <3SzBZUBz0SaRp9F7y0BX7WMqm_MkuHodLh8erPRYuWk=.a47cfc08-1c25-4b61-b7d2-3dd840e3b488@github.com> References: <3SzBZUBz0SaRp9F7y0BX7WMqm_MkuHodLh8erPRYuWk=.a47cfc08-1c25-4b61-b7d2-3dd840e3b488@github.com> Message-ID: On Wed, 26 Feb 2025 15:11:25 GMT, Kim Barrett wrote: > Please review this trivial change to adlc, to use nullptr instead of literal 0 > as a null pointer constant. > > Testing: mach5 tier1 > Locally tested (linux-x64) with -Wzero-as-null-pointer-constant enabled to > verify the warnings associated with this code were removed. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23804#pullrequestreview-2645154930 From amitkumar at openjdk.org Wed Feb 26 16:49:54 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Wed, 26 Feb 2025 16:49:54 GMT Subject: RFR: 8350683: Non-C2 / minimal JVM crashes in the build on ppc64 platforms In-Reply-To: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> References: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> Message-ID: On Wed, 26 Feb 2025 09:23:25 GMT, Matthias Baesken wrote: > When building a JVM without C2 (e.g. minimal) on ppc64 platforms , it crashes in the build because of unwanted dependencies to C2. 
> AIX crash is (linux ppc64le crash is similar) : > > > # Internal Error (compiledIC_ppc.cpp:141), pid=17695018, tid=258 > # Error: ShouldNotReachHere() > # > > iar: 0x0900000008800c60 libjvm.so::AixNativeCallstack::print_callstack_for_context(outputStream*, ucontext_t const*, bool, char*, unsigned long)+0x4bc (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:18 fixedparms:5 parmsonstk:1) > lr: 0x09000000087ea9b8 libjvm.so::fdStream::write(char const*, unsigned long)+0x44 (C++ uses_alloca saves_lr stores_bc gpr_saved:4 fixedparms:3 parmsonstk:1) > sp: 0x000000011023aab0 (base - 0x2DD8) > rtoc: 0x08001000a0088ff0 > |---stackaddr----| |----lrsave------|: > 0x000000011023aea0 - 0x0900000008800730 libjvm.so::os::Aix::platform_print_native_stack(outputStream*, void const*, char*, int, unsigned char*&)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:5 parmsonstk:1) > 0x000000011023af20 - 0x0900000008800644 libjvm.so::NativeStackPrinter::print_stack(outputStream*, char*, int, unsigned char*&, bool, int)+0x60 (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:6 fixedparms:7 parmsonstk:1) > 0x000000011023afc0 - 0x09000000087f6ff8 libjvm.so::VMError::report(outputStream*, bool)+0x11f0 (C++ fp_present uses_alloca saves_cr saves_lr stores_bc gpr_saved:13 fixedparms:2 parmsonstk:1) > 0x000000011023b830 - 0x09000000087e9fdc libjvm.so::VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x6a0 (C++ uses_alloca saves_lr stores_bc gpr_saved:18 fixedparms:8 parmsonstk:1) > 0x000000011023ba10 - 0x09000000087e96a0 libjvm.so::report_vm_error(char const*, int, char const*, char const*, ...)+0xa0 (C++ uses_alloca saves_lr stores_bc gpr_saved:5 fixedparms:4 parmsonstk:1) > 0x000000011023bad0 - 0x09000000087e95cc libjvm.so::report_vm_error(char const*, int, char const*)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:3 parmsonstk:1) > 0x000000011023bb50 - 
0x09000000087e956c libjvm.so::report_should_not_reach_here(char const*, int)+0x20 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) > 0x000000011023bbd0 - 0x0900000008906e5c libjvm.so::CompiledDirectCall::emit_to_interp_stub(MacroAssembler*, unsigned char*)+0x28 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) > 0x000000011023bc50 - 0x090000... Hi Matthias, You mentioned that the crash happened during the build ? I just checked and at least the build is fine on s390x. This is what I am using for configuration: bash configure \ --with-boot-jdk=$HOME/jjboot_jdk_23 \ --with-jtreg=$HOME/jtreg \ --with-gtest=$HOME/googletest \ --with-jmh=build/jmh/jars \ --with-debug-level=fastdebug \ --with-jvm-variants=minimal \ --with-native-debug-symbols=internal \ --disable-precompiled-headers ------------- PR Comment: https://git.openjdk.org/jdk/pull/23794#issuecomment-2685620359 From duke at openjdk.org Wed Feb 26 17:18:29 2025 From: duke at openjdk.org (Johannes Graham) Date: Wed, 26 Feb 2025 17:18:29 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v37] In-Reply-To: References: Message-ID: > An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it. 
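To illustrate why constant folding should take priority over a range-based bound for xor, here is a small Java sketch. The names are hypothetical and the bound is a simple one assumed for illustration (if both non-negative inputs fit in k bits, so does their xor); it is not claimed to be the exact calculation in the patch.

```java
// Hypothetical sketch of an upper bound for a ^ b with 0 <= a <= hiA, 0 <= b <= hiB.
final class XorUpperBoundSketch {
    // Smallest all-ones value that covers every bit position used by hiA or hiB.
    static int xorUpperBound(int hiA, int hiB) {
        int m = hiA | hiB;
        return m == 0 ? 0 : (Integer.highestOneBit(m) << 1) - 1;
    }

    // Brute-force check over all ranges of 4-bit numbers.
    static boolean holdsForAllFourBitRanges() {
        for (int hiA = 0; hiA < 16; hiA++) {
            for (int hiB = 0; hiB < 16; hiB++) {
                int bound = xorUpperBound(hiA, hiB);
                for (int a = 0; a <= hiA; a++) {
                    for (int b = 0; b <= hiB; b++) {
                        if ((a ^ b) > bound) {
                            return false;
                        }
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(holdsForAllFourBitRanges()); // true
        // With constant inputs the range bound is sound but imprecise:
        System.out.println(xorUpperBound(5, 3)); // 7
        System.out.println(5 ^ 3);               // 6
    }
}
```

For the constants 5 and 3 the bound only says the result lies in [0, 7], while folding yields exactly 6, which is why an exact fold is preferable whenever both inputs are constants.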
>
> In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items:
> - Bounds optimization of xor
> - A check for `x ^ x = 0`
> - Explicit testing of xor over booleans.
>
> Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization. It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types.
>
> ---------
> ### Progress
> - [x] Change must not contain extraneous whitespace
> - [x] Commit message must refer to an issue
> - [ ] Change must be properly reviewed (2 reviews required, with at least 2 [Reviewers](https://openjdk.org/bylaws#reviewer))
>
> ### Reviewers
> * [Quan Anh Mai](https://openjdk.org/census#qamai) (@merykitty - Committer) Re-review required (review applies to [cf779497](https://git.openjdk.org/jdk/pull/23089/files/cf77949776f7a4601268c7291a5743c2eb164186))
>
> ### Reviewing
> Using git
>
> Checkout this PR locally:
> `$ git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089`
> `$ git checkout pull/23089`
>
> Update a local copy of the PR:
> `$ git checkout pull/23089`
> `$ git pull https://git.openjdk.org/jdk.git pull/23089/head`
>
> Using Skara CLI tools
>
> Checkout this PR locally:
> `$ git pr checkout 23089`
>
> View PR using the GUI difftool:
> `$ git pr show -t 23089`
>
> Using diff file
>
> Download this PR as a diff file:
> https://git.openjdk.org/jdk/pull/23089.diff
>
> Using Webrev
>
> [Link to Webrev Comment](https://git.openjdk.org/jdk/pull/23089#issuecomment-2593992282)

Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: consistency ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/6d60ae2a..4a8840c9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=36 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=35-36 Stats: 4 lines in 2 files changed: 2 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From kvn at openjdk.org Wed Feb 26 18:20:59 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 26 Feb 2025 18:20:59 GMT Subject: RFR: 8349921: Crash in codeBuffer.cpp:1004: guarantee(sect->end() <= tend) failed: sanity [v2] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 08:36:27 GMT, Andrew Dinn wrote: >> The compiler blob base size needs increasing in case the JDK is built without ZGC. The increment when ZGC is used can be comparably decreased. The final blob size increment when ZGC is included is over generous and can also be decreased. > > Andrew Dinn has updated the pull request incrementally with two additional commits since the last revision: > > - another format fix > - format fix Our builds and tier1 testing passed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23776#issuecomment-2685843357 From roland at openjdk.org Wed Feb 26 19:34:04 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 26 Feb 2025 19:34:04 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 09:27:13 GMT, Emanuel Peter wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. 
>> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! >> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. 
>> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 66 commits: > > - Merge branch 'master' into JDK-8323582-SW-native-alignment > - stall -> delay, plus some more comments > - adjust selector if probability > - Merge branch 'master' into JDK-8323582-SW-native-alignment > - remove multiversion mark if we break the structure > - register opaque with igvn > - copyright and rm CFG check > - IR rules for all cases > - 3 test versions > - test changed to unaligned ints > - ... and 56 more: https://git.openjdk.org/jdk/compare/d551daca...8eb52292 Looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22016#pullrequestreview-2645658428 From duke at openjdk.org Wed Feb 26 19:37:44 2025 From: duke at openjdk.org (Johannes Graham) Date: Wed, 26 Feb 2025 19:37:44 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v38] In-Reply-To: References: Message-ID: > An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it. 
> > In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items: > - Bounds optimization of xor > - A check for `x ^ x = 0` > - Explicit testing of xor over booleans. > > Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization. It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types. > > --------- > ### Progress > - [x] Change must not contain extraneous whitespace > - [x] Commit message must refer to an issue > - [ ] Change must be properly reviewed (2 reviews required, with at least 2 [Reviewers](https://openjdk.org/bylaws#reviewer)) > > > > ### Reviewers > * [Quan Anh Mai](https://openjdk.org/census#qamai) (@merykitty - Committer) ? Re-review required (review applies to [cf779497](https://git.openjdk.org/jdk/pull/23089/files/cf77949776f7a4601268c7291a5743c2eb164186)) > > ### Reviewing >
Using git > > Checkout this PR locally: \ > `$ git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089` \ > `$ git checkout pull/23089` > > Update a local copy of the PR: \ > `$ git checkout pull/23089` \ > `$ git pull https://git.openjdk.org/jdk.git pull/23089/head` > >
>
Using Skara CLI tools > > Checkout this PR locally: \ > `$ git pr checkout 23089` > > View PR using the GUI difftool: \ > `$ git pr show -t 23089` > >
>
Using diff file > > Download this PR as a diff file: \ > https://git.openjdk.org/jdk/pull/23089.diff > >
>
Using Webrev > > [Link to Webrev Comment](https://git.openjdk.org/jdk/pull/23089#issuecomment-2593992282) >
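The upper-bound reasoning mentioned in the quoted description can be illustrated with a small stand-alone sketch. This is not the HotSpot code: `xor_upper_bound_non_neg` and `bound_holds_for_4_bit_ranges` are hypothetical names, and the sketch assumes both inputs are non-negative ranges `[0, hi_a]` and `[0, hi_b]`. The idea is that `x ^ y` cannot set a bit above the highest bit set in either upper limit, so an all-ones mask up to that bit is a safe (possibly loose) bound. The second function mirrors the exhaustive 4-bit check that the description attributes to `test_xor_node.cpp`.

```cpp
#include <cstdint>

// Hypothetical helper, not the HotSpot function: an upper bound for x ^ y
// when 0 <= x <= hi_a and 0 <= y <= hi_b.
inline uint32_t xor_upper_bound_non_neg(uint32_t hi_a, uint32_t hi_b) {
  uint32_t m = hi_a | hi_b;
  // Smear the highest set bit into every lower position, giving a mask of
  // all ones up to and including that bit.
  m |= m >> 1;
  m |= m >> 2;
  m |= m >> 4;
  m |= m >> 8;
  m |= m >> 16;
  return m;
}

// Exhaustively check every pair of 4-bit ranges [0, hi_a] x [0, hi_b].
inline bool bound_holds_for_4_bit_ranges() {
  for (uint32_t hi_a = 0; hi_a < 16; hi_a++) {
    for (uint32_t hi_b = 0; hi_b < 16; hi_b++) {
      const uint32_t bound = xor_upper_bound_non_neg(hi_a, hi_b);
      for (uint32_t x = 0; x <= hi_a; x++) {
        for (uint32_t y = 0; y <= hi_b; y++) {
          if ((x ^ y) > bound) {
            return false;  // the bound would be unsound for this pair
          }
        }
      }
    }
  }
  return true;
}
```

When both inputs are single constants the exact value `x ^ y` is known, which is why the description gives constant folding priority over a range bound of this kind.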
Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: add test of random ranges ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/4a8840c9..7658fc9a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=37 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=36-37 Stats: 90 lines in 1 file changed: 89 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From duke at openjdk.org Wed Feb 26 19:40:56 2025 From: duke at openjdk.org (Johannes Graham) Date: Wed, 26 Feb 2025 19:40:56 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v33] In-Reply-To: <2lzeygtsSsYdvG3kVtwe70o3y6lbUDmJjNCHvg4mE80=.f15e0732-7ef2-48f7-9ae9-5ceae645a3fe@github.com> References: <2lzeygtsSsYdvG3kVtwe70o3y6lbUDmJjNCHvg4mE80=.f15e0732-7ef2-48f7-9ae9-5ceae645a3fe@github.com> Message-ID: <_LENf7_DRI-_egyRjObzVyq2u-C9uccrkvAMOu3sagE=.df4573dc-836a-4ed1-84db-d2e570a0bc62@github.com> On Wed, 26 Feb 2025 06:51:44 GMT, Emanuel Peter wrote: >> Also, as you commented on somewhere above, having a way to target Types from a stand-alone gtest, would also be a really nice way of making things more testable. > >> For example the "pow2" tests are there because that was a scenario that caused some trouble. I think there is a relatively small set of "interesting" cases that are nice to cover with something deterministic. (I could add a few more tests with hard-coded interesting values). > > I know that there are only few "interesting" cases. That's why I came up with `Generators` so we would be more likely to hit the interesting cases. > > Of course we want to have a good number of deterministic cases. 
But extending that with some randomized tests that would find bugs over a longer time span are still valuable. > > The issue is often that there are edge cases, and humans are not very good at finding them. Thus with randomization we would at least eventually find those cases - hopefully. And in my experience I did find bugs with randomized tests. Sometimes we found bugs that were not related to the issue they were created for originally, but they found some other issue - and that is valuable too. > >> It feels disproportionate to do something more complicated for this PR. > > It's really not that complicated. All I'd be asking for is the test in the form that I provided above ;) I've added the test. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23089#discussion_r1972268860 From duke at openjdk.org Wed Feb 26 19:52:39 2025 From: duke at openjdk.org (Johannes Graham) Date: Wed, 26 Feb 2025 19:52:39 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v39] In-Reply-To: References: Message-ID: > An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it. > > In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items: > - Bounds optimization of xor > - A check for `x ^ x = 0` > - Explicit testing of xor over booleans. 
>
> Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization. It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types.
>
> ---------
> ### Progress
> - [x] Change must not contain extraneous whitespace
> - [x] Commit message must refer to an issue
> - [ ] Change must be properly reviewed (2 reviews required, with at least 2 [Reviewers](https://openjdk.org/bylaws#reviewer))
>
> ### Reviewers
> * [Quan Anh Mai](https://openjdk.org/census#qamai) (@merykitty - Committer): re-review required (review applies to [cf779497](https://git.openjdk.org/jdk/pull/23089/files/cf77949776f7a4601268c7291a5743c2eb164186))
>
> ### Reviewing
>
Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: update bug numbers and summary ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/7658fc9a..f5674420 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=38 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=37-38 Stats: 4 lines in 2 files changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From dlong at openjdk.org Wed Feb 26 20:12:51 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 26 Feb 2025 20:12:51 GMT Subject: RFR: 8345492: Fix -Wzero-as-null-pointer-constant warnings in adlc code In-Reply-To: <3SzBZUBz0SaRp9F7y0BX7WMqm_MkuHodLh8erPRYuWk=.a47cfc08-1c25-4b61-b7d2-3dd840e3b488@github.com> References: <3SzBZUBz0SaRp9F7y0BX7WMqm_MkuHodLh8erPRYuWk=.a47cfc08-1c25-4b61-b7d2-3dd840e3b488@github.com> Message-ID: <39DAtelp9vBg6xbJ6fPLvVJs_sR1U-9Q3SwMhnMZvNk=.b262cfd9-cc81-4e2b-b92d-aaea0a887cd7@github.com> On Wed, 26 Feb 2025 15:11:25 GMT, Kim Barrett wrote: > Please review this trivial change to adlc, to use nullptr instead of literal 0 > as a null pointer constant. > > Testing: mach5 tier1 > Locally tested (linux-x64) with -Wzero-as-null-pointer-constant enabled to > verify the warnings associated with this code were removed. Marked as reviewed by dlong (Reviewer). 
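As a small illustration of why `nullptr` is preferred over a literal `0` as a null pointer constant, the sketch below uses a hypothetical overload pair (`overload_id` is not taken from adlc). A literal `0` is first of all an `int`, so it selects the integer overload, while `nullptr` can only convert to a pointer type.

```cpp
// Hypothetical overload pair, used only to show how the two spellings resolve.
inline int overload_id(int)         { return 0; }  // chosen for integer arguments
inline int overload_id(const char*) { return 1; }  // chosen for pointer arguments

// A literal 0 is an int, so the integer overload wins.
inline int id_for_literal_zero() { return overload_id(0); }

// nullptr has type std::nullptr_t, which converts to pointers but not to int.
inline int id_for_nullptr() { return overload_id(nullptr); }
```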
------------- PR Review: https://git.openjdk.org/jdk/pull/23804#pullrequestreview-2645767681 From jbhateja at openjdk.org Wed Feb 26 20:50:51 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 26 Feb 2025 20:50:51 GMT Subject: RFR: 8346236: Auto vectorization support for various Float16 operations Message-ID:

This is a follow-up PR for https://github.com/openjdk/jdk/pull/22754

The patch adds support to vectorize various float16 scalar operations (add/subtract/divide/multiply/sqrt/fma).

Summary of changes included with the patch:
1. C2 compiler New Vector IR creation.
2. Auto-vectorization support.
3. x86 backend implementation.
4. New IR verification test for each newly supported vector operation.

Following are the performance numbers of Float16OperationsBenchmark

System : Intel(R) Xeon(R) Processor code-named Granite rapids
Frequency fixed at 2.5 GHz

Baseline
Benchmark                                                     (vectorDim)   Mode  Cnt      Score  Error  Units
Float16OperationsBenchmark.absBenchmark                              1024  thrpt    2   4191.787         ops/ms
Float16OperationsBenchmark.addBenchmark                              1024  thrpt    2   1211.978         ops/ms
Float16OperationsBenchmark.cosineSimilarityDequantizedFP16           1024  thrpt    2    493.026         ops/ms
Float16OperationsBenchmark.cosineSimilarityDoubleRoundingFP16        1024  thrpt    2    612.430         ops/ms
Float16OperationsBenchmark.cosineSimilaritySingleRoundingFP16        1024  thrpt    2    616.012         ops/ms
Float16OperationsBenchmark.divBenchmark                              1024  thrpt    2    604.882         ops/ms
Float16OperationsBenchmark.dotProductFP16                            1024  thrpt    2    410.798         ops/ms
Float16OperationsBenchmark.euclideanDistanceDequantizedFP16          1024  thrpt    2    602.863         ops/ms
Float16OperationsBenchmark.euclideanDistanceFP16                     1024  thrpt    2    640.348         ops/ms
Float16OperationsBenchmark.fmaBenchmark                              1024  thrpt    2    809.175         ops/ms
Float16OperationsBenchmark.getExponentBenchmark                      1024  thrpt    2   2682.764         ops/ms
Float16OperationsBenchmark.isFiniteBenchmark                         1024  thrpt    2   3373.901         ops/ms
Float16OperationsBenchmark.isFiniteCMovBenchmark                     1024  thrpt    2   1881.652         ops/ms
Float16OperationsBenchmark.isFiniteStoreBenchmark                    1024  thrpt    2   2273.745         ops/ms
Float16OperationsBenchmark.isInfiniteBenchmark                       1024  thrpt    2   2147.913         ops/ms
Float16OperationsBenchmark.isInfiniteCMovBenchmark                   1024  thrpt    2   1962.579         ops/ms
Float16OperationsBenchmark.isInfiniteStoreBenchmark                  1024  thrpt    2   1696.494         ops/ms
Float16OperationsBenchmark.isNaNBenchmark                            1024  thrpt    2   2417.396         ops/ms
Float16OperationsBenchmark.isNaNCMovBenchmark                        1024  thrpt    2   1708.585         ops/ms
Float16OperationsBenchmark.isNaNStoreBenchmark                       1024  thrpt    2   2055.511         ops/ms
Float16OperationsBenchmark.maxBenchmark                              1024  thrpt    2   1211.940         ops/ms
Float16OperationsBenchmark.minBenchmark                              1024  thrpt    2   1212.063         ops/ms
Float16OperationsBenchmark.mulBenchmark                              1024  thrpt    2   1211.955         ops/ms
Float16OperationsBenchmark.negateBenchmark                           1024  thrpt    2   4215.922         ops/ms
Float16OperationsBenchmark.sqrtBenchmark                             1024  thrpt    2    337.606         ops/ms
Float16OperationsBenchmark.subBenchmark                              1024  thrpt    2   1212.467         ops/ms

Withopt:
Benchmark                                                     (vectorDim)   Mode  Cnt      Score  Error  Units
Float16OperationsBenchmark.absBenchmark                              1024  thrpt    2  28481.336         ops/ms
Float16OperationsBenchmark.addBenchmark                              1024  thrpt    2  21311.633         ops/ms
Float16OperationsBenchmark.cosineSimilarityDequantizedFP16           1024  thrpt    2    489.324         ops/ms
Float16OperationsBenchmark.cosineSimilarityDoubleRoundingFP16        1024  thrpt    2    592.947         ops/ms
Float16OperationsBenchmark.cosineSimilaritySingleRoundingFP16        1024  thrpt    2    616.415         ops/ms
Float16OperationsBenchmark.divBenchmark                              1024  thrpt    2   1991.958         ops/ms
Float16OperationsBenchmark.dotProductFP16                            1024  thrpt    2    586.924         ops/ms
Float16OperationsBenchmark.euclideanDistanceDequantizedFP16          1024  thrpt    2    747.626         ops/ms
Float16OperationsBenchmark.euclideanDistanceFP16                     1024  thrpt    2    635.823         ops/ms
Float16OperationsBenchmark.fmaBenchmark                              1024  thrpt    2  15722.304         ops/ms
Float16OperationsBenchmark.getExponentBenchmark                      1024  thrpt    2   2685.930         ops/ms
Float16OperationsBenchmark.isFiniteBenchmark                         1024  thrpt    2   3455.726         ops/ms
Float16OperationsBenchmark.isFiniteCMovBenchmark                     1024  thrpt    2   2026.590         ops/ms
Float16OperationsBenchmark.isFiniteStoreBenchmark                    1024  thrpt    2   2265.065         ops/ms
Float16OperationsBenchmark.isInfiniteBenchmark                       1024  thrpt    2   2140.280         ops/ms
Float16OperationsBenchmark.isInfiniteCMovBenchmark                   1024  thrpt    2   2026.135         ops/ms
Float16OperationsBenchmark.isInfiniteStoreBenchmark                  1024  thrpt    2   1340.694         ops/ms
Float16OperationsBenchmark.isNaNBenchmark                            1024  thrpt    2   2432.249         ops/ms
Float16OperationsBenchmark.isNaNCMovBenchmark                        1024  thrpt    2   1710.044         ops/ms
Float16OperationsBenchmark.isNaNStoreBenchmark                       1024  thrpt    2   2055.544         ops/ms
Float16OperationsBenchmark.maxBenchmark                              1024  thrpt    2  22170.178         ops/ms
Float16OperationsBenchmark.minBenchmark                              1024  thrpt    2  21735.692         ops/ms
Float16OperationsBenchmark.mulBenchmark                              1024  thrpt    2  22235.991         ops/ms
Float16OperationsBenchmark.negateBenchmark                           1024  thrpt    2  27733.529         ops/ms
Float16OperationsBenchmark.sqrtBenchmark                             1024  thrpt    2   1770.878         ops/ms
Float16OperationsBenchmark.subBenchmark                              1024  thrpt    2  21800.058         ops/ms

The Java implementation of Float16.isNaN is not auto-vectorizer friendly: the multiple conditional expressions prevent inferring conditional compare IR. Vectorization of the Java implementations of Float16.isFinite and Float16.isInfinite is possible by inferring VectorBlend for a contiguous pack of CMoveI IR in the presence of the -XX:+UseVectorCmov and -XX:+UseCMoveUnconditionally runtime flags. We plan to optimize these APIs through scalar intrinsification and subsequent auto-vectorization support in a subsequent patch.

Kindly review and share your feedback.

Best Regards,
Jatin

-------------

Commit messages:
 - Updating benchmark
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8346236
 - Updating copyright
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8346236
 - Add MinVHF/MaxVHF to commutative op list
 - Auto Vectorization support for Float16 operations.
Changes: https://git.openjdk.org/jdk/pull/22755/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22755&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8346236 Stats: 864 lines in 16 files changed: 801 ins; 10 del; 53 mod Patch: https://git.openjdk.org/jdk/pull/22755.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22755/head:pull/22755 PR: https://git.openjdk.org/jdk/pull/22755 From duke at openjdk.org Wed Feb 26 21:51:47 2025 From: duke at openjdk.org (Johannes Graham) Date: Wed, 26 Feb 2025 21:51:47 GMT Subject: RFR: 8347645: C2: XOR bounded value handling blocks constant folding [v40] In-Reply-To: References: Message-ID: > An interaction between xor bounds optimization and constant folding resulted in xor over constants not being optimized. This has a noticeable effect on `Long.expand` with a constant mask, on architectures that don't have instructions equivalent to `PDEP` to be used in an intrinsic. > > This change moves logic from the `Xor(L|I)Node::Value` methods into the `add_ring` methods, and gives priority to constant-folding. A static method was separated out to facilitate direct unit-testing. It also (subjectively) simplified the calculation of the upper bound and added an explanation of the reasoning behind it. > > In addition to testing for constant folding over xor, IR tests were added to `XorINodeIdealizationTests` and `XorLNodeIdealizationTests` to cover these related items: > - Bounds optimization of xor > - A check for `x ^ x = 0` > - Explicit testing of xor over booleans. > > Also `test_xor_node.cpp` was added to more extensively test the correctness of the bounds optimization. It exhaustively tests ranges of 4-bit numbers as well as at the high and low end of the affected types. 
>
> ---------
> ### Progress
> - [x] Change must not contain extraneous whitespace
> - [x] Commit message must refer to an issue
> - [ ] Change must be properly reviewed (2 reviews required, with at least 2 [Reviewers](https://openjdk.org/bylaws#reviewer))
>
> ### Reviewers
> * [Quan Anh Mai](https://openjdk.org/census#qamai) (@merykitty - Committer): re-review required (review applies to [cf779497](https://git.openjdk.org/jdk/pull/23089/files/cf77949776f7a4601268c7291a5743c2eb164186))
>
> ### Reviewing
>
Johannes Graham has updated the pull request incrementally with one additional commit since the last revision: invert comparison in tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23089/files - new: https://git.openjdk.org/jdk/pull/23089/files/f5674420..ec17fd24 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=39 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23089&range=38-39 Stats: 16 lines in 1 file changed: 0 ins; 0 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/23089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23089/head:pull/23089 PR: https://git.openjdk.org/jdk/pull/23089 From bulasevich at openjdk.org Thu Feb 27 00:40:15 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 27 Feb 2025 00:40:15 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v12] In-Reply-To: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: > This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache. > > The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density. > > Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1-2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark. > > The numbers. Immutable data constitutes **~30%** of the nmethod. Mutable data constitutes **~8%** of the nmethod.
Example (statistics collected on the CodeCacheStress benchmark): > - nmethod_count:134000, total_compilation_time: 510460ms > - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms, > - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB > > Functional testing: jtreg on arm/aarch/x86. > Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks. > > Alternative solution (see comments): In the future, relocations can be moved to _immutable_data. Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: returning oops back to nmethods. jtreg: Ok, performance: Ok. todo: cleanup ------------- Changes: - all: https://git.openjdk.org/jdk/pull/21276/files - new: https://git.openjdk.org/jdk/pull/21276/files/6c3370be..0dbd5029 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=21276&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21276&range=10-11 Stats: 89 lines in 9 files changed: 34 ins; 35 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/21276.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21276/head:pull/21276 PR: https://git.openjdk.org/jdk/pull/21276 From duke at openjdk.org Thu Feb 27 01:54:56 2025 From: duke at openjdk.org (Nicole Xu) Date: Thu, 27 Feb 2025 01:54:56 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 10:43:36 GMT, Emanuel Peter wrote: >> Hi @eme64, do you see any risks here? Would you please help to review the patch? Thanks. > > @xyyNicole @jatin-bhateja I think it is reasonable to just fix the benchmark so that it still has the same behaviour, just without the out-of-bounds exception. @jatin-bhateja you originally wrote the benchmark, and it could make sense if you fixed it up to what it should be more ideally.
@xyyNicole I propose that we file a follow-up RFE to fix the benchmark, and just mention that issue in the benchmark. > > What do you think? Hi, @eme64, do you have any additional comments for this patch? Is it ready to be merged? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2686604173 From xgong at openjdk.org Thu Feb 27 01:59:00 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 27 Feb 2025 01:59:00 GMT Subject: RFR: 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException [v2] In-Reply-To: <3zzpTrqxv5KaBP-FKCAWjfffVonoWr9fKE6S8lO-cTY=.48f4cb20-f9e6-473f-8156-18d1694e7496@github.com> References: <3zzpTrqxv5KaBP-FKCAWjfffVonoWr9fKE6S8lO-cTY=.48f4cb20-f9e6-473f-8156-18d1694e7496@github.com> Message-ID: On Wed, 26 Feb 2025 07:04:58 GMT, Nicole Xu wrote: >> Suite `MaskedLogicOpts.maskedLogicOperationsLong512()` failed on both x86 and AArch64 with the following error: >> >> >> java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249 >> >> >> The variable `long256_arr_idx` is misused when indexing `LongVector l2`, `l3`, `l4`, `l5` in function `maskedLogicOperationsLongKernel()` resulting in the IndexOutOfBoundsException error. On the other hand, the unified index for 128-bit, 256-bit and 512-bit species might not be proper since it leaves gaps in between when accessing the data for 128-bit and 256-bit species. This will unnecessarily include the noise due to cache misses or (on some targets) prefetching additional cache lines which are not usable, thereby impacting the crispness of microbenchmark. >> >> Hence, we improved the benchmark from several aspects, >> 1. Used sufficient number of predicated operations within the vector loop while minimizing the noise due to memory operations. >> 2. Modified the index computation logic which can now withstand any ARRAYLEN without resulting in an IOOBE. >> 3. 
Removed redundant vector read/writes to instance fields, thus eliminating significant boxing penalty which translates into throughput gains. > > Nicole Xu has updated the pull request incrementally with two additional commits since the last revision: > > - 8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException > > Suite MaskedLogicOpts.maskedLogicOperationsLong512() failed on both x86 > and AArch64 with the following error: > > ``` > java.lang.IndexOutOfBoundsException: Index 252 out of bounds for length 249 > ``` > > The variable `long256_arr_idx` is misused when indexing `LongVector l2`, > `l3`, `l4`, `l5` in function `maskedLogicOperationsLongKernel()` > resulting in the IndexOutOfBoundsException error. On the other hand, the > unified index for 128-bit, 256-bit and 512-bit species might not be > proper since it leaves gaps in between when accessing the data > for 128-bit and 256-bit species. This will unnecessarily include the > noise due to cache misses or (on some targets) prefetching additional > cache lines which are not usable, thereby impacting the crispness of > microbenchmark. > > Hence, we improved the benchmark from several aspects, > 1. Used sufficient number of predicated operations within the vector > loop while minimizing the noise due to memory operations. > 2. Modified the index computation logic which can now withstand any > ARRAYLEN without resulting in an IOOBE. > 3. Removed redundant vector read/writes to instance fields, thus > eliminating significant boxing penalty which translates into throughput > gains. > > Change-Id: Ie8a9d495b1ca5e36f1eae069ff70a815a2de00c0 > - Revert "8346954: [JMH] jdk.incubator.vector.MaskedLogicOpts fails due to IndexOutOfBoundsException" > > This reverts commit 083bedec04d5ab78a420e156e74c1257ce30aee8. Still looks good to me! 
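The second point in the quoted list, an index computation that tolerates any array length, can be sketched as follows. This is an illustration only and not the benchmark's code: the real benchmark is written in Java and its fix may differ, and `loop_bound` and `last_index_touched` are hypothetical names. The sketch assumes a loop that processes `lanes` elements per iteration (with `lanes > 0`) and must never read past `length`.

```cpp
#include <cstddef>

// Hypothetical helper: the largest multiple of `lanes` that is <= length.
// A strided loop that stops here never reads a partial vector past the end.
inline std::size_t loop_bound(std::size_t length, std::size_t lanes) {
  return length - (length % lanes);
}

// Highest element index a strided loop touches, or -1 if it never iterates.
inline long last_index_touched(std::size_t length, std::size_t lanes) {
  long last = -1;
  for (std::size_t i = 0; i < loop_bound(length, lanes); i += lanes) {
    last = static_cast<long>(i + lanes - 1);  // last lane read in this iteration
  }
  return last;
}
```

With the length 249 from the error message and 8 lanes per vector, the loop stops at 248, so the highest index touched stays within the array.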
------------- PR Comment: https://git.openjdk.org/jdk/pull/22963#issuecomment-2686608157 From xgong at openjdk.org Thu Feb 27 02:06:52 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 27 Feb 2025 02:06:52 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: <-exSdNf1CuxqYL--Mi4-L1m2Gop9bPIvdgqQEpAUIeM=.5f4936a7-31d4-45b7-bddf-e973b3687c18@github.com> References: <-exSdNf1CuxqYL--Mi4-L1m2Gop9bPIvdgqQEpAUIeM=.5f4936a7-31d4-45b7-bddf-e973b3687c18@github.com> Message-ID: <2DXecFoDHdgQSnZFZ-gqmXRxXz0nU47Eg3clS5_q1bo=.822a4d7c-988c-4f08-ad22-a81cf9fd1484@github.com> On Mon, 24 Feb 2025 09:20:59 GMT, Bhavana Kilambi wrote: >> Yes, `bsl` only accepts 8B/16B, but it can also work for other types. We need to set all bits of each lane to either 1 or 0 (e.g. `[0xffffffffffffffff, 0x0000000000000000]` for `T2D` type). You can take the implementation of `VectorBlend` as a reference. >> >> BTW, I'm currently working on adding the vector rearrange support for 2D (i.e. 128-bit long/double vector) types, and I ran into the same issues. I have tested that using a pattern with `bsl` can implement the op. The main idea is 1) compare the shuffle input with an iota index vector, and 2) choose `src` input or `swap two elements in src` based on the comparison result with `bsl`. Hope this could help you! > > Thank you for your inputs. I'll look into this. Hi @Bhavana-Kilambi, I've created a new PR https://github.com/openjdk/jdk/pull/23790 to implement the `VectorRearrange` for small lane count vector types like `2D`. I think the implementation is much the same as what we discussed here. Please let me know if you have any feedback. Thanks!
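The two-step idea quoted above (compare the shuffle with an iota vector, then select between `src` and its swapped form with a bitwise select) can be modelled in scalar code. This is a sketch under stated assumptions, not the code from either PR: `Vec2`, `bit_select` and `rearrange_2lanes` are hypothetical names, the vector has two 64-bit lanes, and every shuffle index is 0 or 1. `bit_select` stands in for what a `bsl`-style instruction does with an all-ones or all-zeros lane mask.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec2 = std::array<uint64_t, 2>;  // scalar stand-in for a 2D (2 x 64-bit) vector

// Bitwise select: take bits from `a` where mask is 1, from `b` where mask is 0.
inline uint64_t bit_select(uint64_t mask, uint64_t a, uint64_t b) {
  return (a & mask) | (b & ~mask);
}

// Rearrange two lanes. Assumes shuffle[i] is 0 or 1 for both lanes.
inline Vec2 rearrange_2lanes(const Vec2& src, const std::array<uint8_t, 2>& shuffle) {
  const Vec2 swapped = {src[1], src[0]};  // the "swap two elements in src" input
  Vec2 out{};
  for (std::size_t lane = 0; lane < 2; lane++) {
    // Step 1: compare the shuffle index with the iota index (0, 1). A match
    // yields an all-ones lane mask, a mismatch an all-zeros one.
    const uint64_t mask = (shuffle[lane] == lane) ? ~uint64_t{0} : uint64_t{0};
    // Step 2: keep the original lane on a match, else take the swapped lane.
    out[lane] = bit_select(mask, src[lane], swapped[lane]);
  }
  return out;
}
```

With only two lanes, an index that differs from the lane's own position must name the other lane, which is why a single swapped copy of `src` is enough.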
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1972691040 From dlong at openjdk.org Thu Feb 27 02:29:52 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 27 Feb 2025 02:29:52 GMT Subject: RFR: 8350097: Make Compilation::current() and Compile::current() safer [v2] In-Reply-To: <4gkyBqTfGrrSMDXw96A3V9H-bLHV0D5VFrvmTzu6k3A=.c7094806-2113-4299-b5bf-1de73b9fb05a@github.com> References: <4ELV07PUQEFeOLgzqbV3OoGjHVny5paw0Gk0awuJ3h0=.99faedbd-4909-4d8e-93eb-75d5697e797f@github.com> <4gkyBqTfGrrSMDXw96A3V9H-bLHV0D5VFrvmTzu6k3A=.c7094806-2113-4299-b5bf-1de73b9fb05a@github.com> Message-ID: On Sat, 15 Feb 2025 07:14:24 GMT, Thomas Stuefe wrote: >> Somewhat trivial. >> >> I recently hunted a bug for an hour until I realized that I had accessed ciEnv::compiler_data() as C2 `Compile` when, in fact, it was C1 `Compilation`. Stupid mistake, but an assert is easy to do and saves time. > > Thomas Stuefe has updated the pull request incrementally with two additional commits since the last revision: > > - redo > - Revert "start" > > This reverts commit e370e14abf2ee25019ed13cde9edfa24047d982d. Since the callers already need to know which compiler they are asking about, I don't see the value in forcing void* through a single interface. How about we improve things by replacing compiler_data with c1_compiler_data() and c2_compiler_data()? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23635#issuecomment-2686695012 From dlong at openjdk.org Thu Feb 27 02:41:06 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 27 Feb 2025 02:41:06 GMT Subject: RFR: 8349563: Improve AbsNode::Value() for integer types In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 05:10:04 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a small patch that improves the implementation of Value() for `AbsINode` and `AbsLNode` by returning the absolute value of the input range. 
Most of the logic is trivial except for the special case where `_lo == jint_min/jlong_min` which must return the entire type range when encountered, for which I've added a small proof in the comments. I've also added some unit tests and updated the file to limit IR check platforms with more granularity. > > Thoughts and reviews would be appreciated! src/hotspot/share/opto/subnode.cpp line 1938: > 1936: > 1937: NativeType lo_abs = uabs(t->_lo); > 1938: NativeType hi_abs = uabs(t->_hi); Converting unsigned to signed is C++ Undefined Behavior, is it not? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23685#discussion_r1972735806 From xgong at openjdk.org Thu Feb 27 03:31:57 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 27 Feb 2025 03:31:57 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 14:54:45 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. >> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
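The folding scheme described in the quoted text (multiply the upper half of the vector into the lower half until a 128-bit-sized remainder is left, then reduce that remainder as before) can be sketched in scalar code. This is a model and not the SVE implementation: `mul_reduce_folding` and `mul_reduce_linear` are hypothetical names, lanes are 32-bit unsigned integers so that wrap-around is well defined, and the lane counts are assumed to be powers of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference: multiply the lanes strictly left to right.
inline uint32_t mul_reduce_linear(const std::vector<uint32_t>& lanes) {
  uint32_t result = 1;
  for (uint32_t v : lanes) {
    result *= v;  // unsigned arithmetic wraps modulo 2^32
  }
  return result;
}

// Recursive folding: repeatedly multiply the upper half into the lower half
// until only `base_lanes` lanes remain, then finish with a linear pass.
inline uint32_t mul_reduce_folding(std::vector<uint32_t> lanes, std::size_t base_lanes) {
  std::size_t n = lanes.size();
  // Assumption of this sketch: both counts are powers of two.
  assert(n != 0 && (n & (n - 1)) == 0);
  assert(base_lanes != 0 && (base_lanes & (base_lanes - 1)) == 0);
  while (n > base_lanes) {
    n /= 2;
    for (std::size_t i = 0; i < n; i++) {
      lanes[i] *= lanes[i + n];  // lane i absorbs its partner from the upper half
    }
  }
  uint32_t result = 1;
  for (std::size_t i = 0; i < n; i++) {
    result *= lanes[i];
  }
  return result;
}
```

For integer lanes both orders give the same answer, because wrap-around multiplication is associative and commutative. Floating-point multiplication is not associative under rounding, so a folded order can differ from a strictly linear one in the last bits.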
>>
>> Benchmarks results:
>>
>> Neoverse-V1 (SVE 256-bit)
>>
>> Benchmark                 (size)   Mode     master         PR  Units
>> ByteMaxVector.MULLanes      1024  thrpt   5447.643  11455.535  ops/ms
>> ShortMaxVector.MULLanes     1024  thrpt   3388.183   7144.301  ops/ms
>> IntMaxVector.MULLanes       1024  thrpt   3010.974   4911.485  ops/ms
>> LongMaxVector.MULLanes      1024  thrpt   1539.137   2562.835  ops/ms
>> FloatMaxVector.MULLanes     1024  thrpt   1355.551   4158.128  ops/ms
>> DoubleMaxVector.MULLanes    1024  thrpt   1715.854   3284.189  ops/ms
>>
>>
>> Fujitsu A64FX (SVE 512-bit):
>>
>> Benchmark                 (size)   Mode     master         PR  Units
>> ByteMaxVector.MULLanes      1024  thrpt   1091.692   2887.798  ops/ms
>> ShortMaxVector.MULLanes     1024  thrpt    597.008   1863.338  ops/ms
>> IntMaxVector.MULLanes       1024  thrpt    510.642   1348.651  ops/ms
>> LongMaxVector.MULLanes      1024  thrpt    468.878    878.620  ops/ms
>> FloatMaxVector.MULLanes     1024  thrpt    376.284   2237.564  ops/ms
>> DoubleMaxVector.MULLanes    1024  thrpt    431.343   1646.792  ops/ms
>
> Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision:
>
> - fixup: don't modify the value in vsrc
>
>   Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
>   change, the result of recursive folding is held in vtmp1. To be able to
>   pass this intermediate result to reduce_mul_integral_le128b(), we would
>   have to use another temporary FloatRegister, as vtmp1 would essentially
>   act as vsrc. It's possible to get around this however:
>   reduce_mul_integral_le128b() is modified so it's possible to pass
>   matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
>   temporary register in rules that match to reduce_mul_integral_gt128b().
> - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting

src/hotspot/cpu/aarch64/aarch64_vector.ad line 3012:

> 3010: vReg tmp1, vReg tmp2) %{
> 3011: predicate(Matcher::vector_length_in_bytes(n->in(2)) == 8 ||
> 3012: Matcher::vector_length_in_bytes(n->in(2)) == 16);

Suggestion:

    predicate(Matcher::vector_length_in_bytes(n->in(2)) <= 16);

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2113:

> 2111: while (vector_length_in_bytes > FloatRegister::neon_vl) {
> 2112: do_recursive_folding_iteration(vtmp1, vtmp1, vtmp2);
> 2113: }

Looks a little complex. Could we simplify it with the following change? BTW, the `sve_movprfx` inside the loop can then be saved. Please correct me if I have misunderstood anything!

Suggestion:

    sve_movprfx(vtmp2, vsrc);
    while (vector_length_in_bytes > FloatRegister::neon_vl) {
      unsigned vector_length = vector_length_in_bytes / type2aelembytes(bt);
      sve_gen_mask_imm(pgtmp, bt, vector_length / 2);
      // Shuffle the upper half elements of the register to the right.
      sve_ext(vtmp1, vtmp2, vector_length_in_bytes / 2);
      sve_mul(vtmp2, elemType_to_regVariant(bt), pgtmp, vtmp1);
      vector_length_in_bytes = vector_length_in_bytes / 2;
    }

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972770786 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972768005 From haosun at openjdk.org Thu Feb 27 03:55:06 2025 From: haosun at openjdk.org (Hao Sun) Date: Thu, 27 Feb 2025 03:55:06 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: <2jvFY4hq9FPdk9e4Zg6LRPdRVhDTGgxofL-we8c-mns=.4e6ce509-67a4-4e46-a661-2b0951f88731@github.com> References: <2jvFY4hq9FPdk9e4Zg6LRPdRVhDTGgxofL-we8c-mns=.4e6ce509-67a4-4e46-a661-2b0951f88731@github.com> Message-ID: On Tue, 4 Feb 2025 18:52:55 GMT, Emanuel Peter wrote:

>> Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - fixup: don't modify the value in vsrc
>>
>>   Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
>>   change, the result of recursive folding is held in vtmp1. To be able to
>>   pass this intermediate result to reduce_mul_integral_le128b(), we would
>>   have to use another temporary FloatRegister, as vtmp1 would essentially
>>   act as vsrc. It's possible to get around this however:
>>   reduce_mul_integral_le128b() is modified so it's possible to pass
>>   matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
>>   temporary register in rules that match to reduce_mul_integral_gt128b().
>> - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2139:
>
>> 2137: // source vector to get to a 128b vector that fits into a SIMD&FP register. After that point ASIMD
>> 2138: // instructions are used.
>> 2139: void C2_MacroAssembler::reduce_mul_fp_gt128b(FloatRegister dst, BasicType bt, FloatRegister fsrc,
>
> Drive-by question:
> This is recursive folding: halve the vector and add it that way.
>
> What about the linear reduction, is that also implemented somewhere? We need that for vector reduction when we come from SuperWord, and have a strict order requirement, to avoid rounding divergences.

I have the same concern about the order issue as @eme64. Should we enable this only for the VectorAPI case, which doesn't require strict order?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972792220

From haosun at openjdk.org Thu Feb 27 03:55:06 2025
From: haosun at openjdk.org (Hao Sun)
Date: Thu, 27 Feb 2025 03:55:06 GMT
Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3]
In-Reply-To:
References:
Message-ID:

On Wed, 26 Feb 2025 14:54:45 GMT, Mikhail Ablakatov wrote:

>> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, the existing ASIMD implementation is used.
>>
>> Nothing changes for <= 128-bit long vectors, as for those the existing ASIMD implementation is still used directly.
>>
>> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
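
The ordering concern raised in this review (recursive folding versus a strict linear reduction) can be sketched in scalar Java. This is only an illustration under stated assumptions: `MulReductionOrder`, `linear` and `folded` are hypothetical names, not HotSpot or Vector API code, and the lane pairing (lane i with lane i + half) is assumed from the description of multiplying halves of the vector. For `long` lanes both orders agree, because wrapping multiplication is associative; for `float` lanes they can diverge.

```java
final class MulReductionOrder {
    // Strict-order (linear) reduction: (((init * v[0]) * v[1]) * v[2]) ...
    static float linear(float init, float[] v) {
        float acc = init;
        for (float x : v) {
            acc *= x;
        }
        return acc;
    }

    // Recursive folding: multiply the upper half into the lower half
    // until a single lane is left. Assumes a power-of-two length.
    static float folded(float init, float[] v) {
        if (v.length == 0 || Integer.bitCount(v.length) != 1) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        float[] lanes = v.clone(); // do not modify the source "vector"
        for (int len = lanes.length; len > 1; len /= 2) {
            int half = len / 2;
            for (int i = 0; i < half; i++) {
                lanes[i] *= lanes[i + half];
            }
        }
        return init * lanes[0];
    }

    // Same two strategies for long lanes (wrapping multiplication).
    static long linear(long init, long[] v) {
        long acc = init;
        for (long x : v) {
            acc *= x;
        }
        return acc;
    }

    static long folded(long init, long[] v) {
        if (v.length == 0 || Integer.bitCount(v.length) != 1) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        long[] lanes = v.clone();
        for (int len = lanes.length; len > 1; len /= 2) {
            int half = len / 2;
            for (int i = 0; i < half; i++) {
                lanes[i] *= lanes[i + half];
            }
        }
        return init * lanes[0];
    }

    public static void main(String[] args) {
        // The linear order overflows after the second multiply; the folded
        // order pairs each large lane with a small one first.
        float[] f = {1e30f, 1e30f, 1e-30f, 1e-30f};
        System.out.println("float linear: " + linear(1.0f, f));
        System.out.println("float folded: " + folded(1.0f, f));

        long[] l = {3L, 5L, 7L, 11L};
        System.out.println("long linear:  " + linear(1L, l));
        System.out.println("long folded:  " + folded(1L, l));
    }
}
```

With these inputs the two `float` results differ while the two `long` results match, which is why a folding implementation is only a drop-in replacement where strict order is not required.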
>>
>> Benchmarks results:
>>
>> Neoverse-V1 (SVE 256-bit)
>>
>>   Benchmark                 (size)   Mode    master         PR  Units
>>   ByteMaxVector.MULLanes      1024  thrpt  5447.643  11455.535 ops/ms
>>   ShortMaxVector.MULLanes     1024  thrpt  3388.183   7144.301 ops/ms
>>   IntMaxVector.MULLanes       1024  thrpt  3010.974   4911.485 ops/ms
>>   LongMaxVector.MULLanes      1024  thrpt  1539.137   2562.835 ops/ms
>>   FloatMaxVector.MULLanes     1024  thrpt  1355.551   4158.128 ops/ms
>>   DoubleMaxVector.MULLanes    1024  thrpt  1715.854   3284.189 ops/ms
>>
>>
>> Fujitsu A64FX (SVE 512-bit):
>>
>>   Benchmark                 (size)   Mode    master         PR  Units
>>   ByteMaxVector.MULLanes      1024  thrpt  1091.692   2887.798 ops/ms
>>   ShortMaxVector.MULLanes     1024  thrpt   597.008   1863.338 ops/ms
>>   IntMaxVector.MULLanes       1024  thrpt   510.642   1348.651 ops/ms
>>   LongMaxVector.MULLanes      1024  thrpt   468.878    878.620 ops/ms
>>   FloatMaxVector.MULLanes     1024  thrpt   376.284   2237.564 ops/ms
>>   DoubleMaxVector.MULLanes    1024  thrpt   431.343   1646.792 ops/ms
>
> Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision:
>
>  - fixup: don't modify the value in vsrc
>
>    Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
>    change, the result of recursive folding is held in vtmp1. To be able to
>    pass this intermediate result to reduce_mul_integral_le128b(), we would
>    have to use another temporary FloatRegister, as vtmp1 would essentially
>    act as vsrc. It's possible to get around this however:
>    reduce_mul_integral_le128b() is modified so it's possible to pass
>    matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
>    temporary register in rules that match to reduce_mul_integral_gt128b().
>  - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formating

src/hotspot/cpu/aarch64/aarch64_vector.ad line 3033:

> 3031:   format %{ "reduce_mulI_gt128b $dst, $isrc, $vsrc\t# vector (> 128 bits).
KILL $tmp1, $tmp2, $pgtmp" %} > 3032: ins_encode %{ > 3033: BasicType bt = Matcher::vector_element_basic_type(this, $vsrc); I suggest adding `assert(UseSVE > 0, "must be sve");` assertion here and for the other three `*_gt128b` rules. src/hotspot/cpu/aarch64/aarch64_vector.ad line 3043: > 3041: %} > 3042: > 3043: instruct reduce_mulL_le128b(iRegLNoSp dst, iRegL isrc, vReg vsrc) %{ I suggest using `_128b` here since only `2L` is matched here. Suggestion: instruct reduce_mulL_128b(iRegLNoSp dst, iRegL isrc, vReg vsrc) %{ src/hotspot/cpu/aarch64/aarch64_vector.ad line 3055: > 3053: %} > 3054: > 3055: instruct reduce_mulL_gt128b(iRegLNoSp dst, iRegL isrc, vReg vsrc, vReg tmp1, nit: only one tmp vReg is used here Suggestion: instruct reduce_mulL_gt128b(iRegLNoSp dst, iRegL isrc, vReg vsrc, vReg tmp, src/hotspot/cpu/aarch64/aarch64_vector.ad line 3099: > 3097: %} > 3098: > 3099: instruct reduce_mulD_le128b(vRegD dst, vRegD dsrc, vReg vsrc, vReg tmp) %{ Similar to `long` type, I suggest using `_128b` as only `2D` is matched here. src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3: > 1: /* > 2: * Copyright (c) 1997, 2025, Oracle and/or its affiliates. All rights reserved. > 3: * Copyright (c) 2014, 2025, Red Hat Inc. All rights reserved. nit: I don't think the copyright year for Red Hat needs to be updated src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3677: > 3675: INSN(sve_nots, sve_eors); // Bitwise invert predicate, setting the condition flags; an alias of sve_eors > 3676: #undef INSN > 3677: These instructions are not used any more after the follow-up commit of using `EXT`. I suggest removing them. Besides, could you also share the benchmark data after using `EXT`? I don't have >=256-bit SVE on hand and cannot test that. Thanks. 
src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2002: > 2000: assert(vector_length_in_bytes == 8 || vector_length_in_bytes == 16, "unsupported"); > 2001: assert_different_registers(vtmp1, vsrc); > 2002: assert_different_registers(vtmp1, vtmp2); nit: would be neat to use Suggestion: assert_different_registers(vsrc, vtmp1, vtmp2); src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2087: > 2085: assert(vector_length_in_bytes > FloatRegister::neon_vl, "ASIMD impl should be used instead"); > 2086: assert(vector_length_in_bytes <= FloatRegister::sve_vl_max, "unsupported vector length"); > 2087: assert(is_power_of_2(vector_length_in_bytes), "unsupported vector length"); Better to compare with `MaxVectorSize`. I suggest using `assert(length_in_bytes == MaxVectorSize, "invalid vector length");` and putting this assertion in `aarch64_vector.ad` file, i.e. inside the matching rule. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972763554 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972764334 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972765345 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972764690 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972765990 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972770302 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972770928 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1972778855 From xgong at openjdk.org Thu Feb 27 06:58:33 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 27 Feb 2025 06:58:33 GMT Subject: RFR: 8350748: VectorAPI: Method "checkMaskFromIndexSize" should be force inlined Message-ID: <18Q2Zl2ip_eFS_Y4fflgS8XYBkbwCZ468DIjP3KwhDE=.240f4182-4b02-4fac-97c8-ac659427e4a8@github.com> Method `checkMaskFromIndexSize` is called by some vector masked APIs like 
`fromArray/intoArray/fromMemorySegment/...`. It is used to check whether the index of any active lane in a mask reaches beyond the boundary of the given Array/MemorySegment. This function should be force inlined; otherwise, a VectorMask object is generated whenever the call is not inlined by the C2 compiler, which hurts the API performance a lot.

This patch calls the `VectorMask.checkFromIndexSize` method directly inside these APIs instead of `checkMaskFromIndexSize`. Since that method already has the `@ForceInline` annotation, it will be inlined and intrinsified by C2, and the expected vector instructions can then be generated. With this change, the unused `checkMaskFromIndexSize` can be removed.

Performance of some JMH benchmarks improves by up to 14x on an NVIDIA Grace CPU (AArch64 SVE2, 128-bit vectors). We also observe a similar performance improvement on an Intel CPU which supports AVX512. Following is the performance data on Grace:

Benchmark                                            Mode   Cnt  Units     Before      After    Gain
LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE      thrpt   30  ops/ms  31544.304  31610.598   1.002
LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE    thrpt   30  ops/ms   3896.202   3903.249   1.001
LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE     thrpt   30  ops/ms    570.415   7174.320  12.57
LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE       thrpt   30  ops/ms    566.694   7193.520  12.69
LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE      thrpt   30  ops/ms   3899.269   3878.258   0.994
LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE     thrpt   30  ops/ms   1134.301  16053.847  14.15
StoreMaskedIOOBEBenchmark.byteStoreArrayMaskIOOBE    thrpt   30  ops/ms  26449.558  28699.480   1.085
StoreMaskedIOOBEBenchmark.doubleStoreArrayMaskIOOBE  thrpt   30  ops/ms   1922.167   5781.077   3.007
StoreMaskedIOOBEBenchmark.floatStoreArrayMaskIOOBE   thrpt   30  ops/ms   3784.190  11789.276   3.115
StoreMaskedIOOBEBenchmark.intStoreArrayMaskIOOBE     thrpt   30  ops/ms   3694.082  15633.547   4.232
StoreMaskedIOOBEBenchmark.longStoreArrayMaskIOOBE    thrpt   30  ops/ms   1966.956   6049.790   3.075
StoreMaskedIOOBEBenchmark.shortStoreArrayMaskIOOBE   thrpt   30  ops/ms   7647.309  27412.387   3.584

-------------

Commit messages:
 - 8350748: VectorAPI: Method "checkMaskFromIndexSize" should be force inlined

Changes: https://git.openjdk.org/jdk/pull/23817/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23817&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8350748
  Stats: 213 lines in 7 files changed: 36 ins; 140 del; 37 mod
  Patch: https://git.openjdk.org/jdk/pull/23817.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23817/head:pull/23817

PR: https://git.openjdk.org/jdk/pull/23817

From epeter at openjdk.org Thu Feb 27 07:02:10 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Thu, 27 Feb 2025 07:02:10 GMT
Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4]
In-Reply-To:
References: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com>
Message-ID:

On Wed, 26 Feb 2025 10:15:48 GMT, Roland Westrelin wrote:

>>> Would it be possible and make sense to remove useless slow path loops the way it's done for predicates or zero trip guards? In `PhaseIdealLoop::build_loop_late_post_work()`, collect all `OpaqueMultiversioningNode` in a list. Then iterate over all loops the way it's done in `PhaseIdealLoop::eliminate_useless_zero_trip_guard()`, find loops marked as multi version, check we can get from the loop to the `OpaqueMultiversioningNode` and mark that one as useful. Eliminate all `OpaqueMultiversioningNode` not marked as useful. That way if some transformation such as peeling makes the loop non multi version or if the expected shape breaks for some reason, the slow loop is eliminated on next loop opts pass.
>>
>> I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size?
>>
>> I don't see it as super critical personally, as the slow_path is `delayed`, so no loop-opts are performed on it.
The overhead is minimal if we keep it until after loop-opts, I think. But I'm not against trying. It would take a bit of effort to construct test cases where we have the loop fold away after multiversion_if is added, but that is probably possible. >> >> And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: >> [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if >> >> I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. >> >> @rwestrel What do you think? > >> I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? > > Ok > >> And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if >> >> I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. > > I don't think that's a problem. When that code runs the graph is in a stable shape. There's no dead condition that needs to go through igvn to be cleaned up. We've just run igvn and haven't made any change to the graph yet. @rwestrel @vnkozlov Thank you for the reviews, and all the good questions, and ideas for follow-up RFE's ? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2687071561 From epeter at openjdk.org Thu Feb 27 07:02:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 27 Feb 2025 07:02:11 GMT Subject: Integrated: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Mon, 11 Nov 2024 14:40:09 GMT, Emanuel Peter wrote: > Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. > > **Background** > > With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. > > **Problem** > > So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. > > > MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); > MemorySegment nativeUnaligned = nativeAligned.asSlice(1); > test3(nativeUnaligned); > > > When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! > > static void test3(MemorySegment ms) { > for (int i = 0; i < RANGE; i++) { > long adr = i * 4L; > int v = ms.get(ELEMENT_LAYOUT, adr); > ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); > } > } > > > **Solution: Runtime Checks - Predicate and Multiversioning** > > Of course we could just forbid cases where we have a `native` base from vectorizing. 
But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. > > I came up with 2 options where to place the runtime checks: > - A new "auto vectorization" Parse Predicate: > - This only works when predicates are available. > - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. > - Multiversion the loop: > - Create 2 copies of the loop (fast and slow loops). > - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take > - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code. > - We "stall" the `... This pull request has now been integrated. Changeset: 885338b5 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/885338b5f38ed05d8b91efc0178b371f2f89310e Stats: 1089 lines in 27 files changed: 966 ins; 28 del; 95 mod 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory Reviewed-by: roland, kvn ------------- PR: https://git.openjdk.org/jdk/pull/22016 From dlong at openjdk.org Thu Feb 27 07:30:52 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 27 Feb 2025 07:30:52 GMT Subject: RFR: 8349921: Crash in codeBuffer.cpp:1004: guarantee(sect->end() <= tend) failed: sanity [v2] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 08:36:27 GMT, Andrew Dinn wrote: >> The compiler blob base size needs increasing in case the JDK is built without ZGC. The increment when ZGC is used can be comparably decreased. The final blob size increment when ZGC is included is over generous and can also be decreased. 
> > Andrew Dinn has updated the pull request incrementally with two additional commits since the last revision: > > - another format fix > - format fix Marked as reviewed by dlong (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23776#pullrequestreview-2646840716 From duke at openjdk.org Thu Feb 27 07:54:34 2025 From: duke at openjdk.org (Marc Chevalier) Date: Thu, 27 Feb 2025 07:54:34 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants [v5] In-Reply-To: References: Message-ID: > This collapses double shift lefts by constants in a single constant: (x << con1) << con2 => x << (con1 + con2). Care must be taken in the case con1 + con2 is bigger than the number of bits in the integer type. In this case, we must simplify to 0. > > Moreover, the simplification logic of the sign extension trick had to be improved. For instance, we use `(x << 16) >> 16` to convert a 32 bits into a 16 bits integer, with sign extension. When storing this into a 16-bit field, this can be simplified into simple `x`. But in the case where `x` is itself a left-shift expression, say `y << 3`, this PR makes the IR looks like `(y << 19) >> 16` instead of the old `((y << 3) << 16) >> 16`. The former logic didn't handle the case where the left and the right shift have different magnitude. In this PR, I generalize this simplification to cases where the left shift has a larger magnitude than the right shift. This improvement was needed not to miss vectorization opportunities: without the simplification, we have a left shift and a right shift instead of a single left shift, which confuses the type inference. > > This also works for multiplications by powers of 2 since they are already translated into shifts. 
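
The two transformations described above can be checked at the Java level with a small sketch. It assumes constant shift amounts in [0, 32) and uses hypothetical helper names (`ShiftCollapse`, `doubleShift`, `collapsed`, and the `store*` methods); it models the arithmetic only and is not the C2 implementation. Note that a naive `x << (con1 + con2)` would be wrong in Java source when the sum reaches 32, because `int` shift counts are masked to their low 5 bits, hence the explicit zero case.

```java
final class ShiftCollapse {
    // Reference form: two consecutive left shifts by constants.
    static int doubleShift(int x, int con1, int con2) {
        return (x << con1) << con2;
    }

    // Collapsed form, assuming 0 <= con1, con2 < 32.
    // If the combined shift reaches the int width, every bit is shifted out.
    static int collapsed(int x, int con1, int con2) {
        int sum = con1 + con2;
        return sum >= Integer.SIZE ? 0 : x << sum;
    }

    // Storing into a 16-bit value: direct narrowing of (y << 3).
    static short storeDirect(int y) {
        return (short) (y << 3);
    }

    // Same store with the explicit sign-extension trick, old IR shape.
    static short storeOldShape(int y) {
        return (short) (((y << 3) << 16) >> 16);
    }

    // Same store after the two left shifts have been collapsed:
    // the left shift (19) is now larger than the right shift (16).
    static short storeNewShape(int y) {
        return (short) ((y << 19) >> 16);
    }

    public static void main(String[] args) {
        int x = 5;
        System.out.println(doubleShift(x, 3, 4) == collapsed(x, 3, 4));
        System.out.println(doubleShift(x, 17, 16) == collapsed(x, 17, 16));
        // Java masks the count, so this is x << 1 rather than 0.
        System.out.println(x << 33);
        System.out.println(storeDirect(0x12345) == storeNewShape(0x12345));
    }
}
```

All three `store*` variants keep the same low 16 bits, which is the property the generalized store simplification relies on.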
>
> Thanks,
> Marc

Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision:

  + comment on why not zerocon

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/23728/files
  - new: https://git.openjdk.org/jdk/pull/23728/files/c7920903..b8c3d74f

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=23728&range=04
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23728&range=03-04

  Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/23728.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23728/head:pull/23728

PR: https://git.openjdk.org/jdk/pull/23728

From fyang at openjdk.org Thu Feb 27 08:04:54 2025
From: fyang at openjdk.org (Fei Yang)
Date: Thu, 27 Feb 2025 08:04:54 GMT
Subject: RFR: 8350095: RISC-V: Refactor string_compare [v7]
In-Reply-To:
References:
Message-ID:

On Wed, 26 Feb 2025 12:00:47 GMT, Hamlin Li wrote:

>> Hi,
>> Can you help to review this patch?
>>
>> Currently, the `string_compare` code is a bit complicated; the main reasons include:
>> 1. it has 2 pieces of code, one each for the LU and UL cases. This is not necessary, as LU and UL behave quite similarly.
>> 2. it mixes the LL/UU and LU/UL cases together. It is better to separate them, as they are quite different from each other.
>>
>> This is not good for code reading and maintaining.
>>
>>
>> So, this patch does the following refactoring:
>> 1. merge the LU and UL code into one, i.e. remove the UL code.
>> 2. separate the code into 2 methods: LL/UU and LU/UL.
>> 3. some other misc improvements.
>>
>> I could do the following refactoring in a follow-up PR, as in this PR I'm just moving and removing code, which is easier to do and to review. This applies in particular to the first item, as it needs to rewrite the existing code for the UL/LU case.
>> 1. move the alignment code of `generate_compare_long_string_different_encoding` upwards to `string_compare_long_LU`.
>> 2. make the `SHORT_STRING` case simpler.
>>
>>
>>
>> Thanks
>
> Hamlin Li has updated the pull request incrementally with two additional commits since the last revision:
>
>  - check short string
>  - rename

Latest version looks good to me. Thanks!

-------------

Marked as reviewed by fyang (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/23633#pullrequestreview-2646906669

From duke at openjdk.org Thu Feb 27 08:28:57 2025
From: duke at openjdk.org (Marc Chevalier)
Date: Thu, 27 Feb 2025 08:28:57 GMT
Subject: RFR: 8347426: Invalid value used for enum Cell in iTypeFlow::StateVector::meet_exception [v2]
In-Reply-To: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com>
References: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com>
Message-ID:

> As guessed on the JBS ticket, we have a UB when `_outer->max_locals() == 0`, because then we try to do `(Cell)(-1)` which is out of range since Cell's range is [0, `INT_MAX`].
>
> The obvious fix that is
>
>     Cell limit = local(_outer->max_locals());
>     for (Cell c = start_cell(); c < limit; c = next_cell(c)) {
>
> since `local` asserts its argument to be in [0, `outer->max_locals()`). Of course
>
>     Cell limit = (Cell)(_outer->max_locals());
>
> would work, but it seems to break (the very light) abstraction.
>
> I've also added an assert to transform the UB into a clear failure.
>
> This fix makes the UB warning go away on Mac with arm64.
> > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Introduce local_limit_cell ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23772/files - new: https://git.openjdk.org/jdk/pull/23772/files/e2ba08fa..ff7461b5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23772&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23772&range=00-01 Stats: 13 lines in 2 files changed: 2 ins; 2 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23772.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23772/head:pull/23772 PR: https://git.openjdk.org/jdk/pull/23772 From duke at openjdk.org Thu Feb 27 08:28:58 2025 From: duke at openjdk.org (Marc Chevalier) Date: Thu, 27 Feb 2025 08:28:58 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v5] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 21:13:36 GMT, Marc Chevalier wrote: >> Remove frem and drem macros nodes when the result is not used. These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Factor testing whether a node is a data proj of a pure function Extract it as a method of `Node`. I'm not quite satisfied with that, but I'm not sure what would be a better place for it. 
-------------

PR Comment: https://git.openjdk.org/jdk/pull/23694#issuecomment-2687232477

From duke at openjdk.org Thu Feb 27 08:28:58 2025
From: duke at openjdk.org (Marc Chevalier)
Date: Thu, 27 Feb 2025 08:28:58 GMT
Subject: RFR: 8347426: Invalid value used for enum Cell in iTypeFlow::StateVector::meet_exception
In-Reply-To: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com>
References: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com>
Message-ID:

On Tue, 25 Feb 2025 10:11:54 GMT, Marc Chevalier wrote:

> As guessed on the JBS ticket, we have a UB when `_outer->max_locals() == 0`, because then we try to do `(Cell)(-1)` which is out of range since Cell's range is [0, `INT_MAX`].
>
> The obvious fix that is
>
>     Cell limit = local(_outer->max_locals());
>     for (Cell c = start_cell(); c < limit; c = next_cell(c)) {
>
> since `local` asserts its argument to be in [0, `outer->max_locals()`). Of course
>
>     Cell limit = (Cell)(_outer->max_locals());
>
> would work, but it seems to break (the very light) abstraction.
>
> I've also added an assert to transform the UB into a clear failure.
>
> This fix makes the UB warning go away on Mac with arm64.
>
> Thanks,
> Marc

I went with the first one then. I also like the fact that it hides the underlying int: it's all `Cell`, whatever that is, but known to be enumerable. Unlike the second option, it also doesn't rely on the fact that locals start at 0. One could then have done `for (int i = (int)start_cell(); ...`, but doesn't that start to get heavy? (Sure, the explicit conversion might not be strictly necessary.)
------------- PR Comment: https://git.openjdk.org/jdk/pull/23772#issuecomment-2687229914 From mbaesken at openjdk.org Thu Feb 27 08:30:00 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Thu, 27 Feb 2025 08:30:00 GMT Subject: RFR: 8350683: Non-C2 / minimal JVM crashes in the build on ppc64 platforms In-Reply-To: References: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> Message-ID: On Wed, 26 Feb 2025 16:46:54 GMT, Amit Kumar wrote: > You mentioned that the crash happened during the build ? I just checked and at least the build is fine on s390x. Yes during the **minimal** build (` --with-jvm-features=minimal --with-jvm-variants=minimal` ). Seems you miss `--with-jvm-features=minimal` . ------------- PR Comment: https://git.openjdk.org/jdk/pull/23794#issuecomment-2687234167 From duke at openjdk.org Thu Feb 27 08:36:01 2025 From: duke at openjdk.org (Marc Chevalier) Date: Thu, 27 Feb 2025 08:36:01 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants [v4] In-Reply-To: References: Message-ID: <6MwnDs2m8JVTwrJQWcSxg2_fPUZnbafvfYcIyHsgX0Q=.412777c6-2add-4856-aa29-350b29abd97b@github.com> On Wed, 26 Feb 2025 16:33:05 GMT, Damon Fenacci wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> comment > > test/hotspot/jtreg/compiler/c2/irTests/LShiftINodeIdealizationTests.java line 255: > >> 253: res[0] = (short) (a[0] << 3); >> 254: return res; >> 255: } > > In the comment of method `StoreNode::Ideal_sign_extended_input` you mention shifting left by more than 24. Do you think it would be possible to have a test for that case too (shifting left by more than 16 in this case I guess)? I'm talking about is the case where the shift exceed 24, but taking into account the double shifting used to sign-extend the value. 
In the case of this test, we would get a `((a[0] << 3) << 16) >> 16` that my improvement will change into `(a[0] << 19) >> 16`which used not to be supported. The 19 here is the shift larger than 16. But I think you meant replacing the `3` in the test with something like `17`. Then, the expression will look like `((a[0] << 17) << 16) >> 16` that my change collapses into `0 >> 16`, since 16 + 17 = 33 is bigger than the length of an int. (It is tested by one of the previous cases in the same file). So nothing really exciting happens there. Or maybe you're suggesting something else I've missed? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23728#discussion_r1973111044 From amitkumar at openjdk.org Thu Feb 27 09:23:07 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 27 Feb 2025 09:23:07 GMT Subject: RFR: 8350683: Non-C2 / minimal JVM crashes in the build on ppc64 platforms In-Reply-To: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> References: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> Message-ID: On Wed, 26 Feb 2025 09:23:25 GMT, Matthias Baesken wrote: > When building a JVM without C2 (e.g. minimal) on ppc64 platforms , it crashes in the build because of unwanted dependencies to C2. 
> AIX crash is (linux ppc64le crash is similar) : > > > # Internal Error (compiledIC_ppc.cpp:141), pid=17695018, tid=258 > # Error: ShouldNotReachHere() > # > > iar: 0x0900000008800c60 libjvm.so::AixNativeCallstack::print_callstack_for_context(outputStream*, ucontext_t const*, bool, char*, unsigned long)+0x4bc (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:18 fixedparms:5 parmsonstk:1) > lr: 0x09000000087ea9b8 libjvm.so::fdStream::write(char const*, unsigned long)+0x44 (C++ uses_alloca saves_lr stores_bc gpr_saved:4 fixedparms:3 parmsonstk:1) > sp: 0x000000011023aab0 (base - 0x2DD8) > rtoc: 0x08001000a0088ff0 > |---stackaddr----| |----lrsave------|: > 0x000000011023aea0 - 0x0900000008800730 libjvm.so::os::Aix::platform_print_native_stack(outputStream*, void const*, char*, int, unsigned char*&)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:5 parmsonstk:1) > 0x000000011023af20 - 0x0900000008800644 libjvm.so::NativeStackPrinter::print_stack(outputStream*, char*, int, unsigned char*&, bool, int)+0x60 (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:6 fixedparms:7 parmsonstk:1) > 0x000000011023afc0 - 0x09000000087f6ff8 libjvm.so::VMError::report(outputStream*, bool)+0x11f0 (C++ fp_present uses_alloca saves_cr saves_lr stores_bc gpr_saved:13 fixedparms:2 parmsonstk:1) > 0x000000011023b830 - 0x09000000087e9fdc libjvm.so::VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x6a0 (C++ uses_alloca saves_lr stores_bc gpr_saved:18 fixedparms:8 parmsonstk:1) > 0x000000011023ba10 - 0x09000000087e96a0 libjvm.so::report_vm_error(char const*, int, char const*, char const*, ...)+0xa0 (C++ uses_alloca saves_lr stores_bc gpr_saved:5 fixedparms:4 parmsonstk:1) > 0x000000011023bad0 - 0x09000000087e95cc libjvm.so::report_vm_error(char const*, int, char const*)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:3 parmsonstk:1) > 0x000000011023bb50 - 
0x09000000087e956c libjvm.so::report_should_not_reach_here(char const*, int)+0x20 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) > 0x000000011023bbd0 - 0x0900000008906e5c libjvm.so::CompiledDirectCall::emit_to_interp_stub(MacroAssembler*, unsigned char*)+0x28 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) > 0x000000011023bc50 - 0x090000... Ok I ran another build with this configuration and still build was successful : bash configure \ --with-boot-jdk=$HOME/boot_jdk_23 \ --with-jtreg=$HOME/jtreg \ --with-gtest=$HOME/googletest \ --with-jmh=build/jmh/jars \ --with-jvm-variants=minimal \ --with-jvm-features=minimal \ --with-debug-level=fastdebug \ --with-native-debug-symbols=internal \ --disable-precompiled-headers ------------- PR Comment: https://git.openjdk.org/jdk/pull/23794#issuecomment-2687358811 From adinn at openjdk.org Thu Feb 27 09:34:02 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Thu, 27 Feb 2025 09:34:02 GMT Subject: RFR: 8349921: Crash in codeBuffer.cpp:1004: guarantee(sect->end() <= tend) failed: sanity [v2] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 18:18:46 GMT, Vladimir Kozlov wrote: >> Andrew Dinn has updated the pull request incrementally with two additional commits since the last revision: >> >> - another format fix >> - format fix > > Our builds and tier1 testing passed. @vnkozlov @dean-long Thanks for the reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23776#issuecomment-2687384872 From adinn at openjdk.org Thu Feb 27 09:34:03 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Thu, 27 Feb 2025 09:34:03 GMT Subject: Integrated: 8349921: Crash in codeBuffer.cpp:1004: guarantee(sect->end() <= tend) failed: sanity In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 14:48:26 GMT, Andrew Dinn wrote: > The compiler blob base size needs increasing in case the JDK is built without ZGC. The increment when ZGC is used can be comparably decreased. 
The final blob size increment when ZGC is included is over generous and can also be decreased. This pull request has now been integrated. Changeset: 4522f128 Author: Andrew Dinn URL: https://git.openjdk.org/jdk/commit/4522f128a3953e3ae885f96c463cb581eaa1e1e7 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8349921: Crash in codeBuffer.cpp:1004: guarantee(sect->end() <= tend) failed: sanity Reviewed-by: kvn, dlong ------------- PR: https://git.openjdk.org/jdk/pull/23776 From mbaesken at openjdk.org Thu Feb 27 09:55:02 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Thu, 27 Feb 2025 09:55:02 GMT Subject: RFR: 8350683: Non-C2 / minimal JVM crashes in the build on ppc64 platforms In-Reply-To: References: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> Message-ID: On Thu, 27 Feb 2025 09:19:50 GMT, Amit Kumar wrote: >> When building a JVM without C2 (e.g. minimal) on ppc64 platforms , it crashes in the build because of unwanted dependencies to C2. 
>> AIX crash is (linux ppc64le crash is similar) : >> >> >> # Internal Error (compiledIC_ppc.cpp:141), pid=17695018, tid=258 >> # Error: ShouldNotReachHere() >> # >> >> iar: 0x0900000008800c60 libjvm.so::AixNativeCallstack::print_callstack_for_context(outputStream*, ucontext_t const*, bool, char*, unsigned long)+0x4bc (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:18 fixedparms:5 parmsonstk:1) >> lr: 0x09000000087ea9b8 libjvm.so::fdStream::write(char const*, unsigned long)+0x44 (C++ uses_alloca saves_lr stores_bc gpr_saved:4 fixedparms:3 parmsonstk:1) >> sp: 0x000000011023aab0 (base - 0x2DD8) >> rtoc: 0x08001000a0088ff0 >> |---stackaddr----| |----lrsave------|: >> 0x000000011023aea0 - 0x0900000008800730 libjvm.so::os::Aix::platform_print_native_stack(outputStream*, void const*, char*, int, unsigned char*&)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:5 parmsonstk:1) >> 0x000000011023af20 - 0x0900000008800644 libjvm.so::NativeStackPrinter::print_stack(outputStream*, char*, int, unsigned char*&, bool, int)+0x60 (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:6 fixedparms:7 parmsonstk:1) >> 0x000000011023afc0 - 0x09000000087f6ff8 libjvm.so::VMError::report(outputStream*, bool)+0x11f0 (C++ fp_present uses_alloca saves_cr saves_lr stores_bc gpr_saved:13 fixedparms:2 parmsonstk:1) >> 0x000000011023b830 - 0x09000000087e9fdc libjvm.so::VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x6a0 (C++ uses_alloca saves_lr stores_bc gpr_saved:18 fixedparms:8 parmsonstk:1) >> 0x000000011023ba10 - 0x09000000087e96a0 libjvm.so::report_vm_error(char const*, int, char const*, char const*, ...)+0xa0 (C++ uses_alloca saves_lr stores_bc gpr_saved:5 fixedparms:4 parmsonstk:1) >> 0x000000011023bad0 - 0x09000000087e95cc libjvm.so::report_vm_error(char const*, int, char const*)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:3 parmsonstk:1) >> 
0x000000011023bb50 - 0x09000000087e956c libjvm.so::report_should_not_reach_here(char const*, int)+0x20 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) >> 0x000000011023bbd0 - 0x0900000008906e5c libjvm.so::CompiledDirectCall::emit_to_interp_stub(MacroAssembler*, unsigned char*)+0x28 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 p... > > Ok I ran another build with this configuration and still build was successful : > > bash configure \ > --with-boot-jdk=$HOME/boot_jdk_23 \ > --with-jtreg=$HOME/jtreg \ > --with-gtest=$HOME/googletest \ > --with-jmh=build/jmh/jars \ > --with-jvm-variants=minimal \ > --with-jvm-features=minimal \ > --with-debug-level=fastdebug \ > --with-native-debug-symbols=internal \ > --disable-precompiled-headers Hi @offamitkumar this is a bit surprising because the matcher header is only included if COMPILER2 is defined https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/s390/compiledIC_s390.cpp#L32 and this should not be the case in minimal JVM , but if this somehow works on your side, then fine :-) ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23794#issuecomment-2687438290 From redestad at openjdk.org Thu Feb 27 10:25:04 2025 From: redestad at openjdk.org (Claes Redestad) Date: Thu, 27 Feb 2025 10:25:04 GMT Subject: RFR: 8350614: [JMH] jdk.incubator.vector.VectorCommutativeOperSharingBenchmark fails In-Reply-To: References: Message-ID: <7lWJ8WK9cwcffGabA4qKJmhjCr42qIAvZftoZMiaPfY=.b0fcd7bf-2181-4711-9953-d9911bc16565@github.com> On Tue, 25 Feb 2025 02:24:33 GMT, SendaoYan wrote: > Hi all, > > The newly added JMH test jdk.incubator.vector.VectorCommutativeOperSharingBenchmark run fails "java.lang.NoClassDefFoundError: jdk/incubator/vector/Vector". > > The `@Fork(jvmArgsPrepend = ..)` in microbenchmarks should replaced as `@Fork(jvmArgs = ..)` after [JDK-8343345](https://bugs.openjdk.org/browse/JDK-8343345). Change has been verified locally, test-fix only, no risk. 
LGTM Yes, please use only `jvmArgs` in `@Fork` annotations. While the distinction between `jvmArgs`, `-Prepend` and `-Append` is subtle and somewhat arbitrary we opted ([JDK-8342958](https://bugs.openjdk.org/browse/JDK-8342958) + [JDK-8343345](https://bugs.openjdk.org/browse/JDK-8343345)) to designate `jvmArgs` to be used in benchmark annotations to set benchmark-specific flags, reserving `jvmArgsPrepend` for use by `make test` or whatever benchmarking harness you might use to set up environment flags (such as the native library path), and leaving `jvmArgsAppend` for command line extras (specifying GC etc). ------------- Marked as reviewed by redestad (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23761#pullrequestreview-2647301378 From syan at openjdk.org Thu Feb 27 11:10:07 2025 From: syan at openjdk.org (SendaoYan) Date: Thu, 27 Feb 2025 11:10:07 GMT Subject: RFR: 8350614: [JMH] jdk.incubator.vector.VectorCommutativeOperSharingBenchmark fails In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 02:24:33 GMT, SendaoYan wrote: > Hi all, > > The newly added JMH test jdk.incubator.vector.VectorCommutativeOperSharingBenchmark run fails "java.lang.NoClassDefFoundError: jdk/incubator/vector/Vector". > > The `@Fork(jvmArgsPrepend = ..)` in microbenchmarks should replaced as `@Fork(jvmArgs = ..)` after [JDK-8343345](https://bugs.openjdk.org/browse/JDK-8343345). Change has been verified locally, test-fix only, no risk. Thanks all for the reviews. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23761#issuecomment-2687625470 From syan at openjdk.org Thu Feb 27 11:10:08 2025 From: syan at openjdk.org (SendaoYan) Date: Thu, 27 Feb 2025 11:10:08 GMT Subject: Integrated: 8350614: [JMH] jdk.incubator.vector.VectorCommutativeOperSharingBenchmark fails In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 02:24:33 GMT, SendaoYan wrote: > Hi all, > > The newly added JMH test jdk.incubator.vector.VectorCommutativeOperSharingBenchmark run fails "java.lang.NoClassDefFoundError: jdk/incubator/vector/Vector". > > The `@Fork(jvmArgsPrepend = ..)` in microbenchmarks should replaced as `@Fork(jvmArgs = ..)` after [JDK-8343345](https://bugs.openjdk.org/browse/JDK-8343345). Change has been verified locally, test-fix only, no risk. This pull request has now been integrated. Changeset: acc6f19c Author: SendaoYan URL: https://git.openjdk.org/jdk/commit/acc6f19cecd1c55afab3f4d6789cfa90b472d621 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8350614: [JMH] jdk.incubator.vector.VectorCommutativeOperSharingBenchmark fails Reviewed-by: redestad ------------- PR: https://git.openjdk.org/jdk/pull/23761 From dlong at openjdk.org Thu Feb 27 11:19:02 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 27 Feb 2025 11:19:02 GMT Subject: RFR: 8347426: Invalid value used for enum Cell in iTypeFlow::StateVector::meet_exception [v2] In-Reply-To: References: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com> Message-ID: On Thu, 27 Feb 2025 08:28:57 GMT, Marc Chevalier wrote: >> As guess on the JBS ticket, we have a UB when `_outer->max_locals() == 0`, because then we try to do `(Cell)(-1)` which is out of range since Cell's range is [0, `INT_MAX`]. >> >> The obvious fix that is >> >> Cell limit = local(_outer->max_locals()); >> for (Cell c = start_cell(); c < limit; c = next_cell(c)) { >> >> since `local` asserts its argument to be in [0, `outer->max_locals()`). 
Of course >> >> Cell limit = (Cell)(_outer->max_locals()); >> >> would work, but it seems to break (the very light) abstraction. >> >> I've also added an assert to transform the UB into a clear failure. >> >> This fix makes the UB warning go away on Mac with arm64. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Introduce local_limit_cell Marked as reviewed by dlong (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23772#pullrequestreview-2647442246 From dfenacci at openjdk.org Thu Feb 27 11:55:03 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Thu, 27 Feb 2025 11:55:03 GMT Subject: RFR: 8347459: C2: missing transformation for chain of shifts/multiplications by constants [v4] In-Reply-To: <6MwnDs2m8JVTwrJQWcSxg2_fPUZnbafvfYcIyHsgX0Q=.412777c6-2add-4856-aa29-350b29abd97b@github.com> References: <6MwnDs2m8JVTwrJQWcSxg2_fPUZnbafvfYcIyHsgX0Q=.412777c6-2add-4856-aa29-350b29abd97b@github.com> Message-ID: On Thu, 27 Feb 2025 08:33:44 GMT, Marc Chevalier wrote: >> test/hotspot/jtreg/compiler/c2/irTests/LShiftINodeIdealizationTests.java line 255: >> >>> 253: res[0] = (short) (a[0] << 3); >>> 254: return res; >>> 255: } >> >> In the comment of method `StoreNode::Ideal_sign_extended_input` you mention shifting left by more than 24. Do you think it would be possible to have a test for that case too (shifting left by more than 16 in this case I guess)? > > I'm talking about is the case where the shift exceed 24, but taking into account the double shifting used to sign-extend the value. In the case of this test, we would get a `((a[0] << 3) << 16) >> 16` that my improvement will change into `(a[0] << 19) >> 16`which used not to be supported. The 19 here is the shift larger than 16. > > But I think you meant replacing the `3` in the test with something like `17`. 
Then, the expression will look like `((a[0] << 17) << 16) >> 16` that my change collapses into `0 >> 16`, since 16 + 17 = 33 is bigger than the length of an int. (It is tested by one of the previous cases in the same file). So nothing really exciting happens there. > > Or maybe you're suggesting something else I've missed? All good, I think I simply misunderstood the comment. The case is actually the one where the shift is more than 24. Thanks for the explanation! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23728#discussion_r1973430094 From chagedorn at openjdk.org Thu Feb 27 13:12:43 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 27 Feb 2025 13:12:43 GMT Subject: RFR: 8350579: Remove Template Assertion Predicates belonging to a loop once it is folded away during IGVN Message-ID: <5Nbgi31ds2bXEF3Uc9AL5DyAOUdmme6DCdvly0aY-60=.92863b9e-00b1-4894-87d1-1f460c8d5b20@github.com> The patch fixes the issue of creating an Initialized Assertion Predicate at a loop X from a Template Assertion Predicate that was originally created for a loop Y. Using the unrelated loop values from loop Y for the Initialized Assertion Predicate will let it fail during runtime and we execute a `halt` instruction. This was originally reported with [JDK-8305428](https://bugs.openjdk.org/browse/JDK-8305428). Note that most of the line changes are from new tests. ### The Problem There are multiple test cases triggering the same problem. In the following, when referring to "the test case", I'm referring to `testTemplateAssertionPredicateNotRemovedHalt()` which was written from scratch and contains more detailed comments explaining how we end up with executing a `Halt` node in more details. 
#### An Inner Loop without Parse Predicates The graph in `testTemplateAssertionPredicateNotRemovedHalt()` looks like this after creating `LoopNodes` for the outer `for` and inner `while (true)` loop: ![image](https://github.com/user-attachments/assets/7ac60e35-0b7e-4f04-b9dd-6eb8c8654a15) We only have Parse Predicates for the outer loop. Why? Before beautify loop, we have the following region which merges multiple backedges - the one from the `for` loop and the one from the `while (true)` loop: ![image](https://github.com/user-attachments/assets/7895161d-5ac1-46d6-93fe-5ab90ef24ab9) In `IdealLoopTree::merge_many_backedges()`, we notice that the hottest backedge is hot enough such that it is worth to have a separate merge point region for the inner and outer loop. We set everything up and eventually in `IdealLoopTree::split_outer_loop()`, we create a second `LoopNode`. For this inner `LoopNode`, we cannot set up `Parse Predicates` with the same UCTs as used for the outer loop. It would be incorrect when taking the trap to re-execute the inner and outer loop again while having already executed some of the outer loop's iterations. Thus, we get the graph shape with back-to-back `LoopNodes` as shown above. #### Predicates from a Folded Loop End up at Another Loop As described in the previous section, we have an inner and outer `LoopNode` while the inner does not have Parse Predicates. In a series of events (see test case comments for more details), we first hoist a range check out of the outer loop during Loop Predication with a Template Assertion Predicate. Then, we fold the outer loop away because we find that it is only running for a single iteration and the backedge is never taken. 
The Template Assertion Predicate together with the Parse Predicates end up at the inner loop running from `i = 80`: ![image](https://github.com/user-attachments/assets/9fd20b92-69f7-467e-bd3b-1756c5d5fb62) #### Creating Initialized Assertion Predicate with Wrong Loop Values We now split the inner loop by creating pre-main-post loops. In this process, we create new Template Assertion Predicates with the new init value of the main and post loop. We also create Initialized Assertion Predicates from the new templates. But these now use the init value from the inner loop, even though the Assertion Predicates were created with the loop values from the outer loop: ![image](https://github.com/user-attachments/assets/242b379e-e279-48b3-b354-fba2e67ee257) `iArrShort` has only a size of `10` but `512 Phi` takes value `80`. During runtime, this Initialized Assertion Predicate fails and we crash by executing a halt instruction. ### Proposed Solution We should remove any Template Assertion Predicate when a `CountedLoopNode` is folded away. This is implemented in `CountedLoopNode::Ideal()` to do that right during IGVN when a loop node is folded. This ensures that we do not miss any dying loop. #### Implementation Details - I introduced a new `KillTemplateAssertionPredicates` visitor to do that. This required a new `TemplateAssertionPredicate::kill_during_igvn()` method to directly operate on `PhaseIterGVN` instead of `PhaseIdealLoop`. - Regular Predicates (i.e. Runtime or Assertion Predicates) all use `If` nodes with some specific inputs (i.e. flavors of `Opaque*` nodes) or outputs (i.e. `Halt` or UCTs). Since we now use `PredicateIterator` during IGVN, we need to be more careful when a Regular Predicate is being folded away to still recognize it as a Regular Predicate. When we fail to do so, we could stop the iteration and miss predicates above. 
The existing checks are not strong enough and required the following tweaks for some situations: - An Assertion Predicate `If` has a `ConI` as input because the `Opaque*` node was already folded. - -> Also check that we have a `Halt` node on the false path (done with `AssertionPredicate::may_be_assertion_predicate_if()`). - A Regular Predicate `If` already lost one of its output. - -> Treat such a predicate as Runtime Predicate and assume that an Assertion Predicate `If` always has two outputs (done with `RuntimePredicate::is_being_folded_without_uncommon_proj()` and `AssertionPredicate::may_be_assertion_predicate_if()`). - Added some comments at the predicate classes to reflect these changes. ### Tests - Hand written tests together with some tests that triggered issues during the implementation in the initial full Assertion Predicate prototype patch. - Tests from the report of JDK-8305428 and its duplicates. Thanks, Christian ------------- Commit messages: - 8350579: Remove Template Assertion Predicates belonging to a Changes: https://git.openjdk.org/jdk/pull/23823/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23823&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350579 Stats: 571 lines in 4 files changed: 545 ins; 2 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/23823.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23823/head:pull/23823 PR: https://git.openjdk.org/jdk/pull/23823 From duke at openjdk.org Thu Feb 27 13:19:44 2025 From: duke at openjdk.org (kuaiwei) Date: Thu, 27 Feb 2025 13:19:44 GMT Subject: RFR: 8350858: [IR Framework] Some tests failed on Cascade Lake Message-ID: <55PX9f2wKFf0ixo_M34XUx2U0346AmMaqhg6T6VF_5A=.1bfb5b50-c646-485f-a42e-0ef864756f9e@github.com> I found some vector tests failed on my Cascade Lake machine. The root cause is the CPU_SKYLAKE_PATTERN can not handle cpu with 2-digits stepping. 
The failing CPU model is (lscpu output): Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 96 On-line CPU(s) list: 0-95 Thread(s) per core: 1 Core(s) per socket: 24 Socket(s): 4 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Gold 6348H CPU @ 2.30GHz Stepping: 11 CPU MHz: 2300.000 CPU max MHz: 4200.0000 CPU min MHz: 1000.0000 BogoMIPS: 4600.00 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 1024K L3 cache: 33792K NUMA node0 CPU(s): 0-95 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities The fix is trivial. PS: During investigation, I found that the vector size differs between auto-vectorization (32 bytes) and the Vector API (64 bytes) on the Cascade Lake machine. I'm wondering whether this is legacy code, and whether we can unify them to a single size.
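To illustrate the stepping problem described above, here is a minimal Java sketch. The CPU description format and all three patterns are hypothetical stand-ins (the actual CPU_SKYLAKE_PATTERN is not shown in this message); the sketch only shows that a single-digit match rejects a stepping of 11 while a multi-digit match accepts it.

```java
import java.util.regex.Pattern;

final class SteppingPatternDemo {
    // Hypothetical patterns: the real CPU_SKYLAKE_PATTERN is not shown in this thread.
    static final Pattern SINGLE_DIGIT = Pattern.compile("family 6 model 85 stepping \\d");
    static final Pattern ANY_DIGITS = Pattern.compile("family 6 model 85 stepping \\d+");
    static final Pattern ONE_OR_TWO_DIGITS = Pattern.compile("family 6 model 85 stepping \\d{1,2}");

    // Returns true only if the whole description matches the pattern.
    static boolean matches(Pattern pattern, String cpuDescription) {
        return pattern.matcher(cpuDescription).matches();
    }

    public static void main(String[] args) {
        // Hypothetical description strings; only the stepping differs.
        String stepping7 = "family 6 model 85 stepping 7";
        String stepping11 = "family 6 model 85 stepping 11";

        System.out.println(matches(SINGLE_DIGIT, stepping7));       // true
        System.out.println(matches(SINGLE_DIGIT, stepping11));      // false: two-digit stepping is rejected
        System.out.println(matches(ANY_DIGITS, stepping11));        // true
        System.out.println(matches(ONE_OR_TWO_DIGITS, stepping11)); // true
    }
}
```

Whether to bound the digit count is a style choice; both multi-digit variants accept the stepping reported by lscpu above.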
------------- Commit messages: - 8350858: [IR Framework] Some tests failed on Cascade Lake Changes: https://git.openjdk.org/jdk/pull/23824/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23824&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350858 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23824.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23824/head:pull/23824 PR: https://git.openjdk.org/jdk/pull/23824 From chagedorn at openjdk.org Thu Feb 27 13:31:01 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 27 Feb 2025 13:31:01 GMT Subject: RFR: 8350858: [IR Framework] Some tests failed on Cascade Lake In-Reply-To: <55PX9f2wKFf0ixo_M34XUx2U0346AmMaqhg6T6VF_5A=.1bfb5b50-c646-485f-a42e-0ef864756f9e@github.com> References: <55PX9f2wKFf0ixo_M34XUx2U0346AmMaqhg6T6VF_5A=.1bfb5b50-c646-485f-a42e-0ef864756f9e@github.com> Message-ID: On Thu, 27 Feb 2025 13:14:01 GMT, kuaiwei wrote: > I found some vector tests failed on my Cascade Lake machine. > > The root cause is the CPU_SKYLAKE_PATTERN can not handle cpu with 2-digits stepping. 
The failing CPU model is > > lscpu > Architecture: x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 96 > On-line CPU(s) list: 0-95 > Thread(s) per core: 1 > Core(s) per socket: 24 > Socket(s): 4 > NUMA node(s): 1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Gold 6348H CPU @ 2.30GHz > Stepping: 11 > CPU MHz: 2300.000 > CPU max MHz: 4200.0000 > CPU min MHz: 1000.0000 > BogoMIPS: 4600.00 > Virtualization: VT-x > L1d cache: 32K > L1i cache: 32K > L2 cache: 1024K > L3 cache: 33792K > NUMA node0 CPU(s): 0-95 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities > > > The fix is trivial. > > PS: During investigation, I found that the vector size differs between auto-vectorization (32 bytes) and the Vector API (64 bytes) on the Cascade Lake machine. I'm wondering whether this is legacy code, and whether we can unify them to a single size. You could also limit it to at most 2 digits with `\d{1,2}`, but I think it's also fine to just use `\d+`. ------------- Marked as reviewed by chagedorn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23824#pullrequestreview-2647781763 From roland at openjdk.org Thu Feb 27 13:33:54 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 27 Feb 2025 13:33:54 GMT Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value [v2] In-Reply-To: References: Message-ID: <2LjrBD7fBxOUVa_lPINgYdSHVQP1K7FAFLr7LhA6T8w=.4d91d888-2c24-4831-b2f7-fbbaea629a8f@github.com> On Tue, 18 Feb 2025 19:27:22 GMT, Kangcheng Xu wrote: >> [JDK-8347555](https://bugs.openjdk.org/browse/JDK-8347555) is a redo of [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495), which was [first merged](https://git.openjdk.org/jdk/pull/20754) and then backed out due to a regression. This patch redoes the feature and fixes the bit shift overflow problem. For more information please refer to the previous PR. >> >> When constanlizing multiplications (possibly in the form of `lshifts`), the multiplier is upgraded to long and then later narrowed to int if needed. However, when a `lshift` operand is exactly `32`, overflowing an int, using long has an unexpected result (i.e., `(1 << 32) = 1` and `(int) (1L << 32) = 0`). >> >> The following was implemented to address this issue. >> >> if (UseNewCode2) { >> *multiplier = bt == T_INT >> ? (jlong) (1 << con->get_int()) // loss of precision is expected for int as it overflows >> : ((jlong) 1) << con->get_int(); >> } else { >> *multiplier = ((jlong) 1 << con->get_int()); >> } >> >> >> Two new bitshift overflow tests were added. > > Kangcheng Xu has updated the pull request incrementally with two additional commits since the last revision: > > - use explicit argument types for overloaded java_shift_left() > - use java_shift_left() src/hotspot/share/opto/addnode.cpp line 523: > 521: } > 522: > 523: lhs_multiplier = bt == T_INT Can you define `java_shift_left(jlong, jint, BasicType bt)`, which then calls one or the other `java_shift_left`?
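For reference, the int versus long discrepancy mentioned in the quoted description follows from Java's shift semantics: an int shift uses only the low 5 bits of the shift distance, while a long shift uses the low 6 bits. The following is a minimal Java sketch of that behavior only; the class and method names are illustrative and are not part of the patch or of HotSpot.

```java
final class ShiftOverflowDemo {
    // Multiplier implied by `x << shift`, computed directly in int arithmetic.
    // Java masks an int shift distance to its low 5 bits, so 32 behaves like 0.
    static int intMultiplier(int shift) {
        return 1 << shift;
    }

    // Multiplier computed in long arithmetic and then narrowed to int.
    // A long shift distance is masked to 6 bits, so 32 really shifts by 32 places,
    // and the narrowing cast then drops the single set bit.
    static int narrowedLongMultiplier(int shift) {
        return (int) (1L << shift);
    }

    public static void main(String[] args) {
        System.out.println(intMultiplier(32));          // 1
        System.out.println(narrowedLongMultiplier(32)); // 0
        // For shift distances 0..31 both computations agree.
        System.out.println(intMultiplier(31) == narrowedLongMultiplier(31)); // true
    }
}
```

The two computations differ only once the shift distance reaches the int width, which is the case the new overflow tests are described as covering.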
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23506#discussion_r1973588314 From duke at openjdk.org Thu Feb 27 13:34:57 2025 From: duke at openjdk.org (kuaiwei) Date: Thu, 27 Feb 2025 13:34:57 GMT Subject: RFR: 8350858: [IR Framework] Some tests failed on Cascade Lake In-Reply-To: References: <55PX9f2wKFf0ixo_M34XUx2U0346AmMaqhg6T6VF_5A=.1bfb5b50-c646-485f-a42e-0ef864756f9e@github.com> Message-ID: On Thu, 27 Feb 2025 13:28:29 GMT, Christian Hagedorn wrote: >> I found some vector tests failed on my Cascade Lake machine. >> >> The root cause is the CPU_SKYLAKE_PATTERN can not handle cpu with 2-digits stepping. The failed cpu model is >> >> lscpu >> Architecture: x86_64 >> CPU op-mode(s): 32-bit, 64-bit >> Byte Order: Little Endian >> CPU(s): 96 >> On-line CPU(s) list: 0-95 >> Thread(s) per core: 1 >> Core(s) per socket: 24 >> Socket(s): 4 >> NUMA node(s): 1 >> Vendor ID: GenuineIntel >> CPU family: 6 >> Model: 85 >> Model name: Intel(R) Xeon(R) Gold 6348H CPU @ 2.30GHz >> Stepping: 11 >> CPU MHz: 2300.000 >> CPU max MHz: 4200.0000 >> CPU min MHz: 1000.0000 >> BogoMIPS: 4600.00 >> Virtualization: VT-x >> L1d cache: 32K >> L1i cache: 32K >> L2 cache: 1024K >> L3 cache: 33792K >> NUMA node0 CPU(s): 0-95 >> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt 
xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities >> >> >> The fix is trivial. >> >> PS: During investigation, I found that the vector size differs between auto-vectorization (32 bytes) and the Vector API (64 bytes) on the Cascade Lake machine. I'm wondering whether this is legacy code, and whether we can unify them to a single size. > > You could also limit it to at most 2 digits with `\d{1,2}`, but I think it's also fine to just use `\d+`. @chhagedorn Thanks for your quick response and approval. Do I need to wait for another reviewer? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23824#issuecomment-2687975136 From epeter at openjdk.org Thu Feb 27 13:48:03 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 27 Feb 2025 13:48:03 GMT Subject: RFR: 8350858: [IR Framework] Some tests failed on Cascade Lake In-Reply-To: <55PX9f2wKFf0ixo_M34XUx2U0346AmMaqhg6T6VF_5A=.1bfb5b50-c646-485f-a42e-0ef864756f9e@github.com> References: <55PX9f2wKFf0ixo_M34XUx2U0346AmMaqhg6T6VF_5A=.1bfb5b50-c646-485f-a42e-0ef864756f9e@github.com> Message-ID: On Thu, 27 Feb 2025 13:14:01 GMT, kuaiwei wrote: > I found some vector tests failed on my Cascade Lake machine. > > The root cause is the CPU_SKYLAKE_PATTERN can not handle cpu with 2-digits stepping.
The failing CPU model is > > lscpu > Architecture: x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 96 > On-line CPU(s) list: 0-95 > Thread(s) per core: 1 > Core(s) per socket: 24 > Socket(s): 4 > NUMA node(s): 1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Gold 6348H CPU @ 2.30GHz > Stepping: 11 > CPU MHz: 2300.000 > CPU max MHz: 4200.0000 > CPU min MHz: 1000.0000 > BogoMIPS: 4600.00 > Virtualization: VT-x > L1d cache: 32K > L1i cache: 32K > L2 cache: 1024K > L3 cache: 33792K > NUMA node0 CPU(s): 0-95 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities > > > The fix is trivial. > > PS: During investigation, I found that the vector size differs between auto-vectorization (32 bytes) and the Vector API (64 bytes) on the Cascade Lake machine. I'm wondering whether this is legacy code, and whether we can unify them to a single size. Looks good, thanks for the change! ------------- Marked as reviewed by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/23824#pullrequestreview-2647822463 From duke at openjdk.org Thu Feb 27 13:48:04 2025 From: duke at openjdk.org (kuaiwei) Date: Thu, 27 Feb 2025 13:48:04 GMT Subject: RFR: 8350858: [IR Framework] Some tests failed on Cascade Lake In-Reply-To: References: <55PX9f2wKFf0ixo_M34XUx2U0346AmMaqhg6T6VF_5A=.1bfb5b50-c646-485f-a42e-0ef864756f9e@github.com> Message-ID: On Thu, 27 Feb 2025 13:42:45 GMT, Emanuel Peter wrote: >> I found some vector tests failed on my Cascade Lake machine. >> >> The root cause is the CPU_SKYLAKE_PATTERN can not handle cpu with 2-digits stepping. The failed cpu model is >> >> lscpu >> Architecture: x86_64 >> CPU op-mode(s): 32-bit, 64-bit >> Byte Order: Little Endian >> CPU(s): 96 >> On-line CPU(s) list: 0-95 >> Thread(s) per core: 1 >> Core(s) per socket: 24 >> Socket(s): 4 >> NUMA node(s): 1 >> Vendor ID: GenuineIntel >> CPU family: 6 >> Model: 85 >> Model name: Intel(R) Xeon(R) Gold 6348H CPU @ 2.30GHz >> Stepping: 11 >> CPU MHz: 2300.000 >> CPU max MHz: 4200.0000 >> CPU min MHz: 1000.0000 >> BogoMIPS: 4600.00 >> Virtualization: VT-x >> L1d cache: 32K >> L1i cache: 32K >> L2 cache: 1024K >> L3 cache: 33792K >> NUMA node0 CPU(s): 0-95 >> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves 
cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities >> >> >> The fix is trivial. >> >> PS: During investigation, I found that the vector size differs between auto-vectorization (32 bytes) and the Vector API (64 bytes) on the Cascade Lake machine. I'm wondering whether this is legacy code, and whether we can unify them to a single size. > > Looks good, thanks for the change! @eme64 Thanks for your approval. PS: During investigation, I found that the vector size differs between auto-vectorization (32 bytes) and the Vector API (64 bytes) on the Cascade Lake machine. I'm wondering whether this is legacy code, and whether we can unify them to a single size. Could you help with this question? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23824#issuecomment-2688008833 From duke at openjdk.org Thu Feb 27 13:48:05 2025 From: duke at openjdk.org (duke) Date: Thu, 27 Feb 2025 13:48:05 GMT Subject: RFR: 8350858: [IR Framework] Some tests failed on Cascade Lake In-Reply-To: <55PX9f2wKFf0ixo_M34XUx2U0346AmMaqhg6T6VF_5A=.1bfb5b50-c646-485f-a42e-0ef864756f9e@github.com> References: <55PX9f2wKFf0ixo_M34XUx2U0346AmMaqhg6T6VF_5A=.1bfb5b50-c646-485f-a42e-0ef864756f9e@github.com> Message-ID: On Thu, 27 Feb 2025 13:14:01 GMT, kuaiwei wrote: > I found some vector tests failed on my Cascade Lake machine. > > The root cause is the CPU_SKYLAKE_PATTERN can not handle cpu with 2-digits stepping.
The failed cpu model is > > lscpu > Architecture: x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 96 > On-line CPU(s) list: 0-95 > Thread(s) per core: 1 > Core(s) per socket: 24 > Socket(s): 4 > NUMA node(s): 1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Gold 6348H CPU @ 2.30GHz > Stepping: 11 > CPU MHz: 2300.000 > CPU max MHz: 4200.0000 > CPU min MHz: 1000.0000 > BogoMIPS: 4600.00 > Virtualization: VT-x > L1d cache: 32K > L1i cache: 32K > L2 cache: 1024K > L3 cache: 33792K > NUMA node0 CPU(s): 0-95 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities > > > The fix is trivial. > > PS: During the investigation, I found that the vector size differs between auto-vectorization (32 bytes) and the Vector API (64 bytes) on a Cascade Lake machine. I'm wondering whether that is legacy code, and whether we can make them use the same size. @kuaiwei Your change (at version 29d4022c062ad1717640d2a6817e53f8667bbe50) is now ready to be sponsored by a Committer. 
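To make the root cause in the quoted report concrete, here is a small sketch. It assumes the pattern captured the stepping with a single `\d` followed by a separator; the literal pattern text, the class name `SteppingPatternSketch`, and the sample CPU description string are all hypothetical and not copied from the IR framework sources. Under that assumption, a one-digit capture cannot match `stepping 11`, while `\d+` can.

```java
// Illustrative only: the real CPU_SKYLAKE_PATTERN may be written differently.
final class SteppingPatternSketch {
    // Hypothetical "before" pattern: exactly one digit, then a space.
    static final java.util.regex.Pattern ONE_DIGIT =
        java.util.regex.Pattern.compile("family 6 model 85 stepping (\\d) ");

    // Hypothetical "after" pattern: one or more digits, then a space.
    static final java.util.regex.Pattern MULTI_DIGIT =
        java.util.regex.Pattern.compile("family 6 model 85 stepping (\\d+) ");

    // Returns true if the pattern occurs anywhere in the CPU description.
    static boolean matches(java.util.regex.Pattern p, String cpuDescription) {
        return p.matcher(cpuDescription).find();
    }

    public static void main(String[] args) {
        // Made-up description string with a two-digit stepping.
        String cascadeLake = "family 6 model 85 stepping 11 microcode 0x1";
        System.out.println("one digit:   " + matches(ONE_DIGIT, cascadeLake));
        System.out.println("multi digit: " + matches(MULTI_DIGIT, cascadeLake));
    }
}
```

With a one-digit stepping such as `stepping 7 `, both patterns match, which would explain why the problem only showed up on this machine.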
------------- PR Comment: https://git.openjdk.org/jdk/pull/23824#issuecomment-2688010807 From mdoerr at openjdk.org Thu Feb 27 13:58:04 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 27 Feb 2025 13:58:04 GMT Subject: RFR: 8350683: Non-C2 / minimal JVM crashes in the build on ppc64 platforms In-Reply-To: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> References: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> Message-ID: On Wed, 26 Feb 2025 09:23:25 GMT, Matthias Baesken wrote: > When building a JVM without C2 (e.g. minimal) on ppc64 platforms , it crashes in the build because of unwanted dependencies to C2. > AIX crash is (linux ppc64le crash is similar) : > > > # Internal Error (compiledIC_ppc.cpp:141), pid=17695018, tid=258 > # Error: ShouldNotReachHere() > # > > iar: 0x0900000008800c60 libjvm.so::AixNativeCallstack::print_callstack_for_context(outputStream*, ucontext_t const*, bool, char*, unsigned long)+0x4bc (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:18 fixedparms:5 parmsonstk:1) > lr: 0x09000000087ea9b8 libjvm.so::fdStream::write(char const*, unsigned long)+0x44 (C++ uses_alloca saves_lr stores_bc gpr_saved:4 fixedparms:3 parmsonstk:1) > sp: 0x000000011023aab0 (base - 0x2DD8) > rtoc: 0x08001000a0088ff0 > |---stackaddr----| |----lrsave------|: > 0x000000011023aea0 - 0x0900000008800730 libjvm.so::os::Aix::platform_print_native_stack(outputStream*, void const*, char*, int, unsigned char*&)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:5 parmsonstk:1) > 0x000000011023af20 - 0x0900000008800644 libjvm.so::NativeStackPrinter::print_stack(outputStream*, char*, int, unsigned char*&, bool, int)+0x60 (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:6 fixedparms:7 parmsonstk:1) > 0x000000011023afc0 - 0x09000000087f6ff8 libjvm.so::VMError::report(outputStream*, bool)+0x11f0 (C++ fp_present uses_alloca saves_cr saves_lr stores_bc 
gpr_saved:13 fixedparms:2 parmsonstk:1) > 0x000000011023b830 - 0x09000000087e9fdc libjvm.so::VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x6a0 (C++ uses_alloca saves_lr stores_bc gpr_saved:18 fixedparms:8 parmsonstk:1) > 0x000000011023ba10 - 0x09000000087e96a0 libjvm.so::report_vm_error(char const*, int, char const*, char const*, ...)+0xa0 (C++ uses_alloca saves_lr stores_bc gpr_saved:5 fixedparms:4 parmsonstk:1) > 0x000000011023bad0 - 0x09000000087e95cc libjvm.so::report_vm_error(char const*, int, char const*)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:3 parmsonstk:1) > 0x000000011023bb50 - 0x09000000087e956c libjvm.so::report_should_not_reach_here(char const*, int)+0x20 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) > 0x000000011023bbd0 - 0x0900000008906e5c libjvm.so::CompiledDirectCall::emit_to_interp_stub(MacroAssembler*, unsigned char*)+0x28 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) > 0x000000011023bc50 - 0x090000... LGTM. Thanks for fixing it! ------------- Marked as reviewed by mdoerr (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23794#pullrequestreview-2647859992 From duke at openjdk.org Thu Feb 27 13:59:06 2025 From: duke at openjdk.org (kuaiwei) Date: Thu, 27 Feb 2025 13:59:06 GMT Subject: Integrated: 8350858: [IR Framework] Some tests failed on Cascade Lake In-Reply-To: <55PX9f2wKFf0ixo_M34XUx2U0346AmMaqhg6T6VF_5A=.1bfb5b50-c646-485f-a42e-0ef864756f9e@github.com> References: <55PX9f2wKFf0ixo_M34XUx2U0346AmMaqhg6T6VF_5A=.1bfb5b50-c646-485f-a42e-0ef864756f9e@github.com> Message-ID: <3Jz73VakU5j0Z8AE1_Kg08BLLg8MU-66gjJJSz9WDgY=.62c0927f-00b9-400b-bfab-86318047c569@github.com> On Thu, 27 Feb 2025 13:14:01 GMT, kuaiwei wrote: > I found some vector tests failed on my Cascade Lake machine. 
> > The root cause is the CPU_SKYLAKE_PATTERN can not handle cpu with 2-digits stepping. The failed cpu model is > > lscpu > Architecture: x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 96 > On-line CPU(s) list: 0-95 > Thread(s) per core: 1 > Core(s) per socket: 24 > Socket(s): 4 > NUMA node(s): 1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Gold 6348H CPU @ 2.30GHz > Stepping: 11 > CPU MHz: 2300.000 > CPU max MHz: 4200.0000 > CPU min MHz: 1000.0000 > BogoMIPS: 4600.00 > Virtualization: VT-x > L1d cache: 32K > L1i cache: 32K > L2 cache: 1024K > L3 cache: 33792K > NUMA node0 CPU(s): 0-95 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities > > > The fix is trivial. > > PS: During the investigation, I found that the vector size differs between auto-vectorization (32 bytes) and the Vector API (64 bytes) on a Cascade Lake machine. I'm wondering whether that is legacy code, and whether we can make them use the same size. This pull request has now been integrated. 
Changeset: 3c9d64eb Author: Kuai Wei Committer: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/3c9d64eb07c5bc9006ef05b0ab81bdc318cccc20 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8350858: [IR Framework] Some tests failed on Cascade Lake Reviewed-by: chagedorn, epeter ------------- PR: https://git.openjdk.org/jdk/pull/23824 From chagedorn at openjdk.org Thu Feb 27 14:42:00 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 27 Feb 2025 14:42:00 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v4] In-Reply-To: References: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> Message-ID: On Tue, 25 Feb 2025 19:11:15 GMT, Daniel Lundén wrote: >> When searching for load anti-dependences in GCM, the memory state for the load is sometimes represented not only by the memory node input of the load, but also other memory nodes. Because PhaseCFG::insert_anti_dependences searches for anti-dependences only from the load's memory input, it is, therefore, possible to sometimes overlook anti-dependences. The result is that loads are potentially scheduled too late, after stores that redefine the memory states of the loads. >> >> ### Changeset >> >> It is not yet clear why multiple nodes sometimes represent the memory state of a load, nor if this is expected. We can, however, resolve all the miscompiled test cases seen in this issue by improving the idealization of Phi nodes. Specifically, there is an idealization where we split Phis through input MergeMems, that we, prior to this changeset, applied too conservatively. >> >> To illustrate the idealization and how it resolves this issue, consider the example below. >> >> ![failure-graph-1](https://github.com/user-attachments/assets/ecbd204f-bdf0-49cb-a62e-8081d08cfe0c) >> >> `64 membar_release` is a critical anti-dependence for `183 loadI`. 
The anti-dependence search starts at the load's direct memory input, `107 Phi`, and stops immediately at Phis. Therefore, the search ends at `106 Phi` and we never find `64 membar_release`. >> >> We can apply the split-through-MergeMem Phi idealization to `119 Phi`. This idealization pushes `119 Phi` through `120 MergeMem` and `121 MergeMem`, splitting it into the individual inputs of the MergeMems in the process. As a result, we replace `119 Phi` with two new Phis. One of these generated Phis has identical inputs to `107 Phi` (`106 Phi` and `104 Phi`), and further idealizations will merge this new Phi and `107 Phi`. As a result, `107 Phi` then has a Phi-free path to `64 membar_release` and we correctly discover the anti-dependence. >> >> The changeset consists of the following changes. >> - Add an analysis that allows applying the split-through-MergeMem idealization in more cases than before (including in the above example) while still ensuring termination. >> - Add a missing `ResourceMark` in `PhiNode::split_out_instance`. >> - Add multiple new regression tests in `TestGCMLoadPlacement.java`. >> >> For reference, [here](https://github.com/openjdk/jdk/pull/22852) is a previous PR with an alternative fix that we decided to discard in favor of the fix in this PR. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/ac... > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Fix subtle bug introduced in previous update I agree with this solution. Thanks for diving deep into it once more and proposing this alternative solution. Great work! I left a few comments, otherwise, looks good. Thanks for the credit! Maybe you also want to add Emanuel as contributor since he also joined in the discussions :-) src/hotspot/share/opto/cfgnode.cpp line 2393: > 2391: // non-termination. > 2392: uint merge_width = 0; > 2393: bool split_must_terminate = false; // Is splitting guaranteed to terminate? 
Needed to read it twice to understand it. How about `split_always_terminates` instead to make it more clear? src/hotspot/share/opto/cfgnode.cpp line 2435: > 2433: ResourceMark rm; > 2434: VectorSet visited; > 2435: Node_List worklist; You could also use a `Unique_Node_List` instead of a `Node_List` + `VectorSet`: Unique_Node_List worklist; for (uint i = 0; i < worklist.size(); i++) { } src/hotspot/share/opto/cfgnode.cpp line 2448: > 2446: }; > 2447: split_must_terminate = true; // Assume no circularity until proven otherwise. > 2448: while (split_must_terminate && worklist.size() > 0) { Seems like you have `split_must_terminate` in this condition because the `break` in the `else` path does not exit both loops. When using a separate method, you could directly return false when finding that `split_must_terminate` is false. Then you can remove `split_must_terminate` from this exit condition. src/hotspot/share/opto/cfgnode.cpp line 2469: > 2467: } > 2468: } > 2469: } Would it make sense to extract this to a separate method `is_split_through_mergemem_terminating()` (or something similar) which returns the value for `split_must_terminate`? The `PhiNode::Ideal()` method is already extremely large. test/hotspot/jtreg/compiler/codegen/TestGCMLoadPlacement.java line 30: > 28: * @bug 8333393 > 29: * @summary Test that loads are not scheduled too late. > 30: * @run main/othervm -XX:+UnlockDiagnosticVMOptions `PerMethodTrapLimit` is a pure product flag, so you can remove `-XX:+UnlockDiagnosticVMOptions`. Same below where you don't use `Stress*` flags (better double check again with a product build to be safe :-) ) test/hotspot/jtreg/compiler/codegen/TestGCMLoadPlacement.java line 31: > 29: * @summary Test that loads are not scheduled too late. > 30: * @run main/othervm -XX:+UnlockDiagnosticVMOptions > 31: * -XX:CompileCommand=quiet Is `quiet` required? 
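As an aside on the `Unique_Node_List` suggestion above: the idea is a worklist that ignores repeated pushes, so a single index-based loop can both visit and extend it without a separate visited set. The following is a rough Java analogue, purely illustrative; `UniqueWorklist`, `reachable`, and the toy graph are made-up names and this is not HotSpot code.

```java
// Illustrative analogue of a "unique" worklist: a list plus a membership set.
final class UniqueWorklist<T> {
    private final java.util.ArrayList<T> items = new java.util.ArrayList<>();
    private final java.util.HashSet<T> seen = new java.util.HashSet<>();

    // Appends the item only if it was never pushed before.
    boolean push(T item) {
        if (!seen.add(item)) {
            return false;
        }
        items.add(item);
        return true;
    }

    int size() { return items.size(); }

    T at(int index) { return items.get(index); }

    // Collects everything reachable from 'start'; terminates even on cycles
    // because each element is enqueued at most once.
    static <T> java.util.List<T> reachable(T start, java.util.Map<T, java.util.List<T>> edges) {
        UniqueWorklist<T> worklist = new UniqueWorklist<>();
        worklist.push(start);
        for (int i = 0; i < worklist.size(); i++) {
            for (T next : edges.getOrDefault(worklist.at(i), java.util.List.of())) {
                worklist.push(next); // duplicates are silently ignored
            }
        }
        return new java.util.ArrayList<>(worklist.items);
    }

    public static void main(String[] args) {
        java.util.Map<String, java.util.List<String>> graph = new java.util.HashMap<>();
        graph.put("a", java.util.List.of("b", "c"));
        graph.put("b", java.util.List.of("c", "a")); // cycle back to "a"
        System.out.println(reachable("a", graph));
    }
}
```

The loop bound is re-read on every iteration, so elements pushed while iterating are still visited; that is the same shape as the `for (uint i = 0; i < worklist.size(); i++)` loop in the review comment.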
test/hotspot/jtreg/compiler/codegen/TestGCMLoadPlacement.java line 101: > 99: > 100: int test() { > 101: for (int i = 0; i < 50; ++i) You should add missing braces here ------------- PR Review: https://git.openjdk.org/jdk/pull/23691#pullrequestreview-2647941285 PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1973677330 PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1973689113 PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1973691300 PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1973682276 PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1973704095 PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1973699874 PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1973705788 From bulasevich at openjdk.org Thu Feb 27 14:31:31 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 27 Feb 2025 14:31:31 GMT Subject: RFR: 8343789: Move mutable nmethod data out of CodeCache [v13] In-Reply-To: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> References: <9mDuowjpORyWudVnSB1FWCW_o1pBgMnAvJus6YGkXLs=.67ba4652-2470-448d-baa2-464e824b2fcb@github.com> Message-ID: > This change relocates mutable data (such as relocations, oops, and metadata) from the nmethod. The change follows the recent PR #18984, which relocated immutable nmethod data from the CodeCache. > > The core idea remains the same: use the CodeCache for executable code while moving additional data to the C heap. The primary motivations are improving security and enhancing code density. > > Although performance is not the main focus, testing on AArch64 CPUs, where code density plays a significant role, has shown a 1-2% performance improvement in specific scenarios, such as the CodeCacheStress test and the Renaissance Dotty benchmark. > > The numbers. Immutable data constitutes **~30%** of the nmethod. 
Mutable data constitutes **~8%** of the nmethod. Example (statistics collected on the CodeCacheStress benchmark): > - nmethod_count:134000, total_compilation_time: 510460ms > - total allocation time malloc_mutable/malloc_immutable/CodeCache_alloc: 62ms/114ms/6333ms, > - total allocation size (mutable/immutable/nmethod): 64MB/192MB/488MB > > Functional testing: jtreg on arm/aarch/x86. > Performance testing: renaissance/dacapo/SPECjvm2008 benchmarks. > > Alternative solution (see comments): In the future, relocations can be moved to _immutable_data. Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: - cleanup - returning oops back to nmethods. jtreg: Ok, performance: Ok. todo: cleanup - Address review comments: cleanup, move fields to avoid padding, fix CodeBlob purge to call os::free, fix nmethod::print, update Layout description - add a separate adrp_movk function to support targets located more than 4GB away - Force the use of movk in combination with adrp and ldr instructions to address scenarios where os::malloc allocates buffers beyond the typical ±4GB range accessible with adrp - Fixing TestFindInstMemRecursion test fail with XX:+StressReflectiveCode option: _relocation_size can exceed 64KB, in this case _metadata_offset does not fit into int16. Fix: use _oops_size int16 field to calculate metadata offset - removing dead code - a bit of cleanup and addressing review suggestions - rework movoop for not_supports_instruction_patching case: correcting in ldr_constant and relocations fixup - remove _code_end_offset - ... 
and 4 more: https://git.openjdk.org/jdk/compare/3c9d64eb...56c0cc78 ------------- Changes: https://git.openjdk.org/jdk/pull/21276/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21276&range=12 Stats: 197 lines in 7 files changed: 87 ins; 37 del; 73 mod Patch: https://git.openjdk.org/jdk/pull/21276.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21276/head:pull/21276 PR: https://git.openjdk.org/jdk/pull/21276 From chagedorn at openjdk.org Thu Feb 27 15:36:02 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 27 Feb 2025 15:36:02 GMT Subject: RFR: 8347040: C2: assert(!loop->_body.contains(in)) failed In-Reply-To: <6N-KwwoN-BWfSym13VC96nbLEr13eplzOO0S-s78Ihs=.c8e80a48-3c5f-45ca-ad22-f7b2bda3b48c@github.com> References: <6N-KwwoN-BWfSym13VC96nbLEr13eplzOO0S-s78Ihs=.c8e80a48-3c5f-45ca-ad22-f7b2bda3b48c@github.com> Message-ID: <4p-J3ooxwcZEmZmuhj_ci9u50rYIJjAoSBEs0sEe_oI=.cb74d644-a9a5-4341-8fd0-d4003c00c80d@github.com> On Tue, 18 Feb 2025 12:59:51 GMT, Roland Westrelin wrote: > `OuterStripMinedLoopNode::transform_to_counted_loop()` merges the > outer strip mined loop and the inner loop into a single loop. To > achieve that, it needs to append the nodes of the outer strip mined > loop to the body of the inner loop. To make sure each of these nodes > is appended only once, a `Unique_Node_List` is used: nodes found by > following the safepoint's inputs are first enqueued into the list and > then, each unique node should be added to the loop body. That's not > what the current code does, though, because it enqueues the nodes it > finds to the list and add them to the loop body right away. That looks good to me. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23676#pullrequestreview-2648207219 From kxu at openjdk.org Thu Feb 27 16:14:22 2025 From: kxu at openjdk.org (Kangcheng Xu) Date: Thu, 27 Feb 2025 16:14:22 GMT Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value [v3] In-Reply-To: References: Message-ID: <0veFkmXd54EJUWuPKG7GyAeAPwh0o1epmY3L1zi2NFM=.dd175f75-09e6-4f3f-8214-59e4cd0a0de2@github.com> > [JDK-8347555](https://bugs.openjdk.org/browse/JDK-8347555) is a redo of [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495), which was [first merged](https://git.openjdk.org/jdk/pull/20754) and then backed out due to a regression. This patch redoes the feature and fixes the bit shift overflow problem. For more information please refer to the previous PR. > > When constantizing multiplications (possibly in the form of `lshift`s), the multiplier is widened to long and then later narrowed to int if needed. However, when the `lshift` operand is exactly `32`, which overflows an int, using long gives an unexpected result (i.e., `(1 << 32) = 1` and `(int) (1L << 32) = 0`). > > The following was implemented to address this issue. > > if (UseNewCode2) { > *multiplier = bt == T_INT > ? (jlong) (1 << con->get_int()) // loss of precision is expected for int as it overflows > : ((jlong) 1) << con->get_int(); > } else { > *multiplier = ((jlong) 1 << con->get_int()); > } > > > Two new bitshift overflow tests were added. Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. 
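The two results quoted in the description, `(1 << 32) = 1` and `(int) (1L << 32) = 0`, can be reproduced in plain Java. This is only an illustrative sketch of the Java shift semantics involved; `ShiftOverflowSketch` and its method names are made up and are not part of the patch. An `int` shift uses only the low five bits of the shift distance, so a distance of 32 behaves like 0, whereas shifting as `long` and narrowing afterwards discards the single set bit.

```java
// Illustrative only: shows why the int and long paths disagree at shift 32.
final class ShiftOverflowSketch {
    // Multiplier computed with int arithmetic: the distance is masked to 0..31.
    static long multiplierViaInt(int shift) {
        return (long) (1 << shift);
    }

    // Multiplier computed with long arithmetic, then narrowed to int.
    static long multiplierViaLongThenNarrowed(int shift) {
        return (long) (int) (1L << shift);
    }

    public static void main(String[] args) {
        for (int shift : new int[] {3, 31, 32}) {
            System.out.println("shift " + shift
                + ": int path = " + multiplierViaInt(shift)
                + ", long-then-narrow path = " + multiplierViaLongThenNarrowed(shift));
        }
    }
}
```

For distances below 32 the two paths agree; they diverge only at exactly 32 (and beyond), which matches the case the description singles out.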
The pull request now contains 49 commits: - Merge branch 'master' into arithmetic-canonicalization - added bit shift operations accept an explicit BasicType - use explicit argument types for overloaded java_shift_left() - use java_shift_left() - remove UseNewCode - Merge branch 'master' into arithmetic-canonicalization - fix serial addition regression - remove trailing empty comments - fix comment grammar Co-authored-by: Christian Hagedorn - remove matching power-of-2 subtractions since it's already handled by Identity() - ... and 39 more: https://git.openjdk.org/jdk/compare/8323ddfe...0ffae804 ------------- Changes: https://git.openjdk.org/jdk/pull/23506/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23506&range=02 Stats: 478 lines in 5 files changed: 478 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23506.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23506/head:pull/23506 PR: https://git.openjdk.org/jdk/pull/23506 From mli at openjdk.org Thu Feb 27 16:20:03 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 27 Feb 2025 16:20:03 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v7] In-Reply-To: References: Message-ID: <9L8s4_SsiJV1ebsGw78_KN4ECxdnsxtAlOKf8x_unqw=.93608033-4983-476e-af02-d5402e2a9358@github.com> On Thu, 27 Feb 2025 08:02:39 GMT, Fei Yang wrote: > Latest version looks good to me. Thanks! Thank you! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23633#issuecomment-2688450403 From roland at openjdk.org Thu Feb 27 16:48:55 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 27 Feb 2025 16:48:55 GMT Subject: RFR: 8350579: Remove Template Assertion Predicates belonging to a loop once it is folded away during IGVN In-Reply-To: <5Nbgi31ds2bXEF3Uc9AL5DyAOUdmme6DCdvly0aY-60=.92863b9e-00b1-4894-87d1-1f460c8d5b20@github.com> References: <5Nbgi31ds2bXEF3Uc9AL5DyAOUdmme6DCdvly0aY-60=.92863b9e-00b1-4894-87d1-1f460c8d5b20@github.com> Message-ID: On Thu, 27 Feb 2025 13:07:46 GMT, Christian Hagedorn wrote: > The patch fixes the issue of creating an Initialized Assertion Predicate at a loop X from a Template Assertion Predicate that was originally created for a loop Y. Using the unrelated loop values from loop Y for the Initialized Assertion Predicate makes it fail at runtime, and we execute a `halt` instruction. This was originally reported with [JDK-8305428](https://bugs.openjdk.org/browse/JDK-8305428). > > Note that most of the line changes are from new tests. > > ### The Problem > There are multiple test cases triggering the same problem. In the following, when referring to "the test case", I'm referring to `testTemplateAssertionPredicateNotRemovedHalt()`, which was written from scratch and contains detailed comments explaining how we end up executing a `Halt` node. > > #### An Inner Loop without Parse Predicates > The graph in `testTemplateAssertionPredicateNotRemovedHalt()` looks like this after creating `LoopNodes` for the outer `for` and inner `while (true)` loop: > > ![image](https://github.com/user-attachments/assets/7ac60e35-0b7e-4f04-b9dd-6eb8c8654a15) > > We only have Parse Predicates for the outer loop. Why? 
> > Before beautify loop, we have the following region which merges multiple backedges - the one from the `for` loop and the one from the `while (true)` loop: > > ![image](https://github.com/user-attachments/assets/7895161d-5ac1-46d6-93fe-5ab90ef24ab9) > > In `IdealLoopTree::merge_many_backedges()`, we notice that the hottest backedge is hot enough that it is worth having a separate merge point region for the inner and outer loop. We set everything up and eventually in `IdealLoopTree::split_outer_loop()`, we create a second `LoopNode`. > > For this inner `LoopNode`, we cannot set up `Parse Predicates` with the same UCTs as used for the outer loop. It would be incorrect when taking the trap to re-execute the inner and outer loop again while having already executed some of the outer loop's iterations. Thus, we get the graph shape with back-to-back `LoopNodes` as shown above. > > #### Predicates from a Folded Loop End up at Another Loop > As described in the previous section, we have an inner and outer `LoopNode` while the inner does not have Parse Predicates. In a series of events (see test case comments for more details), we first hoist a range check out of the outer loop during Loop Predication with a Template Assertion Predicate. Then, we fold the outer loop away because we find that it is only running for a single iteration and the bac... That looks reasonable to me, but making sure that predicates in the process of being removed are properly stepped over feels like something that could be fragile. So I'm wondering if there would be a way to mark predicates as being for a particular loop (maybe storing the loop's node id they apply to in predicate nodes and making sure it's properly updated as loops are cloned etc.) so that a mismatch between the loop and the predicate can be detected? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23823#issuecomment-2688530981 From kvn at openjdk.org Thu Feb 27 16:49:58 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 27 Feb 2025 16:49:58 GMT Subject: RFR: 8347040: C2: assert(!loop->_body.contains(in)) failed In-Reply-To: <6N-KwwoN-BWfSym13VC96nbLEr13eplzOO0S-s78Ihs=.c8e80a48-3c5f-45ca-ad22-f7b2bda3b48c@github.com> References: <6N-KwwoN-BWfSym13VC96nbLEr13eplzOO0S-s78Ihs=.c8e80a48-3c5f-45ca-ad22-f7b2bda3b48c@github.com> Message-ID: On Tue, 18 Feb 2025 12:59:51 GMT, Roland Westrelin wrote: > `OuterStripMinedLoopNode::transform_to_counted_loop()` merges the > outer strip mined loop and the inner loop into a single loop. To > achieve that, it needs to append the nodes of the outer strip mined > loop to the body of the inner loop. To make sure each of these nodes > is appended only once, a `Unique_Node_List` is used: nodes found by > following the safepoint's inputs are first enqueued into the list and > then, each unique node should be added to the loop body. That's not > what the current code does, though, because it enqueues the nodes it > finds to the list and add them to the loop body right away. LGTM ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23676#pullrequestreview-2648433904 From roland at openjdk.org Thu Feb 27 16:49:59 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 27 Feb 2025 16:49:59 GMT Subject: RFR: 8347040: C2: assert(!loop->_body.contains(in)) failed In-Reply-To: <4p-J3ooxwcZEmZmuhj_ci9u50rYIJjAoSBEs0sEe_oI=.cb74d644-a9a5-4341-8fd0-d4003c00c80d@github.com> References: <6N-KwwoN-BWfSym13VC96nbLEr13eplzOO0S-s78Ihs=.c8e80a48-3c5f-45ca-ad22-f7b2bda3b48c@github.com> <4p-J3ooxwcZEmZmuhj_ci9u50rYIJjAoSBEs0sEe_oI=.cb74d644-a9a5-4341-8fd0-d4003c00c80d@github.com> Message-ID: <_PFoSs2AGGAElHfzDOoYvND_3IxE0OHp6xkc8cLtWiw=.3f6b3bf4-c0f7-4116-86c2-2aa4c0d4fe46@github.com> On Thu, 27 Feb 2025 15:33:24 GMT, Christian Hagedorn wrote: >> `OuterStripMinedLoopNode::transform_to_counted_loop()` merges the >> outer strip mined loop and the inner loop into a single loop. To >> achieve that, it needs to append the nodes of the outer strip mined >> loop to the body of the inner loop. To make sure each of these nodes >> is appended only once, a `Unique_Node_List` is used: nodes found by >> following the safepoint's inputs are first enqueued into the list and >> then, each unique node should be added to the loop body. That's not >> what the current code does, though, because it enqueues the nodes it >> finds to the list and add them to the loop body right away. > > That looks good to me. 
@chhagedorn @vnkozlov thanks for the reviews ------------- PR Comment: https://git.openjdk.org/jdk/pull/23676#issuecomment-2688532742 From roland at openjdk.org Thu Feb 27 16:50:00 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 27 Feb 2025 16:50:00 GMT Subject: Integrated: 8347040: C2: assert(!loop->_body.contains(in)) failed In-Reply-To: <6N-KwwoN-BWfSym13VC96nbLEr13eplzOO0S-s78Ihs=.c8e80a48-3c5f-45ca-ad22-f7b2bda3b48c@github.com> References: <6N-KwwoN-BWfSym13VC96nbLEr13eplzOO0S-s78Ihs=.c8e80a48-3c5f-45ca-ad22-f7b2bda3b48c@github.com> Message-ID: On Tue, 18 Feb 2025 12:59:51 GMT, Roland Westrelin wrote: > `OuterStripMinedLoopNode::transform_to_counted_loop()` merges the > outer strip mined loop and the inner loop into a single loop. To > achieve that, it needs to append the nodes of the outer strip mined > loop to the body of the inner loop. To make sure each of these nodes > is appended only once, a `Unique_Node_List` is used: nodes found by > following the safepoint's inputs are first enqueued into the list and > then, each unique node should be added to the loop body. That's not > what the current code does, though, because it enqueues the nodes it > finds to the list and add them to the loop body right away. This pull request has now been integrated. 
Changeset: 939815fd Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/939815fdcfd046b00b331e085c7b6c5ced0f5dbe Stats: 59 lines in 2 files changed: 57 ins; 2 del; 0 mod 8347040: C2: assert(!loop->_body.contains(in)) failed Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/23676 From roland at openjdk.org Thu Feb 27 16:57:27 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 27 Feb 2025 16:57:27 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v10] In-Reply-To: References: Message-ID: <6P25Yy-0rkWudVp20tNwD1bWeozNUD0UoPdDlJIN7wc=.b07e7461-7af0-4fab-aa8b-a737b0b40591@github.com> > This change refactors `RShiftI`/`RshiftL` `Ideal`, `Identity` and > `Value` because the `int` and `long` versions are very similar and so > there's no logic duplication. In the process, support for some extra > transformations is added to `RShiftL`. I also added some new test > cases. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 17 additional commits since the last revision: - review - Merge branch 'master' into JDK-8349361 - review - review - review - Merge branch 'master' into JDK-8349361 - Update src/hotspot/share/opto/mulnode.cpp Co-authored-by: Emanuel Peter - Update src/hotspot/share/opto/mulnode.cpp Co-authored-by: Emanuel Peter - review - Update src/hotspot/share/opto/mulnode.hpp Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> - ... 
and 7 more: https://git.openjdk.org/jdk/compare/c4fb8d3d...d3b1cf08 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23438/files - new: https://git.openjdk.org/jdk/pull/23438/files/5b05d222..d3b1cf08 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23438&range=08-09 Stats: 5943 lines in 194 files changed: 3935 ins; 1352 del; 656 mod Patch: https://git.openjdk.org/jdk/pull/23438.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23438/head:pull/23438 PR: https://git.openjdk.org/jdk/pull/23438 From roland at openjdk.org Thu Feb 27 16:57:30 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 27 Feb 2025 16:57:30 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v9] In-Reply-To: References: <4RO3ysBh6pWId8Na0pTdO9X5sBvCh2F5l-KO3OdHF4k=.31a599cc-195f-4330-a4ce-0618209635de@github.com> Message-ID: On Mon, 24 Feb 2025 15:46:45 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 15 additional commits since the last revision: >> >> - review >> - review >> - review >> - Merge branch 'master' into JDK-8349361 >> - Update src/hotspot/share/opto/mulnode.cpp >> >> Co-authored-by: Emanuel Peter >> - Update src/hotspot/share/opto/mulnode.cpp >> >> Co-authored-by: Emanuel Peter >> - review >> - Update src/hotspot/share/opto/mulnode.hpp >> >> Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> >> - Update src/hotspot/share/opto/mulnode.cpp >> >> Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> >> - Update src/hotspot/share/opto/mulnode.cpp >> >> Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> >> - ... 
and 5 more: https://git.openjdk.org/jdk/compare/5d94f680...5b05d222 > > test/hotspot/jtreg/compiler/c2/irTests/RShiftLNodeIdealizationTests.java line 125: > >> 123: final static int test7Shift = RunInfo.getRandom().nextInt(32) + 32; >> 124: final static long test7Min = -1L << (64 - test7Shift -1); >> 125: final static long test7Max = ~test7Min; > > Would you mind adding a quick comment about why you chose the values the way you do? Added in new commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1973984640 From roland at openjdk.org Thu Feb 27 16:57:30 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 27 Feb 2025 16:57:30 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v9] In-Reply-To: <7D7P522wI5bsOeRi7aSh8QXNAlhqmlCi0-RBMl4Khpo=.bfc05889-fc07-4b5d-b68a-4c8c31f9e9ae@github.com> References: <4RO3ysBh6pWId8Na0pTdO9X5sBvCh2F5l-KO3OdHF4k=.31a599cc-195f-4330-a4ce-0618209635de@github.com> <3Y2_P27vCJLrsTslSkZtoSFkuLi1dOjHP-CSMysqdFk=.d8f55e46-8eb0-4157-ae24-a52e3d71cb68@github.com> <7D7P522wI5bsOeRi7aSh8QXNAlhqmlCi0-RBMl4Khpo=.bfc05889-fc07-4b5d-b68a-4c8c31f9e9ae@github.com> Message-ID: <0X5BJ06C-elm5QokHglxUEFyIU0rzfijW2DHYaH-7Gg=.016ab898-30d3-4f63-a933-6eb89ed8f088@github.com> On Tue, 25 Feb 2025 12:25:46 GMT, Emanuel Peter wrote: >> The transformation only happens if the amounts we shift left and right are the same. So if they are random, the transformation won't apply most of the time and, rarely, it will (because they will turn out to be the same). I'm not sure how to write an IR test then. > > You would not have to assert anything about the IR, just do value verification in that case. Added in new commit. Is that what you had in mind? 
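A minimal sketch of the value-verification idea discussed above, assuming only the equal-shift case `(x << s) >> s`. It does not reproduce the test that was actually added to the PR. The class and method names (`ShiftPairCheck`, `shiftPair`, `signExtendLow`) are hypothetical. The result of the shift pair is compared against an independently written sign-extension reference, with no assertion about the IR.

```java
import java.util.Random;

public class ShiftPairCheck {
    // Expression under test: left shift followed by arithmetic right shift.
    static long shiftPair(long x, int left, int right) {
        return (x << left) >> right;
    }

    // Independent reference for the equal-shift case:
    // sign-extend the low (64 - s) bits of x.
    static long signExtendLow(long x, int s) {
        int kept = 64 - (s & 63); // Java masks long shift counts to 6 bits
        if (kept == 64) {
            return x;             // shifting by 0 leaves x unchanged
        }
        long low = x & ((1L << kept) - 1);
        long sign = 1L << (kept - 1);
        return (low ^ sign) - sign;
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        for (int i = 0; i < 100_000; i++) {
            long x = r.nextLong();
            int s = r.nextInt(64);
            if (shiftPair(x, s, s) != signExtendLow(x, s)) {
                throw new AssertionError("mismatch for x=" + x + " s=" + s);
            }
        }
        System.out.println("ok");
    }
}
```

With random, unequal shift amounts the same pattern applies, but the reference would have to be the interpreted result of the same expression rather than the closed form used here.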
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1973985047 From roland at openjdk.org Thu Feb 27 16:57:27 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 27 Feb 2025 16:57:27 GMT Subject: RFR: 8349361: C2: RShiftL should support all applicable transformations that RShiftI does [v10] In-Reply-To: References: <4RO3ysBh6pWId8Na0pTdO9X5sBvCh2F5l-KO3OdHF4k=.31a599cc-195f-4330-a4ce-0618209635de@github.com> Message-ID: On Mon, 24 Feb 2025 15:39:19 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 17 additional commits since the last revision: >> >> - review >> - Merge branch 'master' into JDK-8349361 >> - review >> - review >> - review >> - Merge branch 'master' into JDK-8349361 >> - Update src/hotspot/share/opto/mulnode.cpp >> >> Co-authored-by: Emanuel Peter >> - Update src/hotspot/share/opto/mulnode.cpp >> >> Co-authored-by: Emanuel Peter >> - review >> - Update src/hotspot/share/opto/mulnode.hpp >> >> Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> >> - ... and 7 more: https://git.openjdk.org/jdk/compare/c4fb8d3d...d3b1cf08 > > src/hotspot/share/opto/mulnode.cpp line 1417: > >> 1415: if (shift == 0) return t1; >> 1416: // Calculate reasonably aggressive bounds for the result. >> 1417: // This is necessary if we are to correctly type things > > This is the same code as above, right? As commented above: it would be good to not move it and reduce the size of the diff. I reordered the changes. Can you take another look? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23438#discussion_r1973986205 From kvn at openjdk.org Thu Feb 27 17:12:04 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 27 Feb 2025 17:12:04 GMT Subject: RFR: 8347426: Invalid value used for enum Cell in iTypeFlow::StateVector::meet_exception [v2] In-Reply-To: References: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com> Message-ID: On Thu, 27 Feb 2025 08:28:57 GMT, Marc Chevalier wrote: >> As guess on the JBS ticket, we have a UB when `_outer->max_locals() == 0`, because then we try to do `(Cell)(-1)` which is out of range since Cell's range is [0, `INT_MAX`]. >> >> The obvious fix that is >> >> Cell limit = local(_outer->max_locals()); >> for (Cell c = start_cell(); c < limit; c = next_cell(c)) { >> >> since `local` asserts its argument to be in [0, `outer->max_locals()`). Of course >> >> Cell limit = (Cell)(_outer->max_locals()); >> >> would work, but it seems to break (the very light) abstraction. >> >> I've also added an assert to transform the UB into a clear failure. >> >> This fix makes the UB warning go away on Mac with arm64. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Introduce local_limit_cell Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23772#pullrequestreview-2648501004 From kvn at openjdk.org Thu Feb 27 17:36:11 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 27 Feb 2025 17:36:11 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v5] In-Reply-To: References: Message-ID: <-9gm4aPe18_NVFUUQjkOSpAtT_yjZ7efMZX8TPabcFk=.3f2b0a3b-6d75-4ac1-80b0-51e110cf94d1@github.com> On Tue, 25 Feb 2025 21:13:36 GMT, Marc Chevalier wrote: >> Remove frem and drem macros nodes when the result is not used. 
These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Factor testing whether a node is a data proj of a pure function Good. src/hotspot/share/opto/node.cpp line 2952: > 2950: // the local in the caller. > 2951: bool Node::is_data_proj_of_pure_function(const Node* maybe_pure_function) const { > 2952: return Opcode() == Op_Proj && static_cast(this)->_con == TypeFunc::Parms && maybe_pure_function->is_pure_function(); I wish we had an other Node's accessor to check exact class to avoid calling `Opcode()` which is virtual. Currently `is_Proj()` will be `true` for all subclasses too and we don't want it here (and in some other places). ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23694#pullrequestreview-2648569878 PR Review Comment: https://git.openjdk.org/jdk/pull/23694#discussion_r1974052823 From duke at openjdk.org Thu Feb 27 17:47:02 2025 From: duke at openjdk.org (Marc Chevalier) Date: Thu, 27 Feb 2025 17:47:02 GMT Subject: RFR: 8347426: Invalid value used for enum Cell in iTypeFlow::StateVector::meet_exception [v2] In-Reply-To: References: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com> Message-ID: On Thu, 27 Feb 2025 08:28:57 GMT, Marc Chevalier wrote: >> As guess on the JBS ticket, we have a UB when `_outer->max_locals() == 0`, because then we try to do `(Cell)(-1)` which is out of range since Cell's range is [0, `INT_MAX`]. >> >> The obvious fix that is >> >> Cell limit = local(_outer->max_locals()); >> for (Cell c = start_cell(); c < limit; c = next_cell(c)) { >> >> since `local` asserts its argument to be in [0, `outer->max_locals()`). 
Of course >> >> Cell limit = (Cell)(_outer->max_locals()); >> >> would work, but it seems to break (the very light) abstraction. >> >> I've also added an assert to transform the UB into a clear failure. >> >> This fix makes the UB warning go away on Mac with arm64. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Introduce local_limit_cell Thanks @dean-long and @vnkozlov ------------- PR Comment: https://git.openjdk.org/jdk/pull/23772#issuecomment-2688668804 From duke at openjdk.org Thu Feb 27 17:47:02 2025 From: duke at openjdk.org (duke) Date: Thu, 27 Feb 2025 17:47:02 GMT Subject: RFR: 8347426: Invalid value used for enum Cell in iTypeFlow::StateVector::meet_exception [v2] In-Reply-To: References: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com> Message-ID: <-XvLBrimeRqoe9lDyuqQw_1wEtXxTrgPYyMPX01oCBM=.0115a0aa-7620-491e-ad5a-bae2ede42a4b@github.com> On Thu, 27 Feb 2025 08:28:57 GMT, Marc Chevalier wrote: >> As guess on the JBS ticket, we have a UB when `_outer->max_locals() == 0`, because then we try to do `(Cell)(-1)` which is out of range since Cell's range is [0, `INT_MAX`]. >> >> The obvious fix that is >> >> Cell limit = local(_outer->max_locals()); >> for (Cell c = start_cell(); c < limit; c = next_cell(c)) { >> >> since `local` asserts its argument to be in [0, `outer->max_locals()`). Of course >> >> Cell limit = (Cell)(_outer->max_locals()); >> >> would work, but it seems to break (the very light) abstraction. >> >> I've also added an assert to transform the UB into a clear failure. >> >> This fix makes the UB warning go away on Mac with arm64. 
>> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Introduce local_limit_cell @marc-chevalier Your change (at version ff7461b54c7fc75da5e89febf6f5236807ee278b) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23772#issuecomment-2688671110 From pchilanomate at openjdk.org Thu Feb 27 17:49:07 2025 From: pchilanomate at openjdk.org (Patricio Chilano Mateo) Date: Thu, 27 Feb 2025 17:49:07 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v3] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Wed, 19 Feb 2025 00:37:14 GMT, Dean Long wrote: >> When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. >> >> In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. >> >> Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. I need help testing those asserts for arm32, ppc, riscv, and s390. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > Stricter assertion on ppc64 Marked as reviewed by pchilanomate (Reviewer). 
src/hotspot/share/runtime/deoptimization.cpp line 645: > 643: methodHandle method(current, deopt_sender.interpreter_frame_method()); > 644: Bytecode_invoke cur(method, deopt_sender.interpreter_frame_bci()); > 645: if (!cur.is_invokedynamic() && MethodHandles::has_member_arg(cur.klass(), cur.name())) { I was confused by this new condition, but I see it is the same one we have in `vframeArray::unpack_to_stack()`. ------------- PR Review: https://git.openjdk.org/jdk/pull/23557#pullrequestreview-2648596315 PR Review Comment: https://git.openjdk.org/jdk/pull/23557#discussion_r1974069132 From duke at openjdk.org Thu Feb 27 18:10:01 2025 From: duke at openjdk.org (Marc Chevalier) Date: Thu, 27 Feb 2025 18:10:01 GMT Subject: Integrated: 8347426: Invalid value used for enum Cell in iTypeFlow::StateVector::meet_exception In-Reply-To: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com> References: <4YvbDYnijRTuKW2Os3WjhsMwm1ngY3UBdBqPnpoyB4Y=.a9e95476-f79f-42e3-ba70-8263e0778f12@github.com> Message-ID: On Tue, 25 Feb 2025 10:11:54 GMT, Marc Chevalier wrote: > As guessed on the JBS ticket, we have UB when `_outer->max_locals() == 0`, because then we try to do `(Cell)(-1)`, which is out of range since Cell's range is [0, `INT_MAX`]. > > The obvious fix is > > Cell limit = local(_outer->max_locals()); > for (Cell c = start_cell(); c < limit; c = next_cell(c)) { > > since `local` asserts its argument to be in [0, `outer->max_locals()`). Of course > > Cell limit = (Cell)(_outer->max_locals()); > > would work, but it seems to break the (very light) abstraction. > > I've also added an assert to turn the UB into a clear failure. > > This fix makes the UB warning go away on Mac with arm64. > > Thanks, > Marc This pull request has now been integrated. 
Changeset: 2fd71561 Author: Marc Chevalier Committer: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/2fd71561107a5226f44e1732b646e43a82566eb3 Stats: 5 lines in 2 files changed: 3 ins; 0 del; 2 mod 8347426: Invalid value used for enum Cell in iTypeFlow::StateVector::meet_exception Reviewed-by: dlong, kvn ------------- PR: https://git.openjdk.org/jdk/pull/23772 From adinn at openjdk.org Thu Feb 27 18:07:03 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Thu, 27 Feb 2025 18:07:03 GMT Subject: RFR: JKK-8350893: Use generated names for hand generated opto runtime blobs Message-ID: The two special case opto runtime blobs that support uncommon trap and exception handling are currently being generated using hard wired blob names determined by port-specific code. They should employ the standard blob names generated from shared declarations in file stubDeclarations.hpp. ------------- Commit messages: - use generated names when creating opto runtime blobs Changes: https://git.openjdk.org/jdk/pull/23829/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23829&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350893 Stats: 29 lines in 9 files changed: 14 ins; 0 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/23829.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23829/head:pull/23829 PR: https://git.openjdk.org/jdk/pull/23829 From epeter at openjdk.org Thu Feb 27 18:13:07 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 27 Feb 2025 18:13:07 GMT Subject: RFR: 8350858: [IR Framework] Some tests failed on Cascade Lake In-Reply-To: References: <55PX9f2wKFf0ixo_M34XUx2U0346AmMaqhg6T6VF_5A=.1bfb5b50-c646-485f-a42e-0ef864756f9e@github.com> Message-ID: On Thu, 27 Feb 2025 13:42:45 GMT, Emanuel Peter wrote: >> I found some vector tests failed on my Cascade Lake machine. >> >> The root cause is the CPU_SKYLAKE_PATTERN can not handle cpu with 2-digits stepping. 
The failed cpu model is >> >> lscpu >> Architecture: x86_64 >> CPU op-mode(s): 32-bit, 64-bit >> Byte Order: Little Endian >> CPU(s): 96 >> On-line CPU(s) list: 0-95 >> Thread(s) per core: 1 >> Core(s) per socket: 24 >> Socket(s): 4 >> NUMA node(s): 1 >> Vendor ID: GenuineIntel >> CPU family: 6 >> Model: 85 >> Model name: Intel(R) Xeon(R) Gold 6348H CPU @ 2.30GHz >> Stepping: 11 >> CPU MHz: 2300.000 >> CPU max MHz: 4200.0000 >> CPU min MHz: 1000.0000 >> BogoMIPS: 4600.00 >> Virtualization: VT-x >> L1d cache: 32K >> L1i cache: 32K >> L2 cache: 1024K >> L3 cache: 33792K >> NUMA node0 CPU(s): 0-95 >> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities >> >> >> The fix is trival. >> >> PS, In vestigation, I found the vector size is different between auto-vectorization(32-bytes) and vector-api(64-bytes) in cascade cake machine. I'm wondering if it's a legacy code. And can we make them as a unique size? > > Looks good, thanks for the change! > @eme64 Thanks for your approve. > > PS, In vestigation, I found the vector size is different between auto-vectorization(32-bytes) and vector-api(64-bytes) in cascade cake machine. 
I'm wondering whether it's legacy code. Can we unify them to a single size? > > Can you help check this question? If I remember correctly, there are some machines where large vectors led to slowdowns. But I don't know the details; I think you'd have to do some research yourself on this. These things were decided before my time. Maybe Intel folks know more here. If you find some more details / code snippets, that could also be a start ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23824#issuecomment-2688729298 From kxu at openjdk.org Thu Feb 27 18:17:19 2025 From: kxu at openjdk.org (Kangcheng Xu) Date: Thu, 27 Feb 2025 18:17:19 GMT Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value [v4] In-Reply-To: References: Message-ID: > [JDK-8347555](https://bugs.openjdk.org/browse/JDK-8347555) is a redo of [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495), which was [first merged](https://git.openjdk.org/jdk/pull/20754) and then backed out due to a regression. This patch redoes the feature and fixes the bit shift overflow problem. For more information please refer to the previous PR. > > When constantizing multiplications (possibly in the form of `lshifts`), the multiplier is upgraded to long and then later narrowed to int if needed. However, when a `lshift` operand is exactly `32`, overflowing an int, using long has an unexpected result (i.e., `(1 << 32) = 1` and `(int) (1L << 32) = 0`). > > The following was implemented to address this issue. > > if (UseNewCode2) { > *multiplier = bt == T_INT > ? (jlong) (1 << con->get_int()) // loss of precision is expected for int as it overflows > : ((jlong) 1) << con->get_int(); > } else { > *multiplier = ((jlong) 1 << con->get_int()); > } > > > Two new bitshift overflow tests were added. 
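The `(1 << 32) = 1` versus `(int) (1L << 32) = 0` discrepancy described above can be reproduced at the Java level. This is a small sketch, not code from the PR. The class and method names are hypothetical. It relies only on the Java rule that shift counts are masked to 5 bits for `int` operands and 6 bits for `long` operands.

```java
public class ShiftOverflowDemo {
    // Multiplier of "x << shift" for an int x, computed in int arithmetic.
    // The shift count is masked to its low 5 bits, so 32 behaves like 0.
    static long intMultiplier(int shift) {
        return (long) (1 << shift);
    }

    // Same multiplier computed in long arithmetic and narrowed afterwards.
    // The shift count is masked to its low 6 bits, so 32 really shifts by 32,
    // and the narrowing to int then drops the only set bit.
    static long narrowedLongMultiplier(int shift) {
        return (long) (int) (1L << shift);
    }

    public static void main(String[] args) {
        for (int shift : new int[] {31, 32, 33}) {
            System.out.println(shift + ": int=" + intMultiplier(shift)
                    + " narrowed long=" + narrowedLongMultiplier(shift));
        }
    }
}
```

The two computations agree for shift counts below 32 and diverge from 32 upwards, which is why the multiplier has to be computed with the width of the node's basic type.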
Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: update license header year ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23506/files - new: https://git.openjdk.org/jdk/pull/23506/files/0ffae804..f570024d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23506&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23506&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23506.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23506/head:pull/23506 PR: https://git.openjdk.org/jdk/pull/23506 From kxu at openjdk.org Thu Feb 27 18:17:25 2025 From: kxu at openjdk.org (Kangcheng Xu) Date: Thu, 27 Feb 2025 18:17:25 GMT Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value [v3] In-Reply-To: <0veFkmXd54EJUWuPKG7GyAeAPwh0o1epmY3L1zi2NFM=.dd175f75-09e6-4f3f-8214-59e4cd0a0de2@github.com> References: <0veFkmXd54EJUWuPKG7GyAeAPwh0o1epmY3L1zi2NFM=.dd175f75-09e6-4f3f-8214-59e4cd0a0de2@github.com> Message-ID: <45nD3v899rc6DPcZ9wlvg_ysS1YYifNbks17CEDjong=.9dd0c48b-f119-4fdf-a053-b963738fbb48@github.com> On Thu, 27 Feb 2025 16:14:22 GMT, Kangcheng Xu wrote: >> [JDK-8347555](https://bugs.openjdk.org/browse/JDK-8347555) is a redo of [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495), which was [first merged](https://git.openjdk.org/jdk/pull/20754) and then backed out due to a regression. This patch redoes the feature and fixes the bit shift overflow problem. For more information please refer to the previous PR. >> >> When constantizing multiplications (possibly in the form of `lshifts`), the multiplier is upgraded to long and then later narrowed to int if needed. However, when a `lshift` operand is exactly `32`, overflowing an int, using long has an unexpected result (i.e., `(1 << 32) = 1` and `(int) (1L << 32) = 0`). >> >> The following was implemented to address this issue. 
>> >> if (UseNewCode2) { >> *multiplier = bt == T_INT >> ? (jlong) (1 << con->get_int()) // loss of precision is expected for int as it overflows >> : ((jlong) 1) << con->get_int(); >> } else { >> *multiplier = ((jlong) 1 << con->get_int()); >> } >> >> >> Two new bitshift overflow tests were added. > > Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 49 commits: > > - Merge branch 'master' into arithmetic-canonicalization > - added bit shift operations accept an explicit BasicType > - use explicit argument types for overloaded java_shift_left() > - use java_shift_left() > - remove UseNewCode > - Merge branch 'master' into arithmetic-canonicalization > - fix serial addition regression > - remove trailing empty comments > - fix comment grammar > > Co-authored-by: Christian Hagedorn > - remove matching power-of-2 subtractions since it's already handled by Identity() > - ... and 39 more: https://git.openjdk.org/jdk/compare/8323ddfe...0ffae804 Added `java_shift_left(jlong, jint, BasicType bt)` (and `java_shift_right(...)`, `java_shift_right_unsigned(...)` for the sake of consistency). All tests are passing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23506#issuecomment-2688734241 From shade at openjdk.org Thu Feb 27 20:04:57 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 27 Feb 2025 20:04:57 GMT Subject: RFR: 8350893: Use generated names for hand generated opto runtime blobs In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 18:00:37 GMT, Andrew Dinn wrote: > The two special case opto runtime blobs that support uncommon trap and exception handling are currently being generated using hard wired blob names determined by port-specific code. They should employ the standard blob names generated from shared declarations in file stubDeclarations.hpp. Hm. It looks like in release builds all of these would be "runtime stub"? Is that expected? 
const char* OptoRuntime::stub_name(address entry) { #ifndef PRODUCT CodeBlob* cb = CodeCache::find_blob(entry); RuntimeStub* rs =(RuntimeStub *)cb; assert(rs != nullptr && rs->is_runtime_stub(), "not a runtime stub"); return rs->name(); #else // Fast implementation for product mode (maybe it should be inlined too) return "runtime stub"; #endif } ------------- PR Review: https://git.openjdk.org/jdk/pull/23829#pullrequestreview-2648960883 From chagedorn at openjdk.org Thu Feb 27 20:14:08 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 27 Feb 2025 20:14:08 GMT Subject: RFR: 8350579: Remove Template Assertion Predicates belonging to a loop once it is folded away during IGVN In-Reply-To: <5Nbgi31ds2bXEF3Uc9AL5DyAOUdmme6DCdvly0aY-60=.92863b9e-00b1-4894-87d1-1f460c8d5b20@github.com> References: <5Nbgi31ds2bXEF3Uc9AL5DyAOUdmme6DCdvly0aY-60=.92863b9e-00b1-4894-87d1-1f460c8d5b20@github.com> Message-ID: On Thu, 27 Feb 2025 13:07:46 GMT, Christian Hagedorn wrote: > The patch fixes the issue of creating an Initialized Assertion Predicate at a loop X from a Template Assertion Predicate that was originally created for a loop Y. Using the unrelated loop values from loop Y for the Initialized Assertion Predicate will let it fail during runtime and we execute a `halt` instruction. This was originally reported with [JDK-8305428](https://bugs.openjdk.org/browse/JDK-8305428). > > Note that most of the line changes are from new tests. > > ### The Problem > There are multiple test cases triggering the same problem. In the following, when referring to "the test case", I'm referring to `testTemplateAssertionPredicateNotRemovedHalt()` which was written from scratch and contains more detailed comments explaining how we end up with executing a `Halt` node in more details. 
> > #### An Inner Loop without Parse Predicates > The graph in `testTemplateAssertionPredicateNotRemovedHalt()` looks like this after creating `LoopNodes` for the outer `for` and inner `while (true)` loop: > > ![image](https://github.com/user-attachments/assets/7ac60e35-0b7e-4f04-b9dd-6eb8c8654a15) > > We only have Parse Predicates for the outer loop. Why? > > Before beautify loop, we have the following region which merges multiple backedges - the one from the `for` loop and the one from the `while (true)` loop: > > ![image](https://github.com/user-attachments/assets/7895161d-5ac1-46d6-93fe-5ab90ef24ab9) > > In `IdealLoopTree::merge_many_backedges()`, we notice that the hottest backedge is hot enough such that it is worth to have a separate merge point region for the inner and outer loop. We set everything up and eventually in `IdealLoopTree::split_outer_loop()`, we create a second `LoopNode`. > > For this inner `LoopNode`, we cannot set up `Parse Predicates` with the same UCTs as used for the outer loop. It would be incorrect when taking the trap to re-execute the inner and outer loop again while having already executed some of the outer loop's iterations. Thus, we get the graph shape with back-to-back `LoopNodes` as shown above. > > #### Predicates from a Folded Loop End up at Another Loop > As described in the previous section, we have an inner and outer `LoopNode` while the inner does not have Parse Predicates. In a series of events (see test case comments for more details), we first hoist a range check out of the outer loop during Loop Predication with a Template Assertion Predicate. Then, we fold the outer loop away because we find that it is only running for a single iteration and the bac... Thanks Roland for having a look. I agree that it is indeed quite fragile. It started out with a quite simple fix but then I found more and more cases with fuzzing where we have some weird in-between states in IGVN while a predicate is being folded where matching failed. 
I was not super happy with matching predicates during IGVN which is difficult and error-prone to get right. > So I'm wondering if there would be a way to mark predicates as being for a particular loop (maybe storing the loop's node id they apply to in predicate nodes and making sure it's properly updated as loops are cloned etc.) so when there is a mismatch between the loop and predicate it can be detected? That's an interesting idea that could work more reliably. Let me think about that more. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23823#issuecomment-2689007300 From liach at openjdk.org Thu Feb 27 21:26:03 2025 From: liach at openjdk.org (Chen Liang) Date: Thu, 27 Feb 2025 21:26:03 GMT Subject: RFR: 8349503: Consolidate multi-byte access into ByteArray In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 23:41:19 GMT, Chen Liang wrote: > `MethodHandles.byteArrayViewVarHandle` exposes checked multi-byte access to byte arrays via VarHandle. This larger access speeds up many operations, yet it cannot be used in early bootstrap, and as a result, people tend to use `Unsafe` which can threaten memory safety of the Java Platform. > > To promote the safe use of multi-byte access, I propose to move the checked implementations from VarHandle to ByteArray to allow earlier use and reduce maintenance costs. In addition, ByteArrayLittleEndian is consolidated, and now the access methods are distinguished by BO (byte order) / BE (big endian) / LE (little endian) suffixes to indicate their access features. I think I should withdraw this patch: the multi-byte access is used for 2 purposes: to read actual larger data or just to read multiple bytes for performance, and ByteArray is for the former. We should probably look at vectors for the multiple byte access purpose. 
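For readers unfamiliar with the checked multi-byte access mentioned in this thread, here is a minimal sketch using the public `MethodHandles.byteArrayViewVarHandle` API. It is an illustration only, not the internal `ByteArray` class the PR proposed to change. The class and method names (`ByteArrayViewDemo`, `readIntLE`, `readIntBE`) are hypothetical.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class ByteArrayViewDemo {
    // View a byte[] as a sequence of ints in a fixed byte order.
    private static final VarHandle INT_LE =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
    private static final VarHandle INT_BE =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.BIG_ENDIAN);

    // Reads 4 bytes starting at byte offset off; the access is bounds-checked.
    static int readIntLE(byte[] bytes, int off) {
        return (int) INT_LE.get(bytes, off);
    }

    static int readIntBE(byte[] bytes, int off) {
        return (int) INT_BE.get(bytes, off);
    }

    public static void main(String[] args) {
        byte[] bytes = {0x78, 0x56, 0x34, 0x12};
        System.out.println(Integer.toHexString(readIntLE(bytes, 0)));
        System.out.println(Integer.toHexString(readIntBE(bytes, 0)));
    }
}
```

An out-of-range offset raises `IndexOutOfBoundsException` rather than reading past the array, which is the safety property contrasted with `Unsafe` in the PR description.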
------------- PR Comment: https://git.openjdk.org/jdk/pull/23478#issuecomment-2689141951 From liach at openjdk.org Thu Feb 27 21:26:03 2025 From: liach at openjdk.org (Chen Liang) Date: Thu, 27 Feb 2025 21:26:03 GMT Subject: Withdrawn: 8349503: Consolidate multi-byte access into ByteArray In-Reply-To: References: Message-ID: On Wed, 5 Feb 2025 23:41:19 GMT, Chen Liang wrote: > `MethodHandles.byteArrayViewVarHandle` exposes checked multi-byte access to byte arrays via VarHandle. This larger access speeds up many operations, yet it cannot be used in early bootstrap, and as a result, people tend to use `Unsafe` which can threaten memory safety of the Java Platform. > > To promote the safe use of multi-byte access, I propose to move the checked implementations from VarHandle to ByteArray to allow earlier use and reduce maintenance costs. In addition, ByteArrayLittleEndian is consolidated, and now the access methods are distinguished by BO (byte order) / BE (big endian) / LE (little endian) suffixes to indicate their access features. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/23478 From psandoz at openjdk.org Thu Feb 27 23:33:04 2025 From: psandoz at openjdk.org (Paul Sandoz) Date: Thu, 27 Feb 2025 23:33:04 GMT Subject: RFR: 8350748: VectorAPI: Method "checkMaskFromIndexSize" should be force inlined In-Reply-To: <18Q2Zl2ip_eFS_Y4fflgS8XYBkbwCZ468DIjP3KwhDE=.240f4182-4b02-4fac-97c8-ac659427e4a8@github.com> References: <18Q2Zl2ip_eFS_Y4fflgS8XYBkbwCZ468DIjP3KwhDE=.240f4182-4b02-4fac-97c8-ac659427e4a8@github.com> Message-ID: On Thu, 27 Feb 2025 06:43:19 GMT, Xiaohong Gong wrote: > Method `checkMaskFromIndexSize` is called by some vector masked APIs like `fromArray/intoArray/fromMemorySegment/...`. It is used to check whether the index of any active lanes in a mask will reach out of the boundary of the given Array/MemorySegment. 
This function should be force-inlined; otherwise a VectorMask object is generated whenever the call is not inlined by the C2 compiler, which hurts API performance considerably. > > This patch calls the `VectorMask.checkFromIndexSize` method directly inside these APIs instead of `checkMaskFromIndexSize`. Since it already has the `@ForceInline` annotation, it will be inlined and intrinsified by C2, and the expected vector instructions can then be generated. With this change, the unused `checkMaskFromIndexSize` can be removed. > > Performance of some JMH benchmarks improves by up to 14x on an NVIDIA Grace CPU (AArch64 SVE2, 128-bit vectors). We can also observe a similar performance improvement on an Intel CPU that supports AVX512. > > Following is the performance data on Grace:
>
> Benchmark                                            Mode   Cnt  Units   Before     After      Gain
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE      thrpt  30   ops/ms  31544.304  31610.598  1.002
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE    thrpt  30   ops/ms  3896.202   3903.249   1.001
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE     thrpt  30   ops/ms  570.415    7174.320   12.57
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE       thrpt  30   ops/ms  566.694    7193.520   12.69
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE      thrpt  30   ops/ms  3899.269   3878.258   0.994
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE     thrpt  30   ops/ms  1134.301   16053.847  14.15
> StoreMaskedIOOBEBenchmark.byteStoreArrayMaskIOOBE    thrpt  30   ops/ms  26449.558  28699.480  1.085
> StoreMaskedIOOBEBenchmark.doubleStoreArrayMaskIOOBE  thrpt  30   ops/ms  1922.167   5781.077   3.007
> StoreMaskedIOOBEBenchmark.floatStoreArrayMaskIOOBE   thrpt  30   ops/ms  3784.190   11789.276  3.115
> StoreMaskedIOOBEBenchmark.intStoreArrayMaskIOOBE     thrpt  30   ops/ms  3694.082   15633.547  4.232
> StoreMaskedIOOBEBenchmark.longStoreArrayMaskIOOBE    thrpt  30   ops/ms  1966.956   6049.790   3.075
> StoreMaskedIOOBEBenchmark.shortStoreArrayMaskIOOBE   thrpt  30   ops/ms  7647.309   27412.387  3.584

Marked as reviewed by psandoz (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/23817#pullrequestreview-2649322190 From duke at openjdk.org Fri Feb 28 00:10:11 2025 From: duke at openjdk.org (duke) Date: Fri, 28 Feb 2025 00:10:11 GMT Subject: Withdrawn: 8336759: C2: int counted loop with long limit not recognized as counted loop In-Reply-To: <_d_CiLfCN9ahEmhp9fLcGqO-L8n7a0gW86R3lzLkX60=.b3bdc697-cb2d-477f-a525-0f16a3eee383@github.com> References: <_d_CiLfCN9ahEmhp9fLcGqO-L8n7a0gW86R3lzLkX60=.b3bdc697-cb2d-477f-a525-0f16a3eee383@github.com> Message-ID: <4I7ef3XiaIMPxLSREzbeG4RhmKwmk8iLWtS-YfnuaKM=.f11c2505-0732-4a9c-90db-991e9af8b2bf@github.com> On Fri, 29 Nov 2024 01:08:04 GMT, Kangcheng Xu wrote: > This patch implements [JDK-8336759](https://bugs.openjdk.org/browse/JDK-8336759), which recognizes int counted loops with long limits. > > Currently, patterns like `for ( int i =...; i < long_limit; ...)`, where int `i` is implicitly promoted to long (i.e., `(long) i < long_limit`), are not recognized as (int) counted loops. This patch speculatively and optimistically converts long limits to ints and deoptimizes if the limit is outside the int range, allowing more optimization opportunities. > > In other words, it transforms > > > for (int i = 0; (long) i < long_limit; i++) {...} > > > to > > > if (int_min <= long_limit && long_limit <= int_max ) { > for (int i = 0; i < (int) long_limit; i++) {...} > } else { > trap: loop_limit_check > } > > > This could benefit calls to APIs like `long MemorySegment#byteSize()` when iterating over a long limit. This pull request has been closed without being integrated. 
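The transformation quoted above can be written out as a source-level sketch. This is an assumption-laden illustration, not what C2 does internally: the class and method names are hypothetical, and the uncommon trap is modeled by simply falling back to the original loop shape.

```java
public class LongLimitLoopDemo {
    // Original shape: an int induction variable compared against a long limit.
    // Note: for limit > Integer.MAX_VALUE this loop never terminates, because i wraps around.
    static long sumDirect(long limit) {
        long sum = 0;
        for (int i = 0; (long) i < limit; i++) {
            sum += i;
        }
        return sum;
    }

    // Speculative shape: narrow the limit once, guarded by a range check.
    static long sumGuarded(long limit) {
        if (Integer.MIN_VALUE <= limit && limit <= Integer.MAX_VALUE) {
            int intLimit = (int) limit; // lossless inside the guard
            long sum = 0;
            for (int i = 0; i < intLimit; i++) { // plain int counted loop
                sum += i;
            }
            return sum;
        }
        // Stand-in for the uncommon trap: resume with the unoptimized loop.
        return sumDirect(limit);
    }

    public static void main(String[] args) {
        System.out.println(sumDirect(10) + " " + sumGuarded(10));
    }
}
```

Inside the guard both loops execute the same iterations, so the narrowed form is a legal replacement whenever the range check passes.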
-------------

PR: https://git.openjdk.org/jdk/pull/22449

From cjplummer at openjdk.org Fri Feb 28 01:19:58 2025
From: cjplummer at openjdk.org (Chris Plummer)
Date: Fri, 28 Feb 2025 01:19:58 GMT
Subject: RFR: 8315488: SA ciReplay support is no longer up-to-date with hotspot ciReplay support [v4]
In-Reply-To:
References:
Message-ID:

On Thu, 27 Feb 2025 18:06:19 GMT, Coleen Phillimore wrote:

>> This change removes the ci, c1 and c2 compiler code from the serviceability agent. The ciReplay functionality is supported inside the jvm and this duplicated functionality in SA had bit rotted so is removed.
>> Tested with tier1-4.
>
> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision:
>
>   More friends.

src/jdk.hotspot.agent/doc/index.html line 43:

> 41:
> 42:
> 43: Compilation Replay

clhsdb.html also needs to be updated.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23782#discussion_r1974577106

From fyang at openjdk.org Fri Feb 28 01:27:05 2025
From: fyang at openjdk.org (Fei Yang)
Date: Fri, 28 Feb 2025 01:27:05 GMT
Subject: RFR: 8350095: RISC-V: Refactor string_compare [v7]
In-Reply-To:
References:
Message-ID:

On Wed, 26 Feb 2025 12:00:47 GMT, Hamlin Li wrote:

>> Hi,
>> Can you help to review this patch?
>>
>> Currently, the `string_compare` code is a bit complicated. The main reasons include:
>> 1. It has two pieces of code, one each for the LU and UL cases. This is not necessary, as LU and UL behave quite similarly.
>> 2. It mixes the LL/UU and LU/UL cases together. It is better to separate them, as they are quite different from each other.
>>
>> This is not good for code reading and maintaining.
>>
>> So, this patch does the following refactoring:
>> 1. merge LU and UL code into one, i.e. remove UL code.
>> 2. separate the code into 2 methods: LL/UU and LU/UL.
>> 3. some other misc improvement.
>>
>> I could do the following refactoring in a follow-up PR, as in this PR I'm just moving and removing code, which is easier to do and to review. This applies in particular to the first item, as it needs to rewrite the existing code for the UL/LU case.
>> 1. move alignment code of `generate_compare_long_string_different_encoding` upwards to `string_compare_long_LU`.
>> 2. make `SHORT_STRING` case simpler.
>>
>> Thanks
>
> Hamlin Li has updated the pull request incrementally with two additional commits since the last revision:
>
>  - check short string
>  - rename

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1393:

> 1391:
> 1392:   const int base_offset = isLL ? arrayOopDesc::base_offset_in_bytes(T_BYTE)
> 1393:                                : arrayOopDesc::base_offset_in_bytes(T_CHAR);

Sorry, but I don't think we need to distinguish `T_BYTE` and `T_CHAR` here. The reason is that the character storage used for both Latin1 and UTF16 strings is always a byte array [1].
So we should always use `T_BYTE` for both cases.

[1] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L160

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1487:

> 1485:   Label TAIL, NEXT_WORD, DIFFERENCE;
> 1486:
> 1487:   const int base_offset = arrayOopDesc::base_offset_in_bytes(T_CHAR);

Similar here. Use `T_BYTE` instead of `T_CHAR`.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1974591975
PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1974592263

From kvn at openjdk.org Fri Feb 28 02:42:57 2025
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Fri, 28 Feb 2025 02:42:57 GMT
Subject: RFR: 8315488: SA ciReplay support is no longer up-to-date with hotspot ciReplay support [v4]
In-Reply-To:
References:
Message-ID:

On Thu, 27 Feb 2025 18:06:19 GMT, Coleen Phillimore wrote:

>> This change removes the ci, c1 and c2 compiler code from the serviceability agent. The ciReplay functionality is supported inside the jvm and this duplicated functionality in SA had bit rotted so is removed.
>> Tested with tier1-4.
>
> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision:
>
>   More friends.

src/hotspot/share/runtime/vmStructs.cpp line 706:

> 704:   volatile_nonstatic_field(MonitorList, _head, ObjectMonitor*) \
> 705:   \
> 706:   unchecked_c2_static_field(Matcher, _regEncode, sizeof(Matcher::_regEncode)) /* NOTE: no type */ \

I don't see any usage in SA of `VMReg::regEncode()`, which accesses this field.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23782#discussion_r1974673849 From amitkumar at openjdk.org Fri Feb 28 02:59:52 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Fri, 28 Feb 2025 02:59:52 GMT Subject: RFR: 8350683: Non-C2 / minimal JVM crashes in the build on ppc64 platforms In-Reply-To: References: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> Message-ID: On Thu, 27 Feb 2025 09:19:50 GMT, Amit Kumar wrote: >> When building a JVM without C2 (e.g. minimal) on ppc64 platforms , it crashes in the build because of unwanted dependencies to C2. >> AIX crash is (linux ppc64le crash is similar) : >> >> >> # Internal Error (compiledIC_ppc.cpp:141), pid=17695018, tid=258 >> # Error: ShouldNotReachHere() >> # >> >> iar: 0x0900000008800c60 libjvm.so::AixNativeCallstack::print_callstack_for_context(outputStream*, ucontext_t const*, bool, char*, unsigned long)+0x4bc (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:18 fixedparms:5 parmsonstk:1) >> lr: 0x09000000087ea9b8 libjvm.so::fdStream::write(char const*, unsigned long)+0x44 (C++ uses_alloca saves_lr stores_bc gpr_saved:4 fixedparms:3 parmsonstk:1) >> sp: 0x000000011023aab0 (base - 0x2DD8) >> rtoc: 0x08001000a0088ff0 >> |---stackaddr----| |----lrsave------|: >> 0x000000011023aea0 - 0x0900000008800730 libjvm.so::os::Aix::platform_print_native_stack(outputStream*, void const*, char*, int, unsigned char*&)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:5 parmsonstk:1) >> 0x000000011023af20 - 0x0900000008800644 libjvm.so::NativeStackPrinter::print_stack(outputStream*, char*, int, unsigned char*&, bool, int)+0x60 (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:6 fixedparms:7 parmsonstk:1) >> 0x000000011023afc0 - 0x09000000087f6ff8 libjvm.so::VMError::report(outputStream*, bool)+0x11f0 (C++ fp_present uses_alloca saves_cr saves_lr stores_bc gpr_saved:13 fixedparms:2 parmsonstk:1) >> 0x000000011023b830 - 
0x09000000087e9fdc libjvm.so::VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x6a0 (C++ uses_alloca saves_lr stores_bc gpr_saved:18 fixedparms:8 parmsonstk:1) >> 0x000000011023ba10 - 0x09000000087e96a0 libjvm.so::report_vm_error(char const*, int, char const*, char const*, ...)+0xa0 (C++ uses_alloca saves_lr stores_bc gpr_saved:5 fixedparms:4 parmsonstk:1) >> 0x000000011023bad0 - 0x09000000087e95cc libjvm.so::report_vm_error(char const*, int, char const*)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:3 parmsonstk:1) >> 0x000000011023bb50 - 0x09000000087e956c libjvm.so::report_should_not_reach_here(char const*, int)+0x20 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) >> 0x000000011023bbd0 - 0x0900000008906e5c libjvm.so::CompiledDirectCall::emit_to_interp_stub(MacroAssembler*, unsigned char*)+0x28 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 p... > > Ok I ran another build with this configuration and still build was successful : > > bash configure \ > --with-boot-jdk=$HOME/boot_jdk_23 \ > --with-jtreg=$HOME/jtreg \ > --with-gtest=$HOME/googletest \ > --with-jmh=build/jmh/jars \ > --with-jvm-variants=minimal \ > --with-jvm-features=minimal \ > --with-debug-level=fastdebug \ > --with-native-debug-symbols=internal \ > --disable-precompiled-headers > Hi @offamitkumar this is a bit surprising because the matcher header is only included if COMPILER2 is defined https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/s390/compiledIC_s390.cpp#L32 and this should not be the case in minimal JVM , but if this somehow works on your side, then fine :-) ! I have finished tier1 on minimal build, will see if this issue gets reproduced by any of the tests. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23794#issuecomment-2689596308 From amitkumar at openjdk.org Fri Feb 28 02:59:51 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Fri, 28 Feb 2025 02:59:51 GMT Subject: RFR: 8350683: Non-C2 / minimal JVM crashes in the build on ppc64 platforms In-Reply-To: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> References: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> Message-ID: On Wed, 26 Feb 2025 09:23:25 GMT, Matthias Baesken wrote: > When building a JVM without C2 (e.g. minimal) on ppc64 platforms , it crashes in the build because of unwanted dependencies to C2. > AIX crash is (linux ppc64le crash is similar) : > > > # Internal Error (compiledIC_ppc.cpp:141), pid=17695018, tid=258 > # Error: ShouldNotReachHere() > # > > iar: 0x0900000008800c60 libjvm.so::AixNativeCallstack::print_callstack_for_context(outputStream*, ucontext_t const*, bool, char*, unsigned long)+0x4bc (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:18 fixedparms:5 parmsonstk:1) > lr: 0x09000000087ea9b8 libjvm.so::fdStream::write(char const*, unsigned long)+0x44 (C++ uses_alloca saves_lr stores_bc gpr_saved:4 fixedparms:3 parmsonstk:1) > sp: 0x000000011023aab0 (base - 0x2DD8) > rtoc: 0x08001000a0088ff0 > |---stackaddr----| |----lrsave------|: > 0x000000011023aea0 - 0x0900000008800730 libjvm.so::os::Aix::platform_print_native_stack(outputStream*, void const*, char*, int, unsigned char*&)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:5 parmsonstk:1) > 0x000000011023af20 - 0x0900000008800644 libjvm.so::NativeStackPrinter::print_stack(outputStream*, char*, int, unsigned char*&, bool, int)+0x60 (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:6 fixedparms:7 parmsonstk:1) > 0x000000011023afc0 - 0x09000000087f6ff8 libjvm.so::VMError::report(outputStream*, bool)+0x11f0 (C++ fp_present uses_alloca saves_cr saves_lr 
stores_bc gpr_saved:13 fixedparms:2 parmsonstk:1) > 0x000000011023b830 - 0x09000000087e9fdc libjvm.so::VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x6a0 (C++ uses_alloca saves_lr stores_bc gpr_saved:18 fixedparms:8 parmsonstk:1) > 0x000000011023ba10 - 0x09000000087e96a0 libjvm.so::report_vm_error(char const*, int, char const*, char const*, ...)+0xa0 (C++ uses_alloca saves_lr stores_bc gpr_saved:5 fixedparms:4 parmsonstk:1) > 0x000000011023bad0 - 0x09000000087e95cc libjvm.so::report_vm_error(char const*, int, char const*)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:3 parmsonstk:1) > 0x000000011023bb50 - 0x09000000087e956c libjvm.so::report_should_not_reach_here(char const*, int)+0x20 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) > 0x000000011023bbd0 - 0x0900000008906e5c libjvm.so::CompiledDirectCall::emit_to_interp_stub(MacroAssembler*, unsigned char*)+0x28 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) > 0x000000011023bc50 - 0x090000... Looks Good. ------------- Marked as reviewed by amitkumar (Committer). PR Review: https://git.openjdk.org/jdk/pull/23794#pullrequestreview-2649595989 From cjplummer at openjdk.org Fri Feb 28 04:15:58 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Fri, 28 Feb 2025 04:15:58 GMT Subject: RFR: 8315488: SA ciReplay support is no longer up-to-date with hotspot ciReplay support [v4] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 18:06:19 GMT, Coleen Phillimore wrote: >> This change removes the ci, c1 and c2 compiler code from the serviceability agent. The ciReplay functionality is supported inside the jvm and this duplicated functionality in SA had bit rotted so is removed. >> Tested with tier1-4. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > More friends. 
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/CompilerThread.java line 48:

> 46:   private static synchronized void initialize(TypeDataBase db) throws WrongTypeException {
> 47:     Type type = db.lookupType("CompilerThread");
> 48:

Line 47 above is no longer needed.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23782#discussion_r1974736433

From epeter at openjdk.org Fri Feb 28 06:54:01 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 28 Feb 2025 06:54:01 GMT
Subject: RFR: 8347405: MergeStores with reverse bytes order value [v13]
In-Reply-To: <6XEmUiapz_UElQlM-x5g61YOp2DSqSh3b0Vdq1jsWx8=.c7cd0945-47e4-49cb-bbd4-e6b7bc06c743@github.com>
References: <6XEmUiapz_UElQlM-x5g61YOp2DSqSh3b0Vdq1jsWx8=.c7cd0945-47e4-49cb-bbd4-e6b7bc06c743@github.com>
Message-ID:

On Wed, 26 Feb 2025 12:22:46 GMT, kuaiwei wrote:

>> This patch enhances the MergeStores optimization to support merging a value in reverse byte order.
>>
>> Below are the benchmark results before and after the patch:
>>
>> On aliyun g8y (aarch64)
>> |name | before | score2 | ratio |
>> |---|---|---|---|
>> |MergeStoreBench.setCharBS |5669.655000 |5669.566000 | 0.00 %|
>> |MergeStoreBench.setCharBV |5516.911000 |5516.273000 | 0.01 %|
>> |MergeStoreBench.setCharC |5578.644000 |5552.809000 | 0.47 %|
>> |MergeStoreBench.setCharLS |5782.140000 |5779.264000 | 0.05 %|
>> |MergeStoreBench.setCharLV |5496.403000 |5499.195000 | -0.05 %|
>> |MergeStoreBench.setIntB |6087.703000 |2768.385000 | 119.90 %|
>> |MergeStoreBench.setIntBU |6733.813000 |2950.240000 | 128.25 %|
>> |MergeStoreBench.setIntBV |1362.233000 |1361.821000 | 0.03 %|
>> |MergeStoreBench.setIntL |2834.785000 |2833.042000 | 0.06 %|
>> |MergeStoreBench.setIntLU |2947.145000 |2946.874000 | 0.01 %|
>> |MergeStoreBench.setIntLV |5506.791000 |5506.229000 | 0.01 %|
>> |MergeStoreBench.setIntRB |7634.279000 |5611.058000 | 36.06 %|
>> |MergeStoreBench.setIntRBU |7766.737000 |5551.281000 | 39.91 %|
>> |MergeStoreBench.setIntRL |5689.793000 |5689.385000 | 0.01 %|
>> |MergeStoreBench.setIntRLU |5628.287000 |5628.789000 | -0.01 %|
>> |MergeStoreBench.setIntRU |5536.039000 |5534.910000 | 0.02 %|
>> |MergeStoreBench.setIntU |5595.363000 |5567.810000 | 0.49 %|
>> |MergeStoreBench.setLongB |13722.671000 |6811.098000 | 101.48 %|
>> |MergeStoreBench.setLongBU |13728.844000 |4280.240000 | 220.75 %|
>> |MergeStoreBench.setLongBV |2785.255000 |2785.949000 | -0.02 %|
>> |MergeStoreBench.setLongL |5714.615000 |5710.402000 | 0.07 %|
>> |MergeStoreBench.setLongLU |4128.746000 |4129.324000 | -0.01 %|
>> |MergeStoreBench.setLongLV |2793.125000 |2794.438000 | -0.05 %|
>> |MergeStoreBench.setLongRB |14465.223000 |7015.050000 | 106.20 %|
>> |MergeStoreBench.setLongRBU |14546.954000 |6173.210000 | 135.65 %|
>> |MergeStoreBench.setLongRL |6816.145000 |6813.348000 | 0.04 %|
>> |MergeStoreBench.setLongRLU |4289.445000 |4284.239000 | 0.12 %|
>> |MergeStoreBench.setLongRU |3132.471000 |3133.093000 | -0.02 %|
>> |MergeStoreBench.setLongU |3086.779000 |3087.298000 | -0.02 %|
>>
>> AMD EPYC 9T24
>> ...
>
> kuaiwei has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 19 commits:
>
>  - Merge remote-tracking branch 'origin/master' into pr/merge_stores_reverse
>  - Add readable comment
>  - Fix for review comments
>  - Allow ValueOrder::Reverse on big-endian platforms
>  - Revert "Merge more stores"
>
>    This reverts commit 1e1113ed02ec5a9fe181f215d5667e8de487fe47.
>  - Revert "Fix test502aBE"
>
>    This reverts commit f773fa368577c4f67957c4d40968c5c45e3ae205.
>  - Fix test502aBE
>  - Merge more stores
>  - Remove an useless assertion
>  - Remove tailing white space
>  - ... and 9 more: https://git.openjdk.org/jdk/compare/aac9cb45...b3243a56

Marked as reviewed by epeter (Reviewer).
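
As a rough illustration of the kind of store pattern this thread is about, here is a minimal sketch. It is not the benchmark's code, and the class and method names are hypothetical. It only shows, at the Java level, that four single-byte stores written most-significant-byte first leave the same bytes in the array as one 4-byte little-endian store of `Integer.reverseBytes(value)`. Whether and when C2 performs such a merge depends on the platform and on the patch under review.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

class ReverseOrderStoreSketch {
    // Views a byte[] as little-endian ints; index arguments are byte offsets.
    private static final VarHandle INT_LE =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    // Four single-byte stores, most significant byte first (big-endian layout).
    static void storeBytewiseBigEndian(byte[] dst, int offset, int value) {
        dst[offset]     = (byte) (value >> 24);
        dst[offset + 1] = (byte) (value >> 16);
        dst[offset + 2] = (byte) (value >> 8);
        dst[offset + 3] = (byte) value;
    }

    // One 4-byte little-endian store of the byte-reversed value.
    static void storeMergedReversed(byte[] dst, int offset, int value) {
        INT_LE.set(dst, offset, Integer.reverseBytes(value));
    }

    public static void main(String[] args) {
        byte[] a = new byte[8];
        byte[] b = new byte[8];
        storeBytewiseBigEndian(a, 2, 0x11223344);
        storeMergedReversed(b, 2, 0x11223344);
        System.out.println(java.util.Arrays.equals(a, b));
    }
}
```

The sketch uses a `VarHandle` only to express a single wide store in plain Java; the compiler works on its own intermediate representation rather than on code like this.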
------------- PR Review: https://git.openjdk.org/jdk/pull/23030#pullrequestreview-2649895739 From epeter at openjdk.org Fri Feb 28 06:54:02 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 28 Feb 2025 06:54:02 GMT Subject: RFR: 8347405: MergeStores with reverse bytes order value [v12] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 12:22:16 GMT, kuaiwei wrote: >> kuaiwei has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix for review comments > > Merged with master @kuaiwei Do you have github actions enabled? Because you only have 1 check passed, and usually there are about 16: ![image](https://github.com/user-attachments/assets/2e26c2c9-f272-4045-90f7-df8e6ab22066) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23030#issuecomment-2689872716 From duke at openjdk.org Fri Feb 28 06:58:03 2025 From: duke at openjdk.org (Matthias Ernst) Date: Fri, 28 Feb 2025 06:58:03 GMT Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 07:22:56 GMT, Emanuel Peter wrote: >> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision: >> >> incorporate @eme64's comment suggestions > > About severity: As long as we find and integrate a fix during `JDK25` it is fine (the issue does not break the CI that badly at the moment). If we get close to rampdown, then we have to consider if we want to backout 8346664 or if we defer the bug to `JDK26`. The good news: it's easy to reproduce the issue by adding a loop to CheckProp, and I have a fix up for CCP that appears(!) straightforward. I do not quite understand the interplay with IGVN, the set of "push" rules does not look consistent between the two phases to me. But re-reading @eme64's response, CCP is necessary for correctness, IGVN is optimization potential(?). 
That said, I have taken up new employment and need to sort out the contribution framework first before I can contribute the fix.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2689878767

From qamai at openjdk.org Fri Feb 28 07:24:08 2025
From: qamai at openjdk.org (Quan Anh Mai)
Date: Fri, 28 Feb 2025 07:24:08 GMT
Subject: RFR: 8346664: C2: Optimize mask check with constant offset [v24]
In-Reply-To:
References:
Message-ID:

On Fri, 28 Feb 2025 06:54:52 GMT, Matthias Ernst wrote:

>> About severity: As long as we find and integrate a fix during `JDK25` it is fine (the issue does not break the CI that badly at the moment). If we get close to rampdown, then we have to consider if we want to backout 8346664 or if we defer the bug to `JDK26`.
>
> The good news: it's easy to reproduce the issue by adding a loop to CheckProp, and I have a fix up for CCP that appears(!) straightforward. I do not quite understand the interplay with IGVN, the set of "push" rules does not look consistent between the two phases to me. But re-reading @eme64's response, CCP is necessary for correctness, IGVN is optimization potential(?).
>
> That said, I have taken up new employment and need to sort out the contribution framework first before I can contribute the fix.

@mernst-github IGVN and CCP try to infer information about a node from its inputs. As a result, when an input of a node is changed by IGVN or CCP, you need to do the inference on that node again. Theoretically, you need to push all nodes below a changed node to the worklist, but doing so is expensive and unnecessary. So, we use the fairly convoluted approach of pushing all immediate uses, and indirect uses are pushed in an ad-hoc manner depending on the way particular nodes do their inference. For example, in this case, since AndNode::Value looks through a LShiftNode, you need to do the opposite when pushing them to the IGVN worklist, i.e. pushing every AndNode that uses an LShiftNode which is itself a direct use of the current node.

> I do not quite understand the interplay with IGVN, the set of "push" rules does not look consistent between the two phases to me.

This is because IGVN uses Ideal, Identity, and Value, which means that the pushing logic needs to take the inference of all these methods into consideration. CCP, in contrast, only uses Value, so the pushing logic only needs to know which nodes the Value methods try to look through.

> CCP is necessary for correctness, IGVN is optimization potential(?).

Not really, we assume a graph is stable after IGVN, so missed idealisation here can also be considered a bug (especially missed transformations with CFG nodes). The difference is that these assumptions do not practically cover all nodes so we may get away with some of them.

> That said, I have taken up new employment and need to sort out the contribution framework first before I can contribute the fix.

Don't worry, RDP1 of JDK-25 is not so urgent. So take your time.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2689917878

From mbaesken at openjdk.org Fri Feb 28 07:38:56 2025
From: mbaesken at openjdk.org (Matthias Baesken)
Date: Fri, 28 Feb 2025 07:38:56 GMT
Subject: RFR: 8350683: Non-C2 / minimal JVM crashes in the build on ppc64 platforms
In-Reply-To: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com>
References: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com>
Message-ID:

On Wed, 26 Feb 2025 09:23:25 GMT, Matthias Baesken wrote:

> When building a JVM without C2 (e.g. minimal) on ppc64 platforms , it crashes in the build because of unwanted dependencies to C2.
> AIX crash is (linux ppc64le crash is similar) : > > > # Internal Error (compiledIC_ppc.cpp:141), pid=17695018, tid=258 > # Error: ShouldNotReachHere() > # > > iar: 0x0900000008800c60 libjvm.so::AixNativeCallstack::print_callstack_for_context(outputStream*, ucontext_t const*, bool, char*, unsigned long)+0x4bc (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:18 fixedparms:5 parmsonstk:1) > lr: 0x09000000087ea9b8 libjvm.so::fdStream::write(char const*, unsigned long)+0x44 (C++ uses_alloca saves_lr stores_bc gpr_saved:4 fixedparms:3 parmsonstk:1) > sp: 0x000000011023aab0 (base - 0x2DD8) > rtoc: 0x08001000a0088ff0 > |---stackaddr----| |----lrsave------|: > 0x000000011023aea0 - 0x0900000008800730 libjvm.so::os::Aix::platform_print_native_stack(outputStream*, void const*, char*, int, unsigned char*&)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:5 parmsonstk:1) > 0x000000011023af20 - 0x0900000008800644 libjvm.so::NativeStackPrinter::print_stack(outputStream*, char*, int, unsigned char*&, bool, int)+0x60 (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:6 fixedparms:7 parmsonstk:1) > 0x000000011023afc0 - 0x09000000087f6ff8 libjvm.so::VMError::report(outputStream*, bool)+0x11f0 (C++ fp_present uses_alloca saves_cr saves_lr stores_bc gpr_saved:13 fixedparms:2 parmsonstk:1) > 0x000000011023b830 - 0x09000000087e9fdc libjvm.so::VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x6a0 (C++ uses_alloca saves_lr stores_bc gpr_saved:18 fixedparms:8 parmsonstk:1) > 0x000000011023ba10 - 0x09000000087e96a0 libjvm.so::report_vm_error(char const*, int, char const*, char const*, ...)+0xa0 (C++ uses_alloca saves_lr stores_bc gpr_saved:5 fixedparms:4 parmsonstk:1) > 0x000000011023bad0 - 0x09000000087e95cc libjvm.so::report_vm_error(char const*, int, char const*)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:3 parmsonstk:1) > 0x000000011023bb50 - 
0x09000000087e956c libjvm.so::report_should_not_reach_here(char const*, int)+0x20 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) > 0x000000011023bbd0 - 0x0900000008906e5c libjvm.so::CompiledDirectCall::emit_to_interp_stub(MacroAssembler*, unsigned char*)+0x28 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) > 0x000000011023bc50 - 0x090000... Thanks for the reviews ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23794#issuecomment-2689941464 From mbaesken at openjdk.org Fri Feb 28 07:38:57 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Fri, 28 Feb 2025 07:38:57 GMT Subject: Integrated: 8350683: Non-C2 / minimal JVM crashes in the build on ppc64 platforms In-Reply-To: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> References: <0sN9yVAeol7mbd5UA0gJJR-53PhuRBZoLpomkS0t3Qg=.b5eb193c-a771-4a55-a628-f84ebfafd084@github.com> Message-ID: <75bo8qKbJgrQ1Kg8KlhOepE8Ap2ISjby4jyTs2-grao=.94806c1b-d143-4a8b-9c7e-bbcdf994f3c7@github.com> On Wed, 26 Feb 2025 09:23:25 GMT, Matthias Baesken wrote: > When building a JVM without C2 (e.g. minimal) on ppc64 platforms , it crashes in the build because of unwanted dependencies to C2. 
> AIX crash is (linux ppc64le crash is similar) : > > > # Internal Error (compiledIC_ppc.cpp:141), pid=17695018, tid=258 > # Error: ShouldNotReachHere() > # > > iar: 0x0900000008800c60 libjvm.so::AixNativeCallstack::print_callstack_for_context(outputStream*, ucontext_t const*, bool, char*, unsigned long)+0x4bc (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:18 fixedparms:5 parmsonstk:1) > lr: 0x09000000087ea9b8 libjvm.so::fdStream::write(char const*, unsigned long)+0x44 (C++ uses_alloca saves_lr stores_bc gpr_saved:4 fixedparms:3 parmsonstk:1) > sp: 0x000000011023aab0 (base - 0x2DD8) > rtoc: 0x08001000a0088ff0 > |---stackaddr----| |----lrsave------|: > 0x000000011023aea0 - 0x0900000008800730 libjvm.so::os::Aix::platform_print_native_stack(outputStream*, void const*, char*, int, unsigned char*&)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:5 parmsonstk:1) > 0x000000011023af20 - 0x0900000008800644 libjvm.so::NativeStackPrinter::print_stack(outputStream*, char*, int, unsigned char*&, bool, int)+0x60 (C++ uses_alloca saves_cr saves_lr stores_bc gpr_saved:6 fixedparms:7 parmsonstk:1) > 0x000000011023afc0 - 0x09000000087f6ff8 libjvm.so::VMError::report(outputStream*, bool)+0x11f0 (C++ fp_present uses_alloca saves_cr saves_lr stores_bc gpr_saved:13 fixedparms:2 parmsonstk:1) > 0x000000011023b830 - 0x09000000087e9fdc libjvm.so::VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x6a0 (C++ uses_alloca saves_lr stores_bc gpr_saved:18 fixedparms:8 parmsonstk:1) > 0x000000011023ba10 - 0x09000000087e96a0 libjvm.so::report_vm_error(char const*, int, char const*, char const*, ...)+0xa0 (C++ uses_alloca saves_lr stores_bc gpr_saved:5 fixedparms:4 parmsonstk:1) > 0x000000011023bad0 - 0x09000000087e95cc libjvm.so::report_vm_error(char const*, int, char const*)+0x24 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:3 parmsonstk:1) > 0x000000011023bb50 - 
0x09000000087e956c libjvm.so::report_should_not_reach_here(char const*, int)+0x20 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) > 0x000000011023bbd0 - 0x0900000008906e5c libjvm.so::CompiledDirectCall::emit_to_interp_stub(MacroAssembler*, unsigned char*)+0x28 (C++ uses_alloca saves_lr stores_bc gpr_saved:1 fixedparms:2 parmsonstk:1) > 0x000000011023bc50 - 0x090000... This pull request has now been integrated. Changeset: 2af76de0 Author: Matthias Baesken URL: https://git.openjdk.org/jdk/commit/2af76de05a50dee052307b8b82055a4787e96df9 Stats: 11 lines in 1 file changed: 0 ins; 8 del; 3 mod 8350683: Non-C2 / minimal JVM crashes in the build on ppc64 platforms Co-authored-by: Martin Doerr Reviewed-by: mdoerr, amitkumar ------------- PR: https://git.openjdk.org/jdk/pull/23794 From chagedorn at openjdk.org Fri Feb 28 08:07:55 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 28 Feb 2025 08:07:55 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v5] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 21:13:36 GMT, Marc Chevalier wrote: >> Remove frem and drem macros nodes when the result is not used. These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Factor testing whether a node is a data proj of a pure function Update looks good. One more suggestion for the IR tests. 
test/hotspot/jtreg/compiler/c2/irTests/ModDNodeTests.java line 124:

> 122:
> 123:     @Test
> 124:     @IR(failOn = {"drem"}, phase = CompilePhase.BEFORE_MATCHING)

I thought about this again and I would generally be as precise as possible and do the matching on the first `CompilePhase` where we expect the node to be removed (if such a unique `CompilePhase` exists; otherwise pick one that is as early as possible and unique). Moreover, it would be good to have a separate `IRNode` entry for `ModF/D`. I skimmed through the `IRNode` file and it seems that we are missing a `macroNodes()`-specific method that makes sure we only match macro nodes on the corresponding `CompilePhases`. You could add the following to `IRNode.java` (untested):

    public static final String MOD_F = PREFIX + "MOD_F" + POSTFIX;
    static {
        macroNodes(MOD_F, "ModF");
    }

    public static final String MOD_D = PREFIX + "MOD_D" + POSTFIX;
    static {
        macroNodes(MOD_D, "ModD");
    }

    ...

    /**
     * Apply {@code regex} on all ideal graph phases up to and including {@link CompilePhase#BEFORE_MACRO_EXPANSION}.
     */
    private static void macroNodes(String irNodePlaceholder, String regex) {
        IR_NODE_MAPPINGS.put(irNodePlaceholder, new SinglePhaseRangeEntry(CompilePhase.BEFORE_MACRO_EXPANSION, regex,
                                                                          CompilePhase.BEFORE_STRINGOPTS,
                                                                          CompilePhase.BEFORE_MACRO_EXPANSION));
    }

`macroNodes()` could be added after `allocNodes()`, for example. This allows you to update this IR rule, for example, to:

    @IR(failOn = IRNode.MOD_D, phase = CompilePhase.ITER_GVN1)

I suggest also adding a positive rule to check that the IR actually contained the node that you now assume is gone:

    @IR(counts = {IRNode.MOD_D, "1"}, phase = CompilePhase.AFTER_PARSING)

test/hotspot/jtreg/compiler/c2/irTests/ModDNodeTests.java line 139:

> 137:
> 138:     @Test
> 139:     @IR(failOn = {"drem"}, phase = CompilePhase.BEFORE_MATCHING)

This could then be matched on `CompilePhase.PHASEIDEALLOOP1` and the positive rule on `CompilePhase.BEFORE_CLOOPS`, for example.
------------- PR Review: https://git.openjdk.org/jdk/pull/23694#pullrequestreview-2649984896 PR Review Comment: https://git.openjdk.org/jdk/pull/23694#discussion_r1974948377 PR Review Comment: https://git.openjdk.org/jdk/pull/23694#discussion_r1974975303 From dlunden at openjdk.org Fri Feb 28 09:00:40 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 28 Feb 2025 09:00:40 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v5] In-Reply-To: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> References: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> Message-ID: > When searching for load anti-dependences in GCM, the memory state for the load is sometimes represented not only by the memory node input of the load, but also other memory nodes. Because PhaseCFG::insert_anti_dependences searches for anti-dependences only from the load's memory input, it is, therefore, possible to sometimes overlook anti-dependences. The result is that loads are potentially scheduled too late, after stores that redefine the memory states of the loads. > > ### Changeset > > It is not yet clear why multiple nodes sometimes represent the memory state of a load, nor if this is expected. We can, however, resolve all the miscompiled test cases seen in this issue by improving the idealization of Phi nodes. Specifically, there is an idealization where we split Phis through input MergeMems, that we, prior to this changeset, applied too conservatively. > > To illustrate the idealization and how it resolves this issue, consider the example below. > > ![failure-graph-1](https://github.com/user-attachments/assets/ecbd204f-bdf0-49cb-a62e-8081d08cfe0c) > > `64 membar_release` is a critical anti-dependence for `183 loadI`. 
The anti-dependence search starts at the load's direct memory input, `107 Phi`, and stops immediately at Phis. Therefore, the search ends at `106 Phi` and we never find `64 membar_release`. > > We can apply the split-through-MergeMem Phi idealization to `119 Phi`. This idealization pushes `119 Phi` through `120 MergeMem` and `121 MergeMem`, splitting it into the individual inputs of the MergeMems in the process. As a result, we replace `119 Phi` with two new Phis. One of these generated Phis has identical inputs to `107 Phi` (`106 Phi` and `104 Phi`), and further idealizations will merge this new Phi and `107 Phi`. As a result, `107 Phi` then has a Phi-free path to `64 membar_release` and we correctly discover the anti-dependence. > > The changeset consists of the following changes. > - Add an analysis that allows applying the split-through-MergeMem idealization in more cases than before (including in the above example) while still ensuring termination. > - Add a missing `ResourceMark` in `PhiNode::split_out_instance`. > - Add multiple new regression tests in `TestGCMLoadPlacement.java`. > > For reference, [here](https://github.com/openjdk/jdk/pull/22852) is a previous PR with an alternative fix that we decided to discard in favor of the fix in this PR. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/13394882532) > - `tier1` to `tier4` (an... 
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Update after Christian's review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23691/files - new: https://git.openjdk.org/jdk/pull/23691/files/8e009abe..892bf5f6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23691&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23691&range=03-04 Stats: 97 lines in 3 files changed: 46 ins; 41 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/23691.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23691/head:pull/23691 PR: https://git.openjdk.org/jdk/pull/23691 From dlunden at openjdk.org Fri Feb 28 09:00:40 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Fri, 28 Feb 2025 09:00:40 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v4] In-Reply-To: References: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> Message-ID: On Tue, 25 Feb 2025 19:11:15 GMT, Daniel Lundén wrote: >> When searching for load anti-dependences in GCM, the memory state for the load is sometimes represented not only by the memory node input of the load, but also other memory nodes. Because PhaseCFG::insert_anti_dependences searches for anti-dependences only from the load's memory input, it is, therefore, possible to sometimes overlook anti-dependences. The result is that loads are potentially scheduled too late, after stores that redefine the memory states of the loads. >> >> ### Changeset >> >> It is not yet clear why multiple nodes sometimes represent the memory state of a load, nor if this is expected. We can, however, resolve all the miscompiled test cases seen in this issue by improving the idealization of Phi nodes.
Specifically, there is an idealization where we split Phis through input MergeMems, that we, prior to this changeset, applied too conservatively. >> >> To illustrate the idealization and how it resolves this issue, consider the example below. >> >> ![failure-graph-1](https://github.com/user-attachments/assets/ecbd204f-bdf0-49cb-a62e-8081d08cfe0c) >> >> `64 membar_release` is a critical anti-dependence for `183 loadI`. The anti-dependence search starts at the load's direct memory input, `107 Phi`, and stops immediately at Phis. Therefore, the search ends at `106 Phi` and we never find `64 membar_release`. >> >> We can apply the split-through-MergeMem Phi idealization to `119 Phi`. This idealization pushes `119 Phi` through `120 MergeMem` and `121 MergeMem`, splitting it into the individual inputs of the MergeMems in the process. As a result, we replace `119 Phi` with two new Phis. One of these generated Phis has identical inputs to `107 Phi` (`106 Phi` and `104 Phi`), and further idealizations will merge this new Phi and `107 Phi`. As a result, `107 Phi` then has a Phi-free path to `64 membar_release` and we correctly discover the anti-dependence. >> >> The changeset consists of the following changes. >> - Add an analysis that allows applying the split-through-MergeMem idealization in more cases than before (including in the above example) while still ensuring termination. >> - Add a missing `ResourceMark` in `PhiNode::split_out_instance`. >> - Add multiple new regression tests in `TestGCMLoadPlacement.java`. >> >> For reference, [here](https://github.com/openjdk/jdk/pull/22852) is a previous PR with an alternative fix that we decided to discard in favor of the fix in this PR. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/ac... > > Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: > > Fix subtle bug introduced in previous update Thanks for the reviews! > Thanks for the credit! 
Maybe you also want to add Emanuel as contributor since he also joined in the discussions :-) Thanks for reminding me. Also adding @merykitty for previous involvement in discussions. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23691#issuecomment-2690083194 From dlunden at openjdk.org Fri Feb 28 09:00:41 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Fri, 28 Feb 2025 09:00:41 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v4] In-Reply-To: References: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> Message-ID: <-kpJ15v9YZsu9G5Cr75jeHmdplV0H9x0t1t3xYhyw6A=.3f569932-dbd5-4e54-920f-368560f2c808@github.com> On Thu, 27 Feb 2025 14:21:23 GMT, Christian Hagedorn wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix subtle bug introduced in previous update > > src/hotspot/share/opto/cfgnode.cpp line 2393: > >> 2391: // non-termination. >> 2392: uint merge_width = 0; >> 2393: bool split_must_terminate = false; // Is splitting guaranteed to terminate? > > Needed to read it twice to understand it. How about `split_always_terminates` instead to make it more clear? Fine with me, now changed. > src/hotspot/share/opto/cfgnode.cpp line 2435: > >> 2433: ResourceMark rm; >> 2434: VectorSet visited; >> 2435: Node_List worklist; > > You could also use a `Unique_Node_List` instead of a `Node_List` + `VectorSet`: > > Unique_Node_List worklist; > for (uint i = 0; i < worklist.size(); i++) { > } Correct me if I'm wrong, but doesn't this pattern lead to higher memory consumption because we need to keep already visited nodes in the worklist? It probably doesn't matter much in practice, but I find the current approach more hygienic.
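For readers following the `Node_List` + `VectorSet` versus `Unique_Node_List` exchange, the two traversal styles can be sketched in plain Java. This is only an illustration under stated assumptions, not HotSpot code: the `Node` class, `WorklistPatterns`, and both method names are made up for this sketch, and `ArrayDeque`, `HashSet` and `ArrayList` merely stand in for the C2 containers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class WorklistPatterns {
    /** Hypothetical stand-in for a C2 node: an id plus its input edges. */
    static final class Node {
        final int id;
        final List<Node> inputs = new ArrayList<>();
        Node(int id) { this.id = id; }
    }

    /**
     * Style A: a worklist that is drained plus a separate visited set.
     * Returns {number of nodes processed, peak number of pending worklist entries}.
     */
    static int[] visitWithVisitedSet(Node root) {
        Set<Node> visited = new HashSet<>();
        Deque<Node> worklist = new ArrayDeque<>();
        worklist.push(root);
        visited.add(root);
        int processed = 0;
        int peak = 0;
        while (!worklist.isEmpty()) {
            peak = Math.max(peak, worklist.size());
            Node n = worklist.pop();           // processed nodes leave the worklist
            processed++;
            for (Node in : n.inputs) {
                if (visited.add(in)) {         // add() returns false if already seen
                    worklist.push(in);
                }
            }
        }
        return new int[] {processed, peak};
    }

    /**
     * Style B: an append-only list without duplicates, walked by index.
     * Returns {number of nodes processed, final list size}; the list keeps every visited node.
     */
    static int[] visitWithUniqueList(Node root) {
        List<Node> worklist = new ArrayList<>();
        Set<Node> members = new HashSet<>();   // models the uniqueness check of the list
        worklist.add(root);
        members.add(root);
        for (int i = 0; i < worklist.size(); i++) {
            Node n = worklist.get(i);          // nodes are never removed
            for (Node in : n.inputs) {
                if (members.add(in)) {
                    worklist.add(in);
                }
            }
        }
        return new int[] {worklist.size(), worklist.size()};
    }
}
```

On a four-node chain, style A never holds more than one pending node, while style B ends with all four nodes in its list. The visited set in style A still grows to the full node count, so in this sketch the difference is limited to the worklist storage itself.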
> src/hotspot/share/opto/cfgnode.cpp line 2448: > >> 2446: }; >> 2447: split_must_terminate = true; // Assume no circularity until proven otherwise. >> 2448: while (split_must_terminate && worklist.size() > 0) { > > Seems like you have `split_must_terminate` in this condition because the `break` in the `else` path does not exit both loops. When using a separate method, you could directly return false when finding that `split_must_terminate` is false. Then you can remove `split_must_terminate` from this exit condition. Yes, extracting the termination check to a separate function makes it possible to simplify like you suggest. Thanks! > src/hotspot/share/opto/cfgnode.cpp line 2469: > >> 2467: } >> 2468: } >> 2469: } > > Would it make sense to extract this to a separate method `is_split_through_mergemem_terminating()` (or something similar) which returns the value for `split_must_terminate`? The `PhiNode::Ideal()` method is already extremely large. Yes, sure. Now extracted! > test/hotspot/jtreg/compiler/codegen/TestGCMLoadPlacement.java line 30: > >> 28: * @bug 8333393 >> 29: * @summary Test that loads are not scheduled too late. >> 30: * @run main/othervm -XX:+UnlockDiagnosticVMOptions > > `PerMethodTrapLimit` is a pure product flag, so you can remove `-XX:+UnlockDiagnosticVMOptions`. Same below where you don't use `Stress*` flags (better double check again with a product build to be safe :-) ) Thanks, fixed (and I will rerun tests after reviews). > test/hotspot/jtreg/compiler/codegen/TestGCMLoadPlacement.java line 31: > >> 29: * @summary Test that loads are not scheduled too late. >> 30: * @run main/othervm -XX:+UnlockDiagnosticVMOptions >> 31: * -XX:CompileCommand=quiet > > Is `quiet` required? Not really, removed! > test/hotspot/jtreg/compiler/codegen/TestGCMLoadPlacement.java line 101: > >> 99: >> 100: int test() { >> 101: for (int i = 0; i < 50; ++i) > > You should add missing braces here Fixed, thanks. 
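The suggestion above to move the termination check into its own method, so that a plain `return false` replaces the flag in the loop condition, can be sketched in Java as follows. This is a simplified, hypothetical model: the graph is a map from node id to input ids, "circularity" is reduced to "the start node is reachable from itself", and none of the names come from `cfgnode.cpp`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TerminationCheckSketch {
    /** Flag style: the inner break only leaves the for loop, so the flag must also guard the while loop. */
    static boolean flagStyle(Map<Integer, List<Integer>> inputs, int start) {
        boolean noCycleThroughStart = true; // assume no circularity until proven otherwise
        Deque<Integer> worklist = new ArrayDeque<>();
        Set<Integer> visited = new HashSet<>();
        worklist.push(start);
        while (noCycleThroughStart && !worklist.isEmpty()) {
            int n = worklist.pop();
            for (int in : inputs.getOrDefault(n, List.of())) {
                if (in == start) {
                    noCycleThroughStart = false;
                    break;                  // exits the for loop only
                }
                if (visited.add(in)) {
                    worklist.push(in);
                }
            }
        }
        return noCycleThroughStart;
    }

    /** Extracted-method style: returning directly leaves both loops, so no flag is needed. */
    static boolean earlyReturnStyle(Map<Integer, List<Integer>> inputs, int start) {
        Deque<Integer> worklist = new ArrayDeque<>();
        Set<Integer> visited = new HashSet<>();
        worklist.push(start);
        while (!worklist.isEmpty()) {
            int n = worklist.pop();
            for (int in : inputs.getOrDefault(n, List.of())) {
                if (in == start) {
                    return false;           // circularity found
                }
                if (visited.add(in)) {
                    worklist.push(in);
                }
            }
        }
        return true;                        // worklist drained without reaching start again
    }
}
```

Both methods give the same answer on every input; the second one simply has a shorter loop condition and no mutable flag.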
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1975037150 PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1975038072 PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1975041294 PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1975037516 PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1975042933 PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1975041483 PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1975043261 From mli at openjdk.org Fri Feb 28 09:05:54 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 28 Feb 2025 09:05:54 GMT Subject: RFR: 8350095: RISC-V: Refactor string_compare [v7] In-Reply-To: References: Message-ID: <3GsJNUEl_WzjKx-76PV4-ytooR6czZzlSFoFOlwlgrY=.10d906a7-34e1-482a-988f-3a6b29fb9110@github.com> On Fri, 28 Feb 2025 01:21:24 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with two additional commits since the last revision: >> >> - check short string >> - rename > > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1393: > >> 1391: >> 1392: const int base_offset = isLL ? arrayOopDesc::base_offset_in_bytes(T_BYTE) >> 1393: : arrayOopDesc::base_offset_in_bytes(T_CHAR); > > Sorry, but I don't think we need to distinguish `T_BYTE` and `T_CHAR` here. The reason is that the character storage used for both Latin1 and UTF16 strings is always a byte array [1]. So we should always use `T_BYTE` for both cases. > > [1] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L160 Thanks. As mentioned in this PR's description, there will be another follow-up PR later; I can do it in that one.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23633#discussion_r1975056134 From duke at openjdk.org Fri Feb 28 09:49:02 2025 From: duke at openjdk.org (kuaiwei) Date: Fri, 28 Feb 2025 09:49:02 GMT Subject: RFR: 8347405: MergeStores with reverse bytes order value [v12] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 12:22:16 GMT, kuaiwei wrote: >> kuaiwei has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix for review comments > > Merged with master > @kuaiwei Do you have github actions enabled? Because you only have 1 check passed, and usually there are about 16: ![image](https://private-user-images.githubusercontent.com/32593061/417914921-2e26c2c9-f272-4045-90f7-df8e6ab22066.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NDA3MzYyMDgsIm5iZiI6MTc0MDczNTkwOCwicGF0aCI6Ii8zMjU5MzA2MS80MTc5MTQ5MjEtMmUyNmMyYzktZjI3Mi00MDQ1LTkwZjctZGY4ZTZhYjIyMDY2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMjglMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjI4VDA5NDUwOFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTg3MTkxMDcxNDZmM2M2YzYzNDA2OTlkNzYxNmIzMTM5MTI0MjlkNmE2NDVlMWFkZmVhMDk4OWQ5NzdmMjM3MzMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.QTGeB_vCIu7X_M8_D1SAV_xx-QIxKrJ7dsN22Y6Ppz0) I can't see "Action" tab in jdk repository. I think it need access permission. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23030#issuecomment-2690197327 From roland at openjdk.org Fri Feb 28 09:55:54 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 28 Feb 2025 09:55:54 GMT Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value [v4] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 18:17:19 GMT, Kangcheng Xu wrote: >> [JDK-8347555](https://bugs.openjdk.org/browse/JDK-8347555) is a redo of [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495), which was [first merged](https://git.openjdk.org/jdk/pull/20754) and then backed out due to a regression. This patch redoes the feature and fixes the bit shift overflow problem. For more information please refer to the previous PR. >> >> When computing constant multipliers for multiplications (possibly in the form of `lshifts`), the multiplier is upgraded to long and then later narrowed to int if needed. However, when an `lshift` operand is exactly `32`, overflowing an int, using long gives an unexpected result (i.e., `(1 << 32) = 1` but `(int) (1L << 32) = 0`). >> >> The following was implemented to address this issue. >> >> if (UseNewCode2) { >> *multiplier = bt == T_INT >> ? (jlong) (1 << con->get_int()) // loss of precision is expected for int as it overflows >> : ((jlong) 1) << con->get_int(); >> } else { >> *multiplier = ((jlong) 1 << con->get_int()); >> } >> >> >> Two new bitshift overflow tests were added. > > Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: > > update license header year Changes requested by roland (Reviewer). src/hotspot/share/opto/addnode.cpp line 524: > 522: } > 523: > 524: lhs_multiplier = bt == T_INT Why isn't it: `java_shift_left(1, con->get_int(), bt)` ?
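The two results quoted in the PR description, `(1 << 32) = 1` and `(int) (1L << 32) = 0`, follow from Java's shift rules: an `int` shift uses only the low 5 bits of the distance, a `long` shift the low 6 bits. The sketch below reproduces them in plain Java. The class and method names are hypothetical, and it demonstrates Java language semantics only; it makes no claim about how the C++ expression in the patch behaves.

```java
final class ShiftOverflowDemo {
    /** int shift: the distance is masked with 0x1f, so a distance of 32 behaves like 0. */
    static int intShift(int distance) {
        return 1 << distance;
    }

    /** long shift narrowed to int: 1L << 32 is 2^32, whose low 32 bits are all zero. */
    static int longShiftNarrowed(int distance) {
        return (int) (1L << distance);
    }

    /**
     * Hypothetical helper modelled on the quoted snippet: for int operands the shift
     * is done in int arithmetic first and only then widened to long.
     */
    static long multiplierFor(boolean isInt, int distance) {
        return isInt ? (long) (1 << distance) : 1L << distance;
    }
}
```

With a distance of 32, `intShift` yields 1 while `longShiftNarrowed` yields 0, which is the mismatch the patch has to account for. For distances below 32 the two agree, including 31, where both give `Integer.MIN_VALUE`.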
src/hotspot/share/opto/addnode.cpp line 546: > 544: } > 545: > 546: *multiplier = lhs_multiplier + (bt == T_INT Same here ------------- PR Review: https://git.openjdk.org/jdk/pull/23506#pullrequestreview-2650277818 PR Review Comment: https://git.openjdk.org/jdk/pull/23506#discussion_r1975130092 PR Review Comment: https://git.openjdk.org/jdk/pull/23506#discussion_r1975130496 From mli at openjdk.org Fri Feb 28 10:28:27 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 28 Feb 2025 10:28:27 GMT Subject: RFR: 8350931: RISC-V: remove unnecessary src register for fp_sqrt_d/f Message-ID: Hi, Can you review this simple patch? It removes the unnecessary src register for fp_sqrt_d/f pipe_class, as sqrt has only one src register. Thanks ------------- Commit messages: - initial commit Changes: https://git.openjdk.org/jdk/pull/23839/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23839&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350931 Stats: 6 lines in 1 file changed: 0 ins; 2 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23839.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23839/head:pull/23839 PR: https://git.openjdk.org/jdk/pull/23839 From chagedorn at openjdk.org Fri Feb 28 10:43:55 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 28 Feb 2025 10:43:55 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v5] In-Reply-To: References: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> Message-ID: On Fri, 28 Feb 2025 09:00:40 GMT, Daniel Lundén wrote: >> When searching for load anti-dependences in GCM, the memory state for the load is sometimes represented not only by the memory node input of the load, but also other memory nodes. Because PhaseCFG::insert_anti_dependences searches for anti-dependences only from the load's memory input, it is, therefore, possible to sometimes overlook anti-dependences.
The result is that loads are potentially scheduled too late, after stores that redefine the memory states of the loads. >> >> ### Changeset >> >> It is not yet clear why multiple nodes sometimes represent the memory state of a load, nor if this is expected. We can, however, resolve all the miscompiled test cases seen in this issue by improving the idealization of Phi nodes. Specifically, there is an idealization where we split Phis through input MergeMems, that we, prior to this changeset, applied too conservatively. >> >> To illustrate the idealization and how it resolves this issue, consider the example below. >> >> ![failure-graph-1](https://github.com/user-attachments/assets/ecbd204f-bdf0-49cb-a62e-8081d08cfe0c) >> >> `64 membar_release` is a critical anti-dependence for `183 loadI`. The anti-dependence search starts at the load's direct memory input, `107 Phi`, and stops immediately at Phis. Therefore, the search ends at `106 Phi` and we never find `64 membar_release`. >> >> We can apply the split-through-MergeMem Phi idealization to `119 Phi`. This idealization pushes `119 Phi` through `120 MergeMem` and `121 MergeMem`, splitting it into the individual inputs of the MergeMems in the process. As a result, we replace `119 Phi` with two new Phis. One of these generated Phis has identical inputs to `107 Phi` (`106 Phi` and `104 Phi`), and further idealizations will merge this new Phi and `107 Phi`. As a result, `107 Phi` then has a Phi-free path to `64 membar_release` and we correctly discover the anti-dependence. >> >> The changeset consists of the following changes. >> - Add an analysis that allows applying the split-through-MergeMem idealization in more cases than before (including in the above example) while still ensuring termination. >> - Add a missing `ResourceMark` in `PhiNode::split_out_instance`. >> - Add multiple new regression tests in `TestGCMLoadPlacement.java`. 
>> >> For reference, [here](https://github.com/openjdk/jdk/pull/22852) is a previous PR with an alternative fix that we decided to discard in favor of the fix in this PR. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/ac... > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update after Christian's review Thanks for the updates, it looks good to me! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23691#pullrequestreview-2650389788 From chagedorn at openjdk.org Fri Feb 28 10:43:55 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 28 Feb 2025 10:43:55 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v4] In-Reply-To: <-kpJ15v9YZsu9G5Cr75jeHmdplV0H9x0t1t3xYhyw6A=.3f569932-dbd5-4e54-920f-368560f2c808@github.com> References: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> <-kpJ15v9YZsu9G5Cr75jeHmdplV0H9x0t1t3xYhyw6A=.3f569932-dbd5-4e54-920f-368560f2c808@github.com> Message-ID: On Fri, 28 Feb 2025 08:52:12 GMT, Daniel Lundén wrote: >> src/hotspot/share/opto/cfgnode.cpp line 2435: >> >>> 2433: ResourceMark rm; >>> 2434: VectorSet visited; >>> 2435: Node_List worklist; >> >> You could also use a `Unique_Node_List` instead of a `Node_List` + `VectorSet`: >> >> Unique_Node_List worklist; >> for (uint i = 0; i < worklist.size(); i++) { >> } > > Correct me if I'm wrong, but doesn't this pattern lead to higher memory consumption because we need to keep already visited nodes in the worklist? It probably doesn't matter much in practice, but I find the current approach more hygienic. It's true that we have more memory consumption, but it's probably negligible. However, we seem to use both approaches throughout our code base, so I don't have a strong opinion about which version we use here.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1975197336 From mli at openjdk.org Fri Feb 28 11:46:37 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 28 Feb 2025 11:46:37 GMT Subject: RFR: 8350940: RISC-V: remove unnecessary assert_different_registers in minmax_fp Message-ID: <8fxIj9ChMAOETSVV62zYzhRZRfCmEDRtvMc88hncB5E=.4c39c81b-2004-4d72-8aa2-ff63be3996a4@github.com> Hi, Can you help to review this simple change? Seems to me it's not necessary to assert_different_registers between dst/src1/src2 in minmax_fp. Thanks ------------- Commit messages: - initial commit Changes: https://git.openjdk.org/jdk/pull/23842/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23842&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350940 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23842.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23842/head:pull/23842 PR: https://git.openjdk.org/jdk/pull/23842 From mli at openjdk.org Fri Feb 28 11:50:08 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 28 Feb 2025 11:50:08 GMT Subject: RFR: 8350940: RISC-V: remove unnecessary assert_different_registers in minmax_fp [v2] In-Reply-To: <8fxIj9ChMAOETSVV62zYzhRZRfCmEDRtvMc88hncB5E=.4c39c81b-2004-4d72-8aa2-ff63be3996a4@github.com> References: <8fxIj9ChMAOETSVV62zYzhRZRfCmEDRtvMc88hncB5E=.4c39c81b-2004-4d72-8aa2-ff63be3996a4@github.com> Message-ID: > Hi, > Can you help to review this simple change? > Seems to me it's not necessary to assert_different_registers between dst/src1/src2 in minmax_fp. > > Thanks Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains two commits: - merge master - initial commit ------------- Changes: https://git.openjdk.org/jdk/pull/23842/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23842&range=01 Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23842.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23842/head:pull/23842 PR: https://git.openjdk.org/jdk/pull/23842 From rrich at openjdk.org Fri Feb 28 11:59:55 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 28 Feb 2025 11:59:55 GMT Subject: RFR: 8347405: MergeStores with reverse bytes order value [v12] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 09:46:23 GMT, kuaiwei wrote: > > @kuaiwei Do you have github actions enabled? Because you only have 1 check passed, and usually there are about 16: ![image](https://private-user-images.githubusercontent.com/32593061/417914921-2e26c2c9-f272-4045-90f7-df8e6ab22066.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NDA3MzYyMDgsIm5iZiI6MTc0MDczNTkwOCwicGF0aCI6Ii8zMjU5MzA2MS80MTc5MTQ5MjEtMmUyNmMyYzktZjI3Mi00MDQ1LTkwZjctZGY4ZTZhYjIyMDY2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMjglMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjI4VDA5NDUwOFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTg3MTkxMDcxNDZmM2M2YzYzNDA2OTlkNzYxNmIzMTM5MTI0MjlkNmE2NDVlMWFkZmVhMDk4OWQ5NzdmMjM3MzMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.QTGeB_vCIu7X_M8_D1SAV_xx-QIxKrJ7dsN22Y6Ppz0) > > I can't see "Action" tab in jdk repository. I think it need access permission. Likely it is disabled. Please check the settings (top middle) of your jdk repository. There's an "Actions" section. You should "Allow all actions and reusable workflows". 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23030#issuecomment-2690461892 From rrich at openjdk.org Fri Feb 28 12:14:07 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 28 Feb 2025 12:14:07 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v3] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Wed, 19 Feb 2025 00:37:14 GMT, Dean Long wrote: >> When calling a MethodHandle linker, such as linkToStatic, we drop the last argument, which causes a mismatch between what the caller pushed and what the callee received. In deoptimization, we check for this in several places, but in one place we had outdated code. See the bug for the gory details. >> >> In this PR I add asserts and a test to reproduce the problem, plus the necessary fixes in deoptimizations. There are other inefficiencies in deoptimization that I didn't address, hoping to simplify the fix for backports. >> >> Some platforms align locals according to the caller during deoptimization, while some align locals according to the callee. The asserts I added compute locals both ways and check that they are still within the frame. I attempted this on all platforms, but am only able to test x64 and aarch64. I need help testing those asserts for arm32, ppc, riscv, and s390. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > Stricter assertion on ppc64 Marked as reviewed by rrich (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/23557#pullrequestreview-2650575715 From rrich at openjdk.org Fri Feb 28 12:14:09 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 28 Feb 2025 12:14:09 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v3] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Thu, 27 Feb 2025 17:44:05 GMT, Patricio Chilano Mateo wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> Stricter assertion on ppc64 > > src/hotspot/share/runtime/deoptimization.cpp line 645: > >> 643: methodHandle method(current, deopt_sender.interpreter_frame_method()); >> 644: Bytecode_invoke cur(method, deopt_sender.interpreter_frame_bci()); >> 645: if (!cur.is_invokedynamic() && MethodHandles::has_member_arg(cur.klass(), cur.name())) { > > I was confused with this new condition but I see is the same we have in `vframeArray::unpack_to_stack()`. +1 I see there's also an assertion in `ConstantPool::klass_ref_index_at()`. It might be worth a little comment. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23557#discussion_r1975310438 From fyang at openjdk.org Fri Feb 28 12:16:55 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 28 Feb 2025 12:16:55 GMT Subject: RFR: 8350940: RISC-V: remove unnecessary assert_different_registers in minmax_fp [v2] In-Reply-To: References: <8fxIj9ChMAOETSVV62zYzhRZRfCmEDRtvMc88hncB5E=.4c39c81b-2004-4d72-8aa2-ff63be3996a4@github.com> Message-ID: <5x1sGGVPVH_9cTlJ_70P9BFWGxggQyb8EOerJHGizYU=.3e41c872-2e47-4ae6-9d47-3bb47a953b55@github.com> On Fri, 28 Feb 2025 11:50:08 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this simple change? >> Seems to me it's not necessary to assert_different_registers between dst/src1/src2 in minmax_fp. 
>> >> Thanks > > Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - merge master > - initial commit src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 2140: > 2138: assert_different_registers(dst, src1); > 2139: assert_different_registers(dst, src2); > 2140: Ah, seems I missed this in my last PR :-) Looks reasonable to me. But we should remove the `TEMP_DEF dst` effect from the related match rule in the riscv.ad file where it is called. One example: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/riscv.ad#L7290 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23842#discussion_r1975314192 From duke at openjdk.org Fri Feb 28 12:25:28 2025 From: duke at openjdk.org (Marc Chevalier) Date: Fri, 28 Feb 2025 12:25:28 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v6] In-Reply-To: References: Message-ID: > Remove frem and drem macro nodes when the result is not used. These nodes have other outputs (like memory), which are not meaningful but prevent the nodes from being dropped easily. This patch removes the useless frem/drem nodes by rewiring the inputs to the outputs.
> > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: more precise testing ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23694/files - new: https://git.openjdk.org/jdk/pull/23694/files/2582890a..3c5fb248 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23694&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23694&range=04-05 Stats: 36 lines in 3 files changed: 36 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23694.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23694/head:pull/23694 PR: https://git.openjdk.org/jdk/pull/23694 From mli at openjdk.org Fri Feb 28 12:41:53 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 28 Feb 2025 12:41:53 GMT Subject: RFR: 8350940: RISC-V: remove unnecessary assert_different_registers in minmax_fp [v3] In-Reply-To: <8fxIj9ChMAOETSVV62zYzhRZRfCmEDRtvMc88hncB5E=.4c39c81b-2004-4d72-8aa2-ff63be3996a4@github.com> References: <8fxIj9ChMAOETSVV62zYzhRZRfCmEDRtvMc88hncB5E=.4c39c81b-2004-4d72-8aa2-ff63be3996a4@github.com> Message-ID: > Hi, > Can you help to review this simple change? > Seems to me it's not necessary to assert_different_registers between dst/src1/src2 in minmax_fp. 
> > Thanks Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: remove unnecessary effect ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23842/files - new: https://git.openjdk.org/jdk/pull/23842/files/e910c620..f6fba6bf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23842&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23842&range=01-02 Stats: 4 lines in 1 file changed: 0 ins; 4 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23842.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23842/head:pull/23842 PR: https://git.openjdk.org/jdk/pull/23842 From mli at openjdk.org Fri Feb 28 12:41:54 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 28 Feb 2025 12:41:54 GMT Subject: RFR: 8350940: RISC-V: remove unnecessary assert_different_registers in minmax_fp [v3] In-Reply-To: <5x1sGGVPVH_9cTlJ_70P9BFWGxggQyb8EOerJHGizYU=.3e41c872-2e47-4ae6-9d47-3bb47a953b55@github.com> References: <8fxIj9ChMAOETSVV62zYzhRZRfCmEDRtvMc88hncB5E=.4c39c81b-2004-4d72-8aa2-ff63be3996a4@github.com> <5x1sGGVPVH_9cTlJ_70P9BFWGxggQyb8EOerJHGizYU=.3e41c872-2e47-4ae6-9d47-3bb47a953b55@github.com> Message-ID: On Fri, 28 Feb 2025 12:13:15 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> remove unnecessary effect > > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 2140: > >> 2138: assert_different_registers(dst, src1); >> 2139: assert_different_registers(dst, src2); >> 2140: > > Ah, Seems I missed this in my last PR :-) Looks reasonable to me. > But we should remove the `TEMP_DEF dst` effect from the related match rule in riscv.ad file where it is called. > One example: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/riscv.ad#L7290 Yes, will do it. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23842#discussion_r1975344776 From fyang at openjdk.org Fri Feb 28 12:48:59 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 28 Feb 2025 12:48:59 GMT Subject: RFR: 8350940: RISC-V: remove unnecessary assert_different_registers in minmax_fp [v3] In-Reply-To: References: <8fxIj9ChMAOETSVV62zYzhRZRfCmEDRtvMc88hncB5E=.4c39c81b-2004-4d72-8aa2-ff63be3996a4@github.com> <5x1sGGVPVH_9cTlJ_70P9BFWGxggQyb8EOerJHGizYU=.3e41c872-2e47-4ae6-9d47-3bb47a953b55@github.com> Message-ID: On Fri, 28 Feb 2025 12:38:43 GMT, Hamlin Li wrote: >> src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 2140: >> >>> 2138: assert_different_registers(dst, src1); >>> 2139: assert_different_registers(dst, src2); >>> 2140: >> >> Ah, seems I missed this in my last PR :-) Looks reasonable to me. >> But we should remove the `TEMP_DEF dst` effect from the related match rule in the riscv.ad file where it is called. >> One example: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/riscv.ad#L7290 > > Yes, will do it. Thanks. But I think you should keep the `KILL cr` effect, as this assembler routine modifies `t1` and thus kills `cr`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23842#discussion_r1975352939 From fyang at openjdk.org Fri Feb 28 12:50:00 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 28 Feb 2025 12:50:00 GMT Subject: RFR: 8350931: RISC-V: remove unnecessary src register for fp_sqrt_d/f In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 10:19:13 GMT, Hamlin Li wrote: > Hi, > Can you review this simple patch? > It removes the unnecessary src register for fp_sqrt_d/f pipe_class, as sqrt has only one src register. > Thanks Marked as reviewed by fyang (Reviewer).
------------- PR Review: https://git.openjdk.org/jdk/pull/23839#pullrequestreview-2650649639 From mli at openjdk.org Fri Feb 28 12:56:10 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 28 Feb 2025 12:56:10 GMT Subject: RFR: 8350940: RISC-V: remove unnecessary assert_different_registers in minmax_fp [v4] In-Reply-To: <8fxIj9ChMAOETSVV62zYzhRZRfCmEDRtvMc88hncB5E=.4c39c81b-2004-4d72-8aa2-ff63be3996a4@github.com> References: <8fxIj9ChMAOETSVV62zYzhRZRfCmEDRtvMc88hncB5E=.4c39c81b-2004-4d72-8aa2-ff63be3996a4@github.com> Message-ID: <4PGqc4-JerCL5wzS_4DJDCpiP0Tc6xN5s7HH2LuV9Ao=.5a2ca4a7-d660-4bdb-ac85-4e2d0f2a15de@github.com> > Hi, > Can you help to review this simple change? > Seems to me it's not necessary to assert_different_registers between dst/src1/src2 in minmax_fp. > > Thanks Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: keep cr/t1 in effect ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23842/files - new: https://git.openjdk.org/jdk/pull/23842/files/f6fba6bf..a5d9782c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23842&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23842&range=02-03 Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23842.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23842/head:pull/23842 PR: https://git.openjdk.org/jdk/pull/23842 From mli at openjdk.org Fri Feb 28 12:56:10 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 28 Feb 2025 12:56:10 GMT Subject: RFR: 8350940: RISC-V: remove unnecessary assert_different_registers in minmax_fp [v4] In-Reply-To: References: <8fxIj9ChMAOETSVV62zYzhRZRfCmEDRtvMc88hncB5E=.4c39c81b-2004-4d72-8aa2-ff63be3996a4@github.com> <5x1sGGVPVH_9cTlJ_70P9BFWGxggQyb8EOerJHGizYU=.3e41c872-2e47-4ae6-9d47-3bb47a953b55@github.com> Message-ID: On Fri, 28 Feb 2025 12:45:47 GMT, Fei Yang wrote: >> Yes, will do it. > > Thanks. 
But I think you should keep the `KILL cr` effect, as this assembler routine modifies `t1` and thus kills `cr`. Ah, I missed this, thanks for catching! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23842#discussion_r1975367748 From dlunden at openjdk.org Fri Feb 28 13:10:59 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Fri, 28 Feb 2025 13:10:59 GMT Subject: RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v4] In-Reply-To: References: <2HzvnZfO23KmMBnTXVx1fi3xeOCjeGlHFVsTijaFK7c=.0d511c73-505b-4db1-8622-6e823a1e2f0a@github.com> <-kpJ15v9YZsu9G5Cr75jeHmdplV0H9x0t1t3xYhyw6A=.3f569932-dbd5-4e54-920f-368560f2c808@github.com> Message-ID: On Fri, 28 Feb 2025 10:40:55 GMT, Christian Hagedorn wrote: >> Correct me if I'm wrong, but doesn't this pattern lead to higher memory consumption because we need to keep already visited nodes in the worklist? It probably doesn't matter much in practice, but I find the current approach more hygienic. > > It's true that we have more memory consumption. But it's probably negligible. However, we seem to use both approaches throughout our code base, so I don't have a strong opinion about which version we use here. OK, I'll leave it as is then for this changeset. Perhaps we should have a broader discussion at some point to make this consistent across a larger part of the code base.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23691#discussion_r1975386068 From fyang at openjdk.org Fri Feb 28 13:26:56 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 28 Feb 2025 13:26:56 GMT Subject: RFR: 8350940: RISC-V: remove unnecessary assert_different_registers in minmax_fp [v4] In-Reply-To: <4PGqc4-JerCL5wzS_4DJDCpiP0Tc6xN5s7HH2LuV9Ao=.5a2ca4a7-d660-4bdb-ac85-4e2d0f2a15de@github.com> References: <8fxIj9ChMAOETSVV62zYzhRZRfCmEDRtvMc88hncB5E=.4c39c81b-2004-4d72-8aa2-ff63be3996a4@github.com> <4PGqc4-JerCL5wzS_4DJDCpiP0Tc6xN5s7HH2LuV9Ao=.5a2ca4a7-d660-4bdb-ac85-4e2d0f2a15de@github.com> Message-ID: On Fri, 28 Feb 2025 12:56:10 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this simple change? >> Seems to me it's not necessary to assert_different_registers between dst/src1/src2 in minmax_fp. >> >> Thanks > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > keep cr/t1 in effect Marked as reviewed by fyang (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23842#pullrequestreview-2650741746 From dfenacci at openjdk.org Fri Feb 28 13:37:13 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Fri, 28 Feb 2025 13:37:13 GMT Subject: RFR: 8302459: Missing late inline cleanup causes compiler/vectorapi/VectorLogicalOpIdentityTest.java IR failure [v3] In-Reply-To: References: Message-ID: > # Issue > > The `compiler/vectorapi/VectorLogicalOpIdentityTest.java` has been failing because C2 compiling the test `testAndMaskSameValue1` expects to have 1 `AndV` node but it has none. > > # Cause > > The issue has to do with the criteria that trigger a cleanup when performing late inlining. In the failing test, when the compiler tries to inline a `jdk.internal.vm.vector.VectorSupport::binaryOp` call, it fails because its argument is of the wrong type, mainly because some cast nodes "hide" the more "precise" type.
> The graph that leads to the issue looks like this: > ![1BCE8148-1E44-4CA1-AF8F-EFC6210FA740](https://github.com/user-attachments/assets/62dd917f-2dac-42a9-90cf-73eedcd3cf8a) > The compiler tries to inline `jdk.internal.vm.vector.VectorSupport::load` and it succeeds: > ![752E81C9-A37D-4626-81A9-E4A839FADD3D](https://github.com/user-attachments/assets/e61057b2-3093-4992-ba5a-b80e4000c0ec) > The node `3027 VectorBox` has type `IntMaxVector`. `912 CastPP` and `934 CheckCastPP` have type `IntVector` instead. > The compiler then tries to inline one of the 2 `binaryOp` calls but it fails because it needs an argument of type `IntMaxVector` and the argument it is given, which is node `934 CheckCastPP`, has type `IntVector`. > > This would not happen if a _cleanup_ was triggered between the 2 inlining attempts. IGVN would run and the 2 nodes `912 CastPP` and `934 CheckCastPP` would be folded away. `binaryOp` could then be inlined since the types would match. > > # Solution > > Instead of fixing this specific case we try a more generic approach: when late inlining we keep track of failed intrinsics and re-examine them during IGVN. If the `Ideal` method for their call node is called, we reschedule the intrinsic attempt for that call. > > # Testing > > Additional test runs with `-XX:-TieredCompilation` are added to `VectorLogicalOpIdentityTest.java` and `VectorGatherMaskFoldingTest.java` as regression tests, and `-XX:+IncrementalInlineForceCleanup` is removed from `VectorGatherMaskFoldingTest.java` (it was previously added as a workaround for this issue). > > Tests: Tier 1-4 (windows-x64, linux-x64/aarch64, and macosx-x64/aarch64; release and debug mode) Damon Fenacci has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 40 commits: - JDK-8302459: unneeded changes - JDK-8302459: unneeded changes - JDK-8302459: update assert string - JDK-8302459: fix copyright year - JDK-8302459: fix after merge - Merge branch 'master' into JDK-8302459-new - JDK-8302459: add logging - JDK-8302459: remove todos - JDK-8302459: add check to avoid infinite loop - Merge branch 'master' into JDK-8302459-new - ... and 30 more: https://git.openjdk.org/jdk/compare/a637ccf2...e71e72f5 ------------- Changes: https://git.openjdk.org/jdk/pull/21682/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21682&range=02 Stats: 89 lines in 6 files changed: 36 ins; 3 del; 50 mod Patch: https://git.openjdk.org/jdk/pull/21682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21682/head:pull/21682 PR: https://git.openjdk.org/jdk/pull/21682 From duke at openjdk.org Fri Feb 28 15:09:07 2025 From: duke at openjdk.org (Marc Chevalier) Date: Fri, 28 Feb 2025 15:09:07 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v7] In-Reply-To: References: Message-ID: > Remove frem and drem macro nodes when the result is not used. These nodes have other outputs (like memory) that are not meaningful but prevent the nodes from being dropped easily. This patch removes the useless frem/drem nodes by rewiring the inputs to the outputs.
> > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: more precise testing ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23694/files - new: https://git.openjdk.org/jdk/pull/23694/files/3c5fb248..c21da78d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23694&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23694&range=05-06 Stats: 64 lines in 2 files changed: 60 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23694.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23694/head:pull/23694 PR: https://git.openjdk.org/jdk/pull/23694 From rrich at openjdk.org Fri Feb 28 15:27:05 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 28 Feb 2025 15:27:05 GMT Subject: RFR: 8336042: Caller/callee param size mismatch in deoptimization causes crash [v3] In-Reply-To: References: <4MjR9hdInhuJduDqpTqpGiyo_M_JQ6pM2g5_TgzcSTg=.16037e60-de66-4d0b-861b-19be80ff2751@github.com> Message-ID: On Fri, 28 Feb 2025 12:11:05 GMT, Richard Reingruber wrote: >> src/hotspot/share/runtime/deoptimization.cpp line 645: >> >>> 643: methodHandle method(current, deopt_sender.interpreter_frame_method()); >>> 644: Bytecode_invoke cur(method, deopt_sender.interpreter_frame_bci()); >>> 645: if (!cur.is_invokedynamic() && MethodHandles::has_member_arg(cur.klass(), cur.name())) { >> >> I was confused with this new condition but I see is the same we have in `vframeArray::unpack_to_stack()`. > > +1 > I see there's also an assertion in `ConstantPool::klass_ref_index_at()`. It might be worth a little comment. Actually I think that there should be an abstraction that hides that detail. Probably `has_member_arg` should be a method of `Bytecode_invoke`. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23557#discussion_r1975594243 From thartmann at openjdk.org Fri Feb 28 15:33:58 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 28 Feb 2025 15:33:58 GMT Subject: RFR: 8349637: Integer.numberOfLeadingZeros outputs incorrectly in certain cases [v6] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 02:25:34 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a fix for a miscompile in the AVX2 implementation of `CountLeadingZerosV` for int types. Currently, the implementation turns ints into floats, in order to calculating the leading zeros based on the exponent part of the float. Unfortunately, floats can only accurately represent integers up to 2^24. After that, multiple integer values can map onto the same floating point value. The issue manifests when an int is converted to a floating point representation that is higher than it, crossing a bit boundary. As an example, `(float)0x01FFFFFF == (float)0x02000000`, but `lzcnt(0x01FFFFFF) == 7` and `lzcnt(0x02000000) == 6`. The values are incorrectly rounded up. >> >> This patch fixes the issue by masking the input in the cases where it is larger than 2^24, to set the low bits to 0. Removing these bits prevents the accidental rounding behavior. I've added these cases to`TestNumberOfContinuousZeros`, and removed the set random seed so that it can produce random inputs to test with. >> >> Reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Add Vector API Test Looks good and testing all passed. Ship it! :) ------------- Marked as reviewed by thartmann (Reviewer). 
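The rounding problem described in the 8349637 review above can be reproduced in scalar Java. The following is a minimal sketch, not the HotSpot AVX2 code: the class and method names are hypothetical, and clearing the low 8 bits is an assumption chosen for illustration, since the thread only says the low bits are set to 0 for inputs larger than 2^24.

```java
public class LzcntViaFloat {
    // Scalar model of "derive the leading-zero count from the float exponent".
    // Hypothetical helper; zero and negative inputs are handled separately.
    static int naive(int x) {
        if (x == 0) return 32;
        if (x < 0) return 0;
        // (float) x may round up across a power-of-two boundary once x >= 2^24.
        return 31 - Math.getExponent((float) x);
    }

    // Same idea, but clear the low 8 bits when x >= 2^24 so the conversion
    // to float is exact. The highest set bit is untouched by the mask.
    // The 8-bit mask is an assumption made for this sketch.
    static int masked(int x) {
        if (x == 0) return 32;
        if (x < 0) return 0;
        int safe = (x >= (1 << 24)) ? (x & ~0xFF) : x;
        return 31 - Math.getExponent((float) safe);
    }

    public static void main(String[] args) {
        int x = 0x01FFFFFF;
        System.out.println((float) x == (float) 0x02000000); // true
        System.out.println(Integer.numberOfLeadingZeros(x)); // 7
        System.out.println(naive(x));                        // 6
        System.out.println(masked(x));                       // 7
    }
}
```

A positive int has at most 31 significant bits, so dropping the low 8 leaves at most 23, which fits in the 24-bit float significand without rounding.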
PR Review: https://git.openjdk.org/jdk/pull/23579#pullrequestreview-2651093540 From coleenp at openjdk.org Fri Feb 28 15:49:20 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 28 Feb 2025 15:49:20 GMT Subject: RFR: 8315488: SA ciReplay support is no longer up-to-date with hotspot ciReplay support [v5] In-Reply-To: References: Message-ID: > This change removes the ci, c1 and c2 compiler code from the serviceability agent. The ciReplay functionality is supported inside the jvm and this duplicated functionality in SA had bit rotted so is removed. > Tested with tier1-4. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Review comments and use COMPILER2 preprocessing macro instead of the presence of Matcher. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23782/files - new: https://git.openjdk.org/jdk/pull/23782/files/fca987a4..2d9ab884 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23782&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23782&range=03-04 Stats: 50 lines in 6 files changed: 1 ins; 44 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/23782.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23782/head:pull/23782 PR: https://git.openjdk.org/jdk/pull/23782 From chagedorn at openjdk.org Fri Feb 28 15:56:13 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 28 Feb 2025 15:56:13 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v7] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 15:09:07 GMT, Marc Chevalier wrote: >> Remove frem and drem macros nodes when the result is not used. These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs. 
>> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > more precise testing Nice test updates! Some final comments, then I think it's good to go from my side :) test/hotspot/jtreg/compiler/c2/irTests/ModDNodeTests.java line 151: > 149: // is that they exercise a slightly different reason why the node is being removed, > 150: // and thus a different execution path. In unusedResultAfterLoopOpt1 the modulo is > 151: // used in the traps of the parse predicate. In unusedResultAfterLoopOpt2, it is not. Suggestion: // used in the traps of the parse predicates. In unusedResultAfterLoopOpt2, it is not. test/hotspot/jtreg/compiler/c2/irTests/ModDNodeTests.java line 153: > 151: // used in the traps of the parse predicate. In unusedResultAfterLoopOpt2, it is not. > 152: @Test > 153: @IR(failOn = {"drem"}, phase = CompilePhase.BEFORE_MATCHING) This is not required now anymore since you do the matching on the node directly instead of the expanded call. Same for the other rules below. test/hotspot/jtreg/compiler/c2/irTests/ModFNodeTests.java line 151: > 149: // is that they exercise a slightly different reason why the node is being removed, > 150: // and thus a different execution path. In unusedResultAfterLoopOpt1 the modulo is > 151: // used in the traps of the parse predicate. In unusedResultAfterLoopOpt2, it is not. Suggestion: // used in the traps of the parse predicates. In unusedResultAfterLoopOpt2, it is not. 
test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 2596: > 2594: IR_NODE_MAPPINGS.put(irNodePlaceholder, new SinglePhaseRangeEntry(CompilePhase.BEFORE_MACRO_EXPANSION, regex, > 2595: CompilePhase.BEFORE_STRINGOPTS, > 2596: CompilePhase.BEFORE_MACRO_EXPANSION)); Indentation: Suggestion: IR_NODE_MAPPINGS.put(irNodePlaceholder, new SinglePhaseRangeEntry(CompilePhase.BEFORE_MACRO_EXPANSION, regex, CompilePhase.BEFORE_STRINGOPTS, CompilePhase.BEFORE_MACRO_EXPANSION)); ------------- PR Review: https://git.openjdk.org/jdk/pull/23694#pullrequestreview-2651134729 PR Review Comment: https://git.openjdk.org/jdk/pull/23694#discussion_r1975632096 PR Review Comment: https://git.openjdk.org/jdk/pull/23694#discussion_r1975633790 PR Review Comment: https://git.openjdk.org/jdk/pull/23694#discussion_r1975634535 PR Review Comment: https://git.openjdk.org/jdk/pull/23694#discussion_r1975637962 From coleenp at openjdk.org Fri Feb 28 15:59:09 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 28 Feb 2025 15:59:09 GMT Subject: RFR: 8315488: SA ciReplay support is no longer up-to-date with hotspot ciReplay support [v4] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 18:06:19 GMT, Coleen Phillimore wrote: >> This change removes the ci, c1 and c2 compiler code from the serviceability agent. The ciReplay functionality is supported inside the jvm and this duplicated functionality in SA had bit rotted so is removed. >> Tested with tier1-4. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > More friends. Thanks for looking at the change Chris and Vladimir. With this change, we can further remove these VM_STRUCTS macros like declare_c1_type, declare_c1_toplevel_type, etc. It makes for a bigger change, but only more lines and more tedious. 
------------- PR Review: https://git.openjdk.org/jdk/pull/23782#pullrequestreview-2651124547 From coleenp at openjdk.org Fri Feb 28 15:59:09 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 28 Feb 2025 15:59:09 GMT Subject: RFR: 8315488: SA ciReplay support is no longer up-to-date with hotspot ciReplay support [v6] In-Reply-To: References: Message-ID: > This change removes the ci, c1 and c2 compiler code from the serviceability agent. The ciReplay functionality is supported inside the jvm and this duplicated functionality in SA had bit rotted so is removed. > Tested with tier1-4. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Remove VM_STRUCTS macro arguments. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23782/files - new: https://git.openjdk.org/jdk/pull/23782/files/2d9ab884..aa80d2c6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23782&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23782&range=04-05 Stats: 382 lines in 26 files changed: 1 ins; 218 del; 163 mod Patch: https://git.openjdk.org/jdk/pull/23782.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23782/head:pull/23782 PR: https://git.openjdk.org/jdk/pull/23782 From coleenp at openjdk.org Fri Feb 28 15:59:09 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 28 Feb 2025 15:59:09 GMT Subject: RFR: 8315488: SA ciReplay support is no longer up-to-date with hotspot ciReplay support [v4] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 02:40:10 GMT, Vladimir Kozlov wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> More friends. 
> > src/hotspot/share/runtime/vmStructs.cpp line 706: > >> 704: volatile_nonstatic_field(MonitorList, _head, ObjectMonitor*) \ >> 705: \ >> 706: unchecked_c2_static_field(Matcher, _regEncode, sizeof(Matcher::_regEncode)) /* NOTE: no type */ \ > > I don't see usage in SA of `VMReg::regEncode()` which access this field. This code isn't used but I thought it was somehow needed for stack dumping in one of the Xcomp/SA tests. But it was really using the presence of the Matcher type to see if isServerCompiler(). So I added the COMPILER2 preprocessor macro and use that instead. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23782#discussion_r1975629323 From coleenp at openjdk.org Fri Feb 28 15:59:10 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 28 Feb 2025 15:59:10 GMT Subject: RFR: 8315488: SA ciReplay support is no longer up-to-date with hotspot ciReplay support [v6] In-Reply-To: References: Message-ID: <9r7BTy43ljeNbk2n7D320P1DTJNMVD7hWfTfy3Pnx5Q=.c3b42603-6ea8-4d6b-a026-76caf7280889@github.com> On Fri, 28 Feb 2025 01:01:48 GMT, Chris Plummer wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove VM_STRUCTS macro arguments. > > src/jdk.hotspot.agent/doc/index.html line 43: > >> 41:

>> 42: >> 43: Compilation Replay
> > clhsdb.html also needs to be updated. Fixed. > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/CompilerThread.java line 48: > >> 46: private static synchronized void initialize(TypeDataBase db) throws WrongTypeException { >> 47: Type type = db.lookupType("CompilerThread"); >> 48: > > Line 47 above is no longer needed. Fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23782#discussion_r1975626044 PR Review Comment: https://git.openjdk.org/jdk/pull/23782#discussion_r1975630237 From coleenp at openjdk.org Fri Feb 28 16:03:00 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 28 Feb 2025 16:03:00 GMT Subject: RFR: 8315488: SA ciReplay support is no longer up-to-date with hotspot ciReplay support [v6] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 15:59:09 GMT, Coleen Phillimore wrote: >> This change removes the ci, c1 and c2 compiler code from the serviceability agent. The ciReplay functionality is supported inside the jvm and this duplicated functionality in SA had bit rotted so is removed. >> Tested with tier1-4. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Remove VM_STRUCTS macro arguments. The os_cpu VM_STRUCTS, VM_TYPES, etc have no declarations and should be removed, but I think I should stop here. ------------- PR Review: https://git.openjdk.org/jdk/pull/23782#pullrequestreview-2651165490 From duke at openjdk.org Fri Feb 28 16:04:31 2025 From: duke at openjdk.org (Marc Chevalier) Date: Fri, 28 Feb 2025 16:04:31 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v8] In-Reply-To: References: Message-ID: > Remove frem and drem macros nodes when the result is not used. These nodes have other outputs (like memory), which is not meaningful, but preventing them to be dropped so easily. This patch removes the useless frem/drem nodes, and by rewiring the inputs to the outputs. 
> > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: address comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23694/files - new: https://git.openjdk.org/jdk/pull/23694/files/c21da78d..bfd185d8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23694&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23694&range=06-07 Stats: 14 lines in 3 files changed: 0 ins; 10 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23694.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23694/head:pull/23694 PR: https://git.openjdk.org/jdk/pull/23694 From duke at openjdk.org Fri Feb 28 16:04:32 2025 From: duke at openjdk.org (Marc Chevalier) Date: Fri, 28 Feb 2025 16:04:32 GMT Subject: RFR: 8349523: Unused runtime calls to drem/frem should be removed [v7] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 15:47:14 GMT, Christian Hagedorn wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> more precise testing > > test/hotspot/jtreg/compiler/c2/irTests/ModDNodeTests.java line 151: > >> 149: // is that they exercise a slightly different reason why the node is being removed, >> 150: // and thus a different execution path. In unusedResultAfterLoopOpt1 the modulo is >> 151: // used in the traps of the parse predicate. In unusedResultAfterLoopOpt2, it is not. > > Suggestion: > > // used in the traps of the parse predicates. In unusedResultAfterLoopOpt2, it is not. Done. > test/hotspot/jtreg/compiler/c2/irTests/ModDNodeTests.java line 153: > >> 151: // used in the traps of the parse predicate. In unusedResultAfterLoopOpt2, it is not. >> 152: @Test >> 153: @IR(failOn = {"drem"}, phase = CompilePhase.BEFORE_MATCHING) > > This is not required now anymore since you do the matching on the node directly instead of the expanded call. Same for the other rules below. Done. And in the other file. 
> test/hotspot/jtreg/compiler/c2/irTests/ModFNodeTests.java line 151: > >> 149: // is that they exercise a slightly different reason why the node is being removed, >> 150: // and thus a different execution path. In unusedResultAfterLoopOpt1 the modulo is >> 151: // used in the traps of the parse predicate. In unusedResultAfterLoopOpt2, it is not. > > Suggestion: > > // used in the traps of the parse predicates. In unusedResultAfterLoopOpt2, it is not. Done. > test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 2596: > >> 2594: IR_NODE_MAPPINGS.put(irNodePlaceholder, new SinglePhaseRangeEntry(CompilePhase.BEFORE_MACRO_EXPANSION, regex, >> 2595: CompilePhase.BEFORE_STRINGOPTS, >> 2596: CompilePhase.BEFORE_MACRO_EXPANSION)); > > Indentation: > Suggestion: > > IR_NODE_MAPPINGS.put(irNodePlaceholder, new SinglePhaseRangeEntry(CompilePhase.BEFORE_MACRO_EXPANSION, regex, > CompilePhase.BEFORE_STRINGOPTS, > CompilePhase.BEFORE_MACRO_EXPANSION)); Done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23694#discussion_r1975647138 PR Review Comment: https://git.openjdk.org/jdk/pull/23694#discussion_r1975648163 PR Review Comment: https://git.openjdk.org/jdk/pull/23694#discussion_r1975647334 PR Review Comment: https://git.openjdk.org/jdk/pull/23694#discussion_r1975647706 From kvn at openjdk.org Fri Feb 28 16:41:11 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 28 Feb 2025 16:41:11 GMT Subject: RFR: 8315488: SA ciReplay support is no longer up-to-date with hotspot ciReplay support [v6] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 16:00:00 GMT, Coleen Phillimore wrote: > The os_cpu VM_STRUCTS, VM_TYPES, etc have no declarations and should be removed, but I think I should stop here. 
File another "starter" RFE. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23782#issuecomment-2691088168 From kvn at openjdk.org Fri Feb 28 16:47:59 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 28 Feb 2025 16:47:59 GMT Subject: RFR: 8315488: SA ciReplay support is no longer up-to-date with hotspot ciReplay support [v6] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 15:59:09 GMT, Coleen Phillimore wrote: >> This change removes the ci, c1 and c2 compiler code from the serviceability agent. The ciReplay functionality is supported inside the jvm and this duplicated functionality in SA had bit rotted so is removed. >> Tested with tier1-4. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Remove VM_STRUCTS macro arguments. Zero build is broken: src/hotspot/share/runtime/vmStructs.cpp:1321:46: error: 'COMPILER2' was not declared in this scope; did you mean 'NOT_COMPILER2'? 1321 | declare_preprocessor_constant("COMPILER2", COMPILER2) \ | ^~~~~~~~~ ------------- PR Comment: https://git.openjdk.org/jdk/pull/23782#issuecomment-2691102469 From kvn at openjdk.org Fri Feb 28 17:30:58 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 28 Feb 2025 17:30:58 GMT Subject: RFR: 8350893: Use generated names for hand generated opto runtime blobs In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 18:00:37 GMT, Andrew Dinn wrote: > The two special case opto runtime blobs that support uncommon trap and exception handling are currently being generated using hard wired blob names determined by port-specific code. They should employ the standard blob names generated from shared declarations in file stubDeclarations.hpp. src/hotspot/cpu/arm/runtime_arm.cpp line 210: > 208: // setup code generation tools > 209: // Measured 8/7/03 at 256 in 32bit debug build > 210: const char* name = OptoRuntime::stub_name(OptoStubId::uncommon_trap_id); Typo.
Should be `exception_id` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23829#discussion_r1975782763 From kxu at openjdk.org Fri Feb 28 18:00:14 2025 From: kxu at openjdk.org (Kangcheng Xu) Date: Fri, 28 Feb 2025 18:00:14 GMT Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value [v5] In-Reply-To: References: Message-ID: > [JDK-8347555](https://bugs.openjdk.org/browse/JDK-8347555) is a redo of [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495), which was [first merged](https://git.openjdk.org/jdk/pull/20754) and then backed out due to a regression. This patch redoes the feature and fixes the bit shift overflow problem. For more information please refer to the previous PR. > > When constantizing multiplications (possibly in the form of `lshift`s), the multiplier is promoted to long and later narrowed to int if needed. However, when an `lshift` operand is exactly `32`, overflowing an int, using long gives an unexpected result (i.e., `(1 << 32) = 1` and `(int) (1L << 32) = 0`). > > The following was implemented to address this issue. > > if (UseNewCode2) { > *multiplier = bt == T_INT > ? (jlong) (1 << con->get_int()) // loss of precision is expected for int as it overflows > : ((jlong) 1) << con->get_int(); > } else { > *multiplier = ((jlong) 1 << con->get_int()); > } > > > Two new bitshift overflow tests were added.
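The shift-count mismatch described in the 8347555 description above can be seen directly in Java, where an int shift uses only the low 5 bits of the shift count. This is a minimal sketch with hypothetical class and method names, not the C2 code from the patch; it only shows why a multiplier computed in long and then narrowed disagrees with the int shift it is meant to replace.

```java
public class ShiftMultiplier {
    // Multiplier implied by an int left shift. Java masks the shift count
    // to 5 bits for int, so a shift by 32 behaves like a shift by 0.
    static int intMultiplier(int shift) {
        return 1 << shift;
    }

    // Multiplier computed in long and narrowed afterwards. For shift >= 32
    // the single set bit lies above bit 31 and is lost when narrowing.
    static int longThenNarrow(int shift) {
        return (int) (1L << shift);
    }

    public static void main(String[] args) {
        int x = 7;
        System.out.println(x << 32);                // 7
        System.out.println(x * intMultiplier(32));  // 7
        System.out.println(x * longThenNarrow(32)); // 0
    }
}
```

For shift counts from 0 to 31 the two helpers agree, which is why the discrepancy only shows up at the overflow boundary the new tests target.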
Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: remove tri-conditionals ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23506/files - new: https://git.openjdk.org/jdk/pull/23506/files/f570024d..358bcbac Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23506&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23506&range=03-04 Stats: 9 lines in 1 file changed: 0 ins; 6 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23506.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23506/head:pull/23506 PR: https://git.openjdk.org/jdk/pull/23506 From kxu at openjdk.org Fri Feb 28 18:00:14 2025 From: kxu at openjdk.org (Kangcheng Xu) Date: Fri, 28 Feb 2025 18:00:14 GMT Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value [v4] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 09:52:06 GMT, Roland Westrelin wrote: >> Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision: >> >> update license header year > > src/hotspot/share/opto/addnode.cpp line 524: > >> 522: } >> 523: >> 524: lhs_multiplier = bt == T_INT > > Why isn't it: `java_shift_left(1, con->get_int(), bt)` ? You're absolutely right. Sorry, that was a very naive oversight. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23506#discussion_r1975820086 From coleenp at openjdk.org Fri Feb 28 18:10:26 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 28 Feb 2025 18:10:26 GMT Subject: RFR: 8315488: SA ciReplay support is no longer up-to-date with hotspot ciReplay support [v7] In-Reply-To: References: Message-ID: > This change removes the ci, c1 and c2 compiler code from the serviceability agent. The ciReplay functionality is supported inside the jvm and this duplicated functionality in SA had bit rotted so is removed. > Tested with tier1-4. 
Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Fix COMPILER2 preprocessor constant for SA. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23782/files - new: https://git.openjdk.org/jdk/pull/23782/files/aa80d2c6..38041d03 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23782&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23782&range=05-06 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23782.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23782/head:pull/23782 PR: https://git.openjdk.org/jdk/pull/23782 From duke at openjdk.org Fri Feb 28 18:11:18 2025 From: duke at openjdk.org (Abdelhak Zaaim) Date: Fri, 28 Feb 2025 18:11:18 GMT Subject: RFR: 8350940: RISC-V: remove unnecessary assert_different_registers in minmax_fp [v4] In-Reply-To: <4PGqc4-JerCL5wzS_4DJDCpiP0Tc6xN5s7HH2LuV9Ao=.5a2ca4a7-d660-4bdb-ac85-4e2d0f2a15de@github.com> References: <8fxIj9ChMAOETSVV62zYzhRZRfCmEDRtvMc88hncB5E=.4c39c81b-2004-4d72-8aa2-ff63be3996a4@github.com> <4PGqc4-JerCL5wzS_4DJDCpiP0Tc6xN5s7HH2LuV9Ao=.5a2ca4a7-d660-4bdb-ac85-4e2d0f2a15de@github.com> Message-ID: <8EidemwiVO6H5vumvwMUNtwACfS0tlmFf__hES7mMmY=.1e9b99fe-15bb-46a0-b239-0dd777a96fd9@github.com> On Fri, 28 Feb 2025 12:56:10 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this simple change? >> Seems to me it's not necessary to assert_different_registers between dst/src1/src2 in minmax_fp. >> >> Thanks > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > keep cr/t1 in effect Marked as reviewed by abdelhak-zaaim at github.com (no known OpenJDK username). 
------------- PR Review: https://git.openjdk.org/jdk/pull/23842#pullrequestreview-2651479189 From duke at openjdk.org Fri Feb 28 18:57:53 2025 From: duke at openjdk.org (Vivek Deshpande) Date: Fri, 28 Feb 2025 18:57:53 GMT Subject: RFR: 8350609: cleanup unknown unwind opcode (0xB) for windows In-Reply-To: References: Message-ID: <4BUZBMlLC1zPnsieDNsbkji0xcmHLd_VHRN3bEhpJ3A=.5d2b315c-20de-4770-93e1-846cd733cde0@github.com> On Thu, 20 Feb 2025 03:58:17 GMT, Dhamoder Nalla wrote: > This PR is to clean up unknown unwind opcodes (0xB) in Windows intrinsic functions introduced in commit https://github.com/openjdk/jdk17u-dev/commit/9f05c411e6d6bdf612cf0cf8b9fe4ca9ecde50d1#diff-a024df6bcd94607260545e647922261703a652dee1afadb1fa758f6e74a568d1 > > ![image](https://github.com/user-attachments/assets/5b295365-ba8e-4fd6-8b8b-f7243f80a496) > > According to the Windows unwind opcodes outlined at https://learn.microsoft.com/en-us/cpp/build/exception-handling-x64?view=msvc-170#unwind-operation-code, the opcode 0xB (1011) is not a valid opcode, as the valid opcodes range from 0 to 10. Did you get to test these functions for correctness, possibly using jtreg or some other tests? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23707#issuecomment-2691318385 From kvn at openjdk.org Fri Feb 28 19:25:56 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 28 Feb 2025 19:25:56 GMT Subject: RFR: 8315488: SA ciReplay support is no longer up-to-date with hotspot ciReplay support [v7] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 18:10:26 GMT, Coleen Phillimore wrote: >> This change removes the ci, c1 and c2 compiler code from the serviceability agent. The ciReplay functionality is supported inside the jvm and this duplicated functionality in SA had bit rotted so is removed. >> Tested with tier1-4. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix COMPILER2 preprocessor constant for SA.
Very very nice cleanup! ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23782#pullrequestreview-2651609942 From cjplummer at openjdk.org Fri Feb 28 19:42:54 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Fri, 28 Feb 2025 19:42:54 GMT Subject: RFR: 8315488: SA ciReplay support is no longer up-to-date with hotspot ciReplay support [v7] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 18:10:26 GMT, Coleen Phillimore wrote: >> This change removes the ci, c1 and c2 compiler code from the serviceability agent. The ciReplay functionality is supported inside the jvm and this duplicated functionality in SA had bit rotted so is removed. >> Tested with tier1-4. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix COMPILER2 preprocessor constant for SA. I wonder if you shouldn't use a new CR for this PR to make it clear to anyone that skims over the title of the PR/CR that (broken) functionality is being removed, not fixed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23782#issuecomment-2691416497 From kvn at openjdk.org Fri Feb 28 20:22:55 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 28 Feb 2025 20:22:55 GMT Subject: RFR: 8315488: SA ciReplay support is no longer up-to-date with hotspot ciReplay support [v7] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 19:40:11 GMT, Chris Plummer wrote: > I wonder if you shouldn't use a new CR for this PR to make it clear to anyone that skims over the title of the PR/CR that (broken) functionality is being removed, not fixed. 
Just change title: "Remove outdated and unused ciReplay support from SA" ------------- PR Comment: https://git.openjdk.org/jdk/pull/23782#issuecomment-2691478686 From coleenp at openjdk.org Fri Feb 28 20:34:01 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 28 Feb 2025 20:34:01 GMT Subject: RFR: 8315488: Remove outdated and unused ciReplay support from SA [v7] In-Reply-To: References: Message-ID: On Fri, 28 Feb 2025 20:19:27 GMT, Vladimir Kozlov wrote: >> I wonder if you shouldn't use a new CR for this PR to make it clear to anyone that skims over the title of the PR/CR that (broken) functionality is being removed, not fixed. > Just change title: "Remove outdated and unused ciReplay support from SA" Yes, I'll do this. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23782#issuecomment-2691495918 From coleenp at openjdk.org Fri Feb 28 20:37:57 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 28 Feb 2025 20:37:57 GMT Subject: RFR: 8315488: Remove outdated and unused ciReplay support from SA [v7] In-Reply-To: References: Message-ID: <2vNU1BcX8FSq7s_iWc6oLtaQn14mK17MNGue6p_EWS4=.f0892f74-4abb-4313-88ad-df1d4e7853b2@github.com> On Fri, 28 Feb 2025 18:10:26 GMT, Coleen Phillimore wrote: >> This change removes the ci, c1 and c2 compiler code from the serviceability agent. The ciReplay functionality is supported inside the jvm and this duplicated functionality in SA had bit rotted so is removed. >> Tested with tier1-4. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix COMPILER2 preprocessor constant for SA. Thank you for reviewing Vladimir. I'll file the other issue but not a "starter". 
New people should do more interesting things :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23782#issuecomment-2691502362 From robilad at openjdk.org Fri Feb 28 21:06:59 2025 From: robilad at openjdk.org (Dalibor Topic) Date: Fri, 28 Feb 2025 21:06:59 GMT Subject: RFR: 8349180: Remove redundant initialization in ciField constructor In-Reply-To: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> References: <6UgSS5QmRnPJ_R7DmpXyKCzL_u2t_vdfOC1IEjRgIa8=.3950376e-a5a3-4d2c-b610-8f92dd35db09@github.com> Message-ID: On Fri, 14 Feb 2025 15:00:06 GMT, Marc Chevalier wrote: > In `ciField`'s ctor, `_name` is initialized twice. I think we can indeed apply the suggested fix and remove the second assignment. `_name` is set correctly the first time (and without the useless cast), and not modified in between. > > Thanks, > Marc Please send me an e-mail at Dalibor.Topic at oracle.com so that I can verify your account's OCA. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23637#issuecomment-2691543250 From cjplummer at openjdk.org Fri Feb 28 22:14:54 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Fri, 28 Feb 2025 22:14:54 GMT Subject: RFR: 8315488: Remove outdated and unused ciReplay support from SA [v7] In-Reply-To: References: Message-ID: <0sRMry6CP4Bis6-ALc4UCQzBLOP4sr0Ejcd7hoTolQw=.81937e97-ceb6-43fb-9449-ab49bb307853@github.com> On Fri, 28 Feb 2025 18:10:26 GMT, Coleen Phillimore wrote: >> This change removes the ci, c1 and c2 compiler code from the serviceability agent. The ciReplay functionality is supported inside the jvm and this duplicated functionality in SA had bit rotted so is removed. >> Tested with tier1-4. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix COMPILER2 preprocessor constant for SA. Marked as reviewed by cjplummer (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/23782#pullrequestreview-2651860962