From duke at openjdk.java.net Wed Jun 1 00:49:38 2022 From: duke at openjdk.java.net (Cesar Soares) Date: Wed, 1 Jun 2022 00:49:38 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries In-Reply-To: References: Message-ID: On Thu, 19 May 2022 06:37:28 GMT, Yuta Sato wrote: > When failing to load hsdis(Hot Spot Disassembler) library (because there is no library or hsdis.so is old and so on), > there is no warning message (only can see info level messages if put -Xlog:os=info). > This should show a warning message to tell the user that you failed to load libraries for hsdis. > So I put a warning message to notify this. > > e.g. > ` FWIW it seems to me that `load_library` is indeed the best place to print the message. src/hotspot/share/compiler/disassembler.cpp line 841: > 839: os::dll_lookup(_library, decode_instructions_virtual_name)); > 840: } else { > 841: log_warning(os)("Try to load hsdis library failed"); NIT: I suggest adding the name of the file that was searched for and rewording this message a bit: "Failed to load hsdis library: hsdis-x86.so" ------------- PR: https://git.openjdk.java.net/jdk/pull/8782 From pli at openjdk.java.net Wed Jun 1 02:16:38 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Wed, 1 Jun 2022 02:16:38 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 11:09:09 GMT, Fei Gao wrote: >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> >> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. 
Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) >> >> when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: >> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) >> >> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: >> >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ? 
3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Rewrite the scalar calculation to avoid inline > > Change-Id: I5959d035278097de26ab3dfe6f667d6f7476c723 > - Merge branch 'master' into fg8283307 > > Change-Id: Id3ec8594da49fb4e6c6dcad888bcb1dfc0aac303 > - Remove related comments in some test files > > Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8 > - Merge branch 'master' into fg8283307 > > Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a > - 8283307: Vectorize unsigned shift right on signed subword types > > ``` > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > ``` > In C2's SLP, vectorization of unsigned shift right on signed > subword types (byte/short) like the case above is intentionally > disabled[1]. Because the vector unsigned shift on signed > subword types behaves differently from the Java spec. It's > worthy to vectorize more cases in quite low cost. Also, > unsigned shift right on signed subword is not uncommon and we > may find similar cases in Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > > Short: > | <- 16 bits -> | <- 16 bits -> | > | 1 1 1 ... 
1 1 | data | > > when the shift amount is a constant not greater than the number > of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be > transformed into a signed shift and hence becomes vectorizable. > Here is the transformation: > > For T_SHORT (shift <= 16): > src RShiftCntV shift src RShiftCntV shift > \ / ==> \ / > URShiftVS RShiftVS > > This patch does the transformation in SuperWord::implemented() and > SuperWord::output(). It helps vectorize the short cases above. We > can handle unsigned right shift on byte type in a similar way. The > generated assembly code for one iteration on aarch64 is like: > ``` > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > ``` > > Here is the performance data for micro-benchmark before and after > this patch on both AArch64 and x64 machines. We can observe about > ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ? 
0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161 Marked as reviewed by pli (Committer). ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From pli at openjdk.java.net Wed Jun 1 02:16:40 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Wed, 1 Jun 2022 02:16:40 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: Message-ID: On Tue, 31 May 2022 09:12:10 GMT, Pengfei Li wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: >> >> - Rewrite the scalar calculation to avoid inline >> >> Change-Id: I5959d035278097de26ab3dfe6f667d6f7476c723 >> - Merge branch 'master' into fg8283307 >> >> Change-Id: Id3ec8594da49fb4e6c6dcad888bcb1dfc0aac303 >> - Remove related comments in some test files >> >> Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8 >> - Merge branch 'master' into fg8283307 >> >> Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a >> - 8283307: Vectorize unsigned shift right on signed subword types >> >> ``` >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> ``` >> In C2's SLP, vectorization of unsigned shift right on signed >> subword types (byte/short) like the case above is intentionally >> disabled[1]. Because the vector unsigned shift on signed >> subword types behaves differently from the Java spec. It's >> worthy to vectorize more cases in quite low cost. 
Also, >> unsigned shift right on signed subword is not uncommon and we >> may find similar cases in Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> >> Short: >> | <- 16 bits -> | <- 16 bits -> | >> | 1 1 1 ... 1 1 | data | >> >> when the shift amount is a constant not greater than the number >> of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be >> transformed into a signed shift and hence becomes vectorizable. >> Here is the transformation: >> >> For T_SHORT (shift <= 16): >> src RShiftCntV shift src RShiftCntV shift >> \ / ==> \ / >> URShiftVS RShiftVS >> >> This patch does the transformation in SuperWord::implemented() and >> SuperWord::output(). It helps vectorize the short cases above. We >> can handle unsigned right shift on byte type in a similar way. The >> generated assembly code for one iteration on aarch64 is like: >> ``` >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> ``` >> >> Here is the performance data for micro-benchmark before and after >> this patch on both AArch64 and x64 machines. We can observe about >> ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ? 
3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ >> >> Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161 > > test/hotspot/jtreg/compiler/c2/irTests/TestVectorizeURShiftSubword.java line 36: > >> 34: * @key randomness >> 35: * @summary Auto-vectorization enhancement for unsigned shift right on signed subword types >> 36: * @requires os.arch=="amd64" | os.arch=="x86_64" | os.arch=="aarch64" > > This IR test for vectorizable check looks good on AArch64. But AFAIK, some operations cannot be vectorized on old x86 CPUs with AVX=1. Could you add something like `(os.simpleArch == "x64" & vm.cpu.features ~= ".*avx2.*")` to check the CPU feature? @merykitty @sviswa7 Could you help confirm if byte/short shift operations are vectorizable with all AVX versions of x86? ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From duke at openjdk.java.net Wed Jun 1 05:18:28 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 1 Jun 2022 05:18:28 GMT Subject: RFR: 8285868: x86 intrinsics for floating point methods isNaN, isFinite and isInfinite [v12] In-Reply-To: References: Message-ID: > We develop optimized x86 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `IsInfinite()` for Float and Double classes. JMH benchmarks show upto `~70% `improvement using` vfpclasss(s/d)` instructions. 
>
>
> Benchmark (ns/op)                  Baseline   Intrinsic(vfpclasss/d)   Speedup(%)
> FloatClassCheck.testIsFinite       0.562      0.406                    28%
> FloatClassCheck.testIsInfinite     0.815      0.383                    53%
> FloatClassCheck.testIsNaN          0.63       0.382                    39%
> DoubleClassCheck.testIsFinite      0.565      0.409                    28%
> DoubleClassCheck.testIsInfinite    0.812      0.375                    54%
> DoubleClassCheck.testIsNaN         0.631      0.38                     40%
> FPComparison.isFiniteDouble        332.638    272.577                  18%
> FPComparison.isFiniteFloat         413.217    331.825                  20%
> FPComparison.isInfiniteDouble      874.897    240.632                  72%
> FPComparison.isInfiniteFloat       872.279    321.269                  63%
> FPComparison.isNanDouble           286.566    240.36                   16%
> FPComparison.isNanFloat            346.123    316.923                  8%

Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits:

 - Merge branch 'master' of https://git.openjdk.java.net/jdk into float
 - merge latest master
 - update jmh bechmarks and jtreg tests
 - update vmstructs
 - Merge branch 'master' of https://git.openjdk.java.net/jdk into float
 - Remove support for non vfpclasss/d based intrinsics
 - Merge branch 'master' of https://git.openjdk.java.net/jdk into float
 - add comment for vfpclasss/d for isFinite()
 - Merge branch 'master' of https://git.openjdk.java.net/jdk into float
 - zero out the upper bits not written by setb
 - ... 
and 6 more: https://git.openjdk.java.net/jdk/compare/3deb58a8...497c9741 ------------- Changes: https://git.openjdk.java.net/jdk/pull/8459/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=11 Stats: 783 lines in 19 files changed: 783 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8459.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8459/head:pull/8459 PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed Jun 1 06:18:33 2022 From: duke at openjdk.java.net (Yuta Sato) Date: Wed, 1 Jun 2022 06:18:33 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries [v2] In-Reply-To: References: Message-ID: <6MK3dBH68YcwTttGFvH36_dgUuY9pv4MDn75EKj8ygE=.fae4cbad-f0b6-4bf8-a0cd-acc3bf2f8fb6@github.com> > When failing to load hsdis(Hot Spot Disassembler) library (because there is no library or hsdis.so is old and so on), > there is no warning message (only can see info level messages if put -Xlog:os=info). > This should show a warning message to tell the user that you failed to load libraries for hsdis. > So I put a warning message to notify this. > > e.g. 
> ` Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: add the name of file that was searched to warning message ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8782/files - new: https://git.openjdk.java.net/jdk/pull/8782/files/090985bc..08bc492a Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8782&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8782&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8782.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8782/head:pull/8782 PR: https://git.openjdk.java.net/jdk/pull/8782 From duke at openjdk.java.net Wed Jun 1 06:18:36 2022 From: duke at openjdk.java.net (Yuta Sato) Date: Wed, 1 Jun 2022 06:18:36 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries [v2] In-Reply-To: References: Message-ID: On Wed, 1 Jun 2022 00:46:03 GMT, Cesar Soares wrote: >> Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: >> >> add the name of file that was searched to warning message > > src/hotspot/share/compiler/disassembler.cpp line 841: > >> 839: os::dll_lookup(_library, decode_instructions_virtual_name)); >> 840: } else { >> 841: log_warning(os)("Try to load hsdis library failed"); > > NIT: I suggest adding the name of the file that was searched for and rewording this message a bit: "Failed to load hsdis library: hsdis-x86.so" @JohnTortugo Thank you for your advice !! I added the name of the file to the warning message. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8782 From duke at openjdk.java.net Wed Jun 1 06:42:20 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 1 Jun 2022 06:42:20 GMT Subject: RFR: 8285868: x86 intrinsics for floating point methods isNaN, isFinite and isInfinite [v13] In-Reply-To: References: Message-ID: > We develop optimized x86 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `IsInfinite()` for Float and Double classes. JMH benchmarks show upto `~70% `improvement using` vfpclasss(s/d)` instructions. > > > Benchmark (ns/op) Baseline Intrinsic(vfpclasss/d) Speedup(%) > FloatClassCheck.testIsFinite 0.562 0.406 28% > FloatClassCheck.testIsInfinite 0.815 0.383 53% > FloatClassCheck.testIsNaN 0.63 0.382 39% > DoubleClassCheck.testIsFinite 0.565 0.409 28% > DoubleClassCheck.testIsInfinite 0.812 0.375 54% > DoubleClassCheck.testIsNaN 0.631 0.38 40% > FPComparison.isFiniteDouble 332.638 272.577 18% > FPComparison.isFiniteFloat 413.217 331.825 20% > FPComparison.isInfiniteDouble 874.897 240.632 72% > FPComparison.isInfiniteFloat 872.279 321.269 63% > FPComparison.isNanDouble 286.566 240.36 16% > FPComparison.isNanFloat 346.123 316.923 8% Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: Support only IsInfinite with vfpclasss/d instruction ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8459/files - new: https://git.openjdk.java.net/jdk/pull/8459/files/497c9741..8244f25d Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=12 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=11-12 Stats: 241 lines in 14 files changed: 0 ins; 239 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8459.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8459/head:pull/8459 PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed Jun 1 07:07:38 2022 From: duke at 
openjdk.java.net (Srinivas Vamsi Parasa)
Date: Wed, 1 Jun 2022 07:07:38 GMT
Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v9]
In-Reply-To: <3nPsl4E5pwB2poYUil8N39ZnfxLlKP4KE8JYumTb0Mc=.8de322fc-0074-416f-a074-6b1d2c712d9b@github.com>
References: <3nPsl4E5pwB2poYUil8N39ZnfxLlKP4KE8JYumTb0Mc=.8de322fc-0074-416f-a074-6b1d2c712d9b@github.com>
Message-ID:

On Tue, 24 May 2022 22:00:20 GMT, Vladimir Kozlov wrote:

>> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits:
>>
>> - Remove support for non vfpclasss/d based intrinsics
>> - Merge branch 'master' of https://git.openjdk.java.net/jdk into float
>> - add comment for vfpclasss/d for isFinite()
>> - Merge branch 'master' of https://git.openjdk.java.net/jdk into float
>> - zero out the upper bits not written by setb
>> - use 0x1 to be simpler
>> - remove the redundant temp register
>> - Split the macros using predicate
>> - update jmh tests
>> - Merge branch 'master' into float
>> - ... and 1 more: https://git.openjdk.java.net/jdk/compare/c1db70d8...70bba0fe
>
> I assume `Baseline` data includes #8525 changes. Right?

Hello Vladimir (@vnkozlov),

After reviewing the JMH performance data, we observed that only the `isInfinite()` method benefits from the intrinsics (`vfpclassss/d` instruction). This PR has therefore been updated to support intrinsics for **only** the `isInfinite()` method. Please see the performance data below for all three methods. Please let me know if anything else is needed.

Benchmark (ns/op)                       Baseline   Intrinsic      Speedup (%)
                                                   (vfpclassss)
-------------------------------------------------------------------
FloatClassCheck.testIsFiniteBranch      1.558      1.535            1%
FloatClassCheck.testIsFiniteCMov        0.335      0.428          -28%
FloatClassCheck.testIsFiniteStore       0.417      0.253           39%
---------------------------------------------------------------------
FloatClassCheck.testIsInfiniteBranch    1.294      1.046           19%
FloatClassCheck.testIsInfiniteCMov      0.823      0.351           57%
FloatClassCheck.testIsInfiniteStore     0.748      0.234           69%
----------------------------------------------------------------------
FloatClassCheck.testIsNaNBranch         1          1.147          -15%
FloatClassCheck.testIsNaNCMov           0.297      0.352          -19%
FloatClassCheck.testIsNaNStore          0.362      0.234           35%

*********************************************************************

Benchmark (ns/op)                       Baseline   Intrinsic      Speedup (%)
                                                   (vfpclasssd)
------------------------------------------------------------------------
DoubleClassCheck.testIsFiniteBranch     1.555      1.522            2%
DoubleClassCheck.testIsFiniteCMov       0.325      0.444          -37%
DoubleClassCheck.testIsFiniteStore      0.322      0.266           17%
-----------------------------------------------------------------------
DoubleClassCheck.testIsInfiniteBranch   1.261      1.094           13%
DoubleClassCheck.testIsInfiniteCMov     0.821      0.373           55%
DoubleClassCheck.testIsInfiniteStore    0.858      0.235           73%
-----------------------------------------------------------------------
DoubleClassCheck.testIsNaNBranch        1          1.127          -13%
DoubleClassCheck.testIsNaNCMov          0.278      0.373          -34%
DoubleClassCheck.testIsNaNStore         0.283      0.235           17%

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From dlong at openjdk.java.net Wed Jun 1 07:40:33 2022
From: dlong at openjdk.java.net (Dean Long)
Date: Wed, 1 Jun 2022 07:40:33 GMT
Subject: RFR: 8287396 LIR_Opr::vreg_number() and data() can return negative number [v2]
In-Reply-To: <8TqlrqeRZn0ptrN5Qb-cq_0mh3bskNwMVgSi94TRclI=.44c3f2ac-de40-4127-bffb-0ed5a3d74f5f@github.com>
References: 
<8TqlrqeRZn0ptrN5Qb-cq_0mh3bskNwMVgSi94TRclI=.44c3f2ac-de40-4127-bffb-0ed5a3d74f5f@github.com> Message-ID: On Sat, 28 May 2022 02:21:35 GMT, Dean Long wrote: >> This PR does two things: >> - reverts the incorrect change to non_data_bits that included pointer_bits >> - treats the data() as an unsigned int to prevent a high bit being treated as a negative number > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > set vreg_max to a more reasonable limit (10000) Thanks Christian. My first performance run showed some 1-2% regressions on a microbenchmark and a startup benchmark, so I ran it again and the regression disappeared, so I think it was noise. There was no regression for larger benchmarks like SPECjvm2008. ------------- PR: https://git.openjdk.java.net/jdk/pull/8912 From duke at openjdk.java.net Wed Jun 1 08:40:40 2022 From: duke at openjdk.java.net (Yuta Sato) Date: Wed, 1 Jun 2022 08:40:40 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries [v3] In-Reply-To: References: Message-ID: <1xxFwpjB1CEfkCm2skGDVMMyZi2La0qCRIwNqz-pz3s=.d2f6ec3a-9d71-47b6-a101-74278ba602f9@github.com> > When failing to load hsdis(Hot Spot Disassembler) library (because there is no library or hsdis.so is old and so on), > there is no warning message (only can see info level messages if put -Xlog:os=info). > This should show a warning message to tell the user that you failed to load libraries for hsdis. > So I put a warning message to notify this. > > e.g. > ` Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: Revert "add the name of file that was searched to warning message" This reverts commit 08bc492af45bf6fef82df0164f93dd4ecf321532. 

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8782/files
  - new: https://git.openjdk.java.net/jdk/pull/8782/files/08bc492a..0627c96c

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8782&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8782&range=01-02

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8782.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8782/head:pull/8782

PR: https://git.openjdk.java.net/jdk/pull/8782

From duke at openjdk.java.net Wed Jun 1 08:58:27 2022
From: duke at openjdk.java.net (Yuta Sato)
Date: Wed, 1 Jun 2022 08:58:27 GMT
Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries [v3]
In-Reply-To:
References:
Message-ID:

On Wed, 1 Jun 2022 06:15:37 GMT, Yuta Sato wrote:

>> src/hotspot/share/compiler/disassembler.cpp line 841:
>>
>>> 839: os::dll_lookup(_library, decode_instructions_virtual_name));
>>> 840: } else {
>>> 841: log_warning(os)("Try to load hsdis library failed");
>>
>> NIT: I suggest adding the name of the file that was searched for and rewording this message a bit: "Failed to load hsdis library: hsdis-x86.so"
>
> @JohnTortugo
> Thank you for your advice !!
> I added the name of the file to the warning message.

On reflection, it might be better not to add the name of the file to this warning. Looking at the code again, `Disassembler::load_library` tries all patterns of the hsdis library, as I commented here (https://github.com/openjdk/jdk/pull/8782#issuecomment-1132489576). Because of this, the warning should tell the user that loading failed for all patterns of the hsdis library, not for one particular file. So I reverted my last commit.
------------- PR: https://git.openjdk.java.net/jdk/pull/8782 From fgao at openjdk.java.net Wed Jun 1 08:59:36 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Wed, 1 Jun 2022 08:59:36 GMT Subject: RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3] In-Reply-To: References: Message-ID: <7yWcjTQ2AndkFmKAc3Agv90AOLyShiPOvj6uY_67CPI=.3e5ce148-9c08-443d-a981-ffd2e29667e1@github.com> On Thu, 26 May 2022 06:18:33 GMT, Fei Gao wrote: >> Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP. Here is an example: >> >> short[] addShort(short[] a, short[] b, short[] c) { >> for (int i = 0; i < SIZE; i++) { >> b[i] = (short) (a[i] + 8); // line A >> sres[i] = (short) (b[i] + c[i]); // line B >> } >> } >> >> However, similar cases of int/float/double/long/char type can be vectorized successfully. >> >> The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type first before it can work as one of source operands of addition on *line B*. The demotion is done by left-shifting 16 bits then right-shifting 16 bits. The ideal graph for the process is showed like below. >> ![image](https://user-images.githubusercontent.com/39403138/160074255-c751f84b-6511-4b56-927b-53fb512cf51b.png) >> >> In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it with int type rather than real short type. The int-type opearation RShiftI here can't be vectorized together with other short-type operations, like AddI(line B). The reason for byte loop cases is the same. 
Similar loop cases of char type could be vectorized because its demotion from int to char is done by `and` with mask rather than `lshift_rshift`. >> >> Therefore, we try to remove the patterns like `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing it in the mid-end by i-GVN is more reasonable. >> >> What we do in the mid-end is eliminating the sign extension before some subword integer operations like: >> >> >> int x, y; >> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 >> >> to >> >> short s = (short) (x OP y); >> >> >> In the patch, assuming that `x` can be any int number, we need guarantee that the optimization doesn't have any impact on result. Not all arithmetic logic OPs meet the requirements. For example, assuming that `Imm` equals `16`, `x` equals `131068`, >> `y` equals `50` and `OP` is division`/`, `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get different result with or without demotion before OP, because the upper 16 bits of division may have influence on the lower 16 bits of result, which can't be optimized. All optimizable opcodes are listed in StoreNode::no_need_sign_extension(), whose upper 16 bits of src operands don't influence the lower 16 bits of result for short >> type and upper 24 bits of src operand don't influence the lower 8 bits of dst operand for byte. >> >> After the patch, the short loop case above can be vectorized as: >> >> movi v18.8h, #0x8 >> ... 
>> ldr q16, [x14, #32] // vector load a[i] >> // vector add, a[i] + 8, no promotion or demotion >> add v17.8h, v16.8h, v18.8h >> str q17, [x6, #32] // vector store a[i] + 8, b[i] >> ldr q17, [x0, #32] // vector load c[i] >> // vector add, a[i] + c[i], no promotion or demotion >> add v16.8h, v17.8h, v16.8h >> // vector add, a[i] + c[i] + 8, no promotion or demotion >> add v16.8h, v16.8h, v18.8h >> str q16, [x11, #32] //vector store sres[i] >> ... >> >> >> The patch works for byte cases as well. >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~83% improvement with this patch. >> >> on AArch64: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 401.521 ? 0.033 ns/op >> addS 523 avgt 15 401.512 ? 0.021 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 68.444 ? 0.318 ns/op >> addS 523 avgt 15 69.847 ? 0.043 ns/op >> >> on x86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 454.102 ? 36.180 ns/op >> addS 523 avgt 15 432.245 ? 22.640 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 75.812 ? 5.063 ns/op >> addS 523 avgt 15 72.839 ? 10.109 ns/op >> >> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 >> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 >> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 >> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into fg8282470 > > Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b > - Merge branch 'master' into fg8282470 > > Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7 > - 8282470: Eliminate useless sign extension before some subword integer operations > > Some loop cases of subword types, including byte and > short, can't be vectorized by C2's SLP. Here is an example: > ``` > short[] addShort(short[] a, short[] b, short[] c) { > for (int i = 0; i < SIZE; i++) { > b[i] = (short) (a[i] + 8); // *line A* > sres[i] = (short) (b[i] + c[i]); // *line B* > } > } > ``` > However, similar cases of int/float/double/long/char type can > be vectorized successfully. > > The reason why SLP can't vectorize the short case above is > that, as illustrated here[1], the result of the scalar add > operation on *line A* has been promoted to int type. It needs > to be narrowed to short type first before it can work as one > of source operands of addition on *line B*. The demotion is > done by left-shifting 16 bits then right-shifting 16 bits. > The ideal graph for the process is showed like below. > > LoadS a[i] 8 > \ / > AddI (line A) > / \ > StoreC b[i] Lshift 16bits > \ > RShiftI 16 bits LoadS c[i] > \ / > AddI (line B) > \ > StoreC sres[i] > > In SLP, for most short-type cases, we can determine the precise > type of the scalar int-type operation and finally execute it > with short-type vector operations[2], except rshift opcode and > abs in some situations[3]. But in this case, the source operand > of RShiftI is from LShiftI rather than from any LoadS[4], so we > can't determine its real type and conservatively assign it with > int type rather than real short type. The int-type opearation > RShiftI here can't be vectorized together with other short-type > operations, like AddI(line B). 
The reason for byte loop cases > is the same. Similar loop cases of char type could be > vectorized because its demotion from int to char is done by > `and` with mask rather than `lshift_rshift`. > > Therefore, we try to remove the patterns like > `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short > cases, to vectorize more scenarios. Optimizing it in the > mid-end by i-GVN is more reasonable. > > What we do in the mid-end is eliminating the sign extension > before some subword integer operations like: > > ``` > int x, y; > short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 > ``` > to > ``` > short s = (short) (x OP y); > ``` > > In the patch, assuming that `x` can be any int number, we need > guarantee that the optimization doesn't have any impact on > result. Not all arithmetic logic OPs meet the requirements. For > example, assuming that `Imm` equals `16`, `x` equals `131068`, > `y` equals `50` and `OP` is division`/`, > `short s = (short) (((131068 << 16) >> 16) / 50)` is not > equal to `short s = (short) (131068 / 50)`. When OP is division, > we may get different result with or without demotion > before OP, because the upper 16 bits of division may have > influence on the lower 16 bits of result, which can't be > optimized. All optimizable opcodes are listed in > StoreNode::no_need_sign_extension(), whose upper 16 bits of src > operands don't influence the lower 16 bits of result for short > type and upper 24 bits of src operand don't influence the lower > 8 bits of dst operand for byte. > > After the patch, the short loop case above can be vectorized as: > ``` > movi v18.8h, #0x8 > ... 
> ldr q16, [x14, #32] // vector load a[i]
> // vector add, a[i] + 8, no promotion or demotion
> add v17.8h, v16.8h, v18.8h
> str q17, [x6, #32] // vector store a[i] + 8, b[i]
> ldr q17, [x0, #32] // vector load c[i]
> // vector add, a[i] + c[i], no promotion or demotion
> add v16.8h, v17.8h, v16.8h
> // vector add, a[i] + c[i] + 8, no promotion or demotion
> add v16.8h, v16.8h, v18.8h
> str q16, [x11, #32] // vector store sres[i]
> ...
> ```
>
> The patch works for byte cases as well.
>
> Here is the performance data for the micro-benchmark before
> and after this patch on both AArch64 and x64 machines.
> We observe about an 83% improvement with this patch.
>
> on AArch64:
> Before the patch:
> Benchmark  (length)  Mode  Cnt    Score    Error  Units
> addB            523  avgt   15  401.521 ±  0.033  ns/op
> addS            523  avgt   15  401.512 ±  0.021  ns/op
>
> After the patch:
> Benchmark  (length)  Mode  Cnt    Score    Error  Units
> addB            523  avgt   15   68.444 ±  0.318  ns/op
> addS            523  avgt   15   69.847 ±  0.043  ns/op
>
> on x86:
> Before the patch:
> Benchmark  (length)  Mode  Cnt    Score    Error  Units
> addB            523  avgt   15  454.102 ± 36.180  ns/op
> addS            523  avgt   15  432.245 ± 22.640  ns/op
>
> After the patch:
> Benchmark  (length)  Mode  Cnt    Score    Error  Units
> addB            523  avgt   15   75.812 ±  5.063  ns/op
> addS            523  avgt   15   72.839 ± 10.109  ns/op
>
> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241
> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206
> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249
> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251
>
> Change-Id: I92ce42b550ef057964a3b58716436735275d8d31

> This patch can auto-vectorize `testShort1`, but fails to do so with `testShort2`.
>
> ```java
> public static void testShort1(short[] a, short[] b, short[] c, short[] d) {
>     for (int i = 0; i < a.length; i++) {
>         b[i] = (short)(a[i] + 1);
>         d[i] = (short)(b[i] + c[i]);
>     }
> }
>
> public static void testShort2(short[] a, short[] b, short[] c, short[] d) {
>     for (int i = 0; i < a.length; i++) {
>         b[i] = (short)(a[i] + 1);
>         d[i] = (short)(b[i] + c[i] + 1);
>     }
> }
> ```
>
> So may I ask if it's possible to also auto-vectorize `testShort2`? Thanks.

Thanks for your review, @DamonFool.

Yes. We can also vectorize `testShort2` by searching deeper through the inputs to match our target pattern `(RShiftI _ (LShiftI _ valIn1 conIL ) conIR)`.
![image](https://user-images.githubusercontent.com/39403138/171365978-96994ae8-4020-4111-a03d-dea6859b4db5.png)

I implemented it in https://github.com/openjdk/jdk/pull/8968, which vectorizes `testShort2` as expected. However, as we can see, supporting more scenarios requires a deeper search loop, which increases complexity. Maybe C2 assumes that all transformations in GVN are lightweight, which is why it runs GVN many times. If we support scenarios like `testShort2`, the transformation may become somewhat heavy.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7954

From chagedorn at openjdk.java.net Wed Jun 1 09:23:43 2022
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Wed, 1 Jun 2022 09:23:43 GMT
Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2
In-Reply-To:
References:
Message-ID: <7vFZ9ccGv7dGFqSzNw-3OA5SOML3kKH_wIyd-xOPSzE=.910106b9-e0b3-4b74-ab60-4b5bd74d5427@github.com>

On Tue, 31 May 2022 23:29:56 GMT, Jie Fu wrote:

> Shall we make a jtreg test for this fix? Thanks.

That would be helpful. @sviswa7 I've attached a simpler reproducer, extracted from the full fuzzer test, to the JBS bug.
------------- PR: https://git.openjdk.java.net/jdk/pull/8961 From chagedorn at openjdk.java.net Wed Jun 1 09:27:34 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Wed, 1 Jun 2022 09:27:34 GMT Subject: RFR: 8287396 LIR_Opr::vreg_number() and data() can return negative number [v2] In-Reply-To: <8TqlrqeRZn0ptrN5Qb-cq_0mh3bskNwMVgSi94TRclI=.44c3f2ac-de40-4127-bffb-0ed5a3d74f5f@github.com> References: <8TqlrqeRZn0ptrN5Qb-cq_0mh3bskNwMVgSi94TRclI=.44c3f2ac-de40-4127-bffb-0ed5a3d74f5f@github.com> Message-ID: <5MqLF_hxwlfwmVtPZDXNk63ajz5HmpKL00NbXMKOBeo=.8edfa2d7-4187-4113-b2cf-952ebe97ab4e@github.com> On Sat, 28 May 2022 02:21:35 GMT, Dean Long wrote: >> This PR does two things: >> - reverts the incorrect change to non_data_bits that included pointer_bits >> - treats the data() as an unsigned int to prevent a high bit being treated as a negative number > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > set vreg_max to a more reasonable limit (10000) Thanks Dean for evaluating the performance. Then I think it's good to go in. ------------- PR: https://git.openjdk.java.net/jdk/pull/8912 From jiefu at openjdk.java.net Wed Jun 1 09:53:36 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 1 Jun 2022 09:53:36 GMT Subject: RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3] In-Reply-To: References: Message-ID: On Thu, 26 May 2022 06:18:33 GMT, Fei Gao wrote: >> Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP. Here is an example: >> >> short[] addShort(short[] a, short[] b, short[] c) { >> for (int i = 0; i < SIZE; i++) { >> b[i] = (short) (a[i] + 8); // line A >> sres[i] = (short) (b[i] + c[i]); // line B >> } >> } >> >> However, similar cases of int/float/double/long/char type can be vectorized successfully. 
>> >> The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type first before it can work as one of source operands of addition on *line B*. The demotion is done by left-shifting 16 bits then right-shifting 16 bits. The ideal graph for the process is showed like below. >> ![image](https://user-images.githubusercontent.com/39403138/160074255-c751f84b-6511-4b56-927b-53fb512cf51b.png) >> >> In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it with int type rather than real short type. The int-type opearation RShiftI here can't be vectorized together with other short-type operations, like AddI(line B). The reason for byte loop cases is the same. Similar loop cases of char type could be vectorized because its demotion from int to char is done by `and` with mask rather than `lshift_rshift`. >> >> Therefore, we try to remove the patterns like `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing it in the mid-end by i-GVN is more reasonable. >> >> What we do in the mid-end is eliminating the sign extension before some subword integer operations like: >> >> >> int x, y; >> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 >> >> to >> >> short s = (short) (x OP y); >> >> >> In the patch, assuming that `x` can be any int number, we need guarantee that the optimization doesn't have any impact on result. Not all arithmetic logic OPs meet the requirements. 
For example, assuming that `Imm` equals `16`, `x` equals `131068`, >> `y` equals `50` and `OP` is division`/`, `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get different result with or without demotion before OP, because the upper 16 bits of division may have influence on the lower 16 bits of result, which can't be optimized. All optimizable opcodes are listed in StoreNode::no_need_sign_extension(), whose upper 16 bits of src operands don't influence the lower 16 bits of result for short >> type and upper 24 bits of src operand don't influence the lower 8 bits of dst operand for byte. >> >> After the patch, the short loop case above can be vectorized as: >> >> movi v18.8h, #0x8 >> ... >> ldr q16, [x14, #32] // vector load a[i] >> // vector add, a[i] + 8, no promotion or demotion >> add v17.8h, v16.8h, v18.8h >> str q17, [x6, #32] // vector store a[i] + 8, b[i] >> ldr q17, [x0, #32] // vector load c[i] >> // vector add, a[i] + c[i], no promotion or demotion >> add v16.8h, v17.8h, v16.8h >> // vector add, a[i] + c[i] + 8, no promotion or demotion >> add v16.8h, v16.8h, v18.8h >> str q16, [x11, #32] //vector store sres[i] >> ... >> >> >> The patch works for byte cases as well. >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~83% improvement with this patch. >> >> on AArch64: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 401.521 ? 0.033 ns/op >> addS 523 avgt 15 401.512 ? 0.021 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 68.444 ? 0.318 ns/op >> addS 523 avgt 15 69.847 ? 0.043 ns/op >> >> on x86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 454.102 ? 36.180 ns/op >> addS 523 avgt 15 432.245 ? 
22.640 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 75.812 ? 5.063 ns/op >> addS 523 avgt 15 72.839 ? 10.109 ns/op >> >> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 >> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 >> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 >> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into fg8282470 > > Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b > - Merge branch 'master' into fg8282470 > > Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7 > - 8282470: Eliminate useless sign extension before some subword integer operations > > Some loop cases of subword types, including byte and > short, can't be vectorized by C2's SLP. Here is an example: > ``` > short[] addShort(short[] a, short[] b, short[] c) { > for (int i = 0; i < SIZE; i++) { > b[i] = (short) (a[i] + 8); // *line A* > sres[i] = (short) (b[i] + c[i]); // *line B* > } > } > ``` > However, similar cases of int/float/double/long/char type can > be vectorized successfully. > > The reason why SLP can't vectorize the short case above is > that, as illustrated here[1], the result of the scalar add > operation on *line A* has been promoted to int type. It needs > to be narrowed to short type first before it can work as one > of source operands of addition on *line B*. 
The demotion is > done by left-shifting 16 bits then right-shifting 16 bits. > The ideal graph for the process is showed like below. > > LoadS a[i] 8 > \ / > AddI (line A) > / \ > StoreC b[i] Lshift 16bits > \ > RShiftI 16 bits LoadS c[i] > \ / > AddI (line B) > \ > StoreC sres[i] > > In SLP, for most short-type cases, we can determine the precise > type of the scalar int-type operation and finally execute it > with short-type vector operations[2], except rshift opcode and > abs in some situations[3]. But in this case, the source operand > of RShiftI is from LShiftI rather than from any LoadS[4], so we > can't determine its real type and conservatively assign it with > int type rather than real short type. The int-type opearation > RShiftI here can't be vectorized together with other short-type > operations, like AddI(line B). The reason for byte loop cases > is the same. Similar loop cases of char type could be > vectorized because its demotion from int to char is done by > `and` with mask rather than `lshift_rshift`. > > Therefore, we try to remove the patterns like > `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short > cases, to vectorize more scenarios. Optimizing it in the > mid-end by i-GVN is more reasonable. > > What we do in the mid-end is eliminating the sign extension > before some subword integer operations like: > > ``` > int x, y; > short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 > ``` > to > ``` > short s = (short) (x OP y); > ``` > > In the patch, assuming that `x` can be any int number, we need > guarantee that the optimization doesn't have any impact on > result. Not all arithmetic logic OPs meet the requirements. For > example, assuming that `Imm` equals `16`, `x` equals `131068`, > `y` equals `50` and `OP` is division`/`, > `short s = (short) (((131068 << 16) >> 16) / 50)` is not > equal to `short s = (short) (131068 / 50)`. 
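The counterexample just given can be checked directly. The following is a minimal sketch, not code from the patch: the class name `SignExtensionSketch` and the helper `signExtend16` are hypothetical, and the helper simply models the `(x << 16) >> 16` narrowing described in the quoted text. It compares an addition (where dropping the sign extension is harmless) with a division (where it is not), using the values `x = 131068` and `y = 50` from the example.

```java
public class SignExtensionSketch {
    // Hypothetical helper: narrow an int to 16 bits and sign-extend it again,
    // i.e. the LShiftI 16 / RShiftI 16 pair discussed in the thread.
    static int signExtend16(int x) {
        return (x << 16) >> 16;
    }

    public static void main(String[] args) {
        int x = 131068; // 0x1FFFC, low 16 bits are 0xFFFC (-4 as a short)
        int y = 50;

        // Addition: the low 16 bits of the sum depend only on the low 16 bits of the inputs.
        short addWithExt = (short) (signExtend16(x) + y);
        short addWithoutExt = (short) (x + y);

        // Division: the upper bits of the dividend influence the low 16 bits of the quotient.
        short divWithExt = (short) (signExtend16(x) / y);
        short divWithoutExt = (short) (x / y);

        System.out.println("add: " + addWithExt + " vs " + addWithoutExt); // add: 46 vs 46
        System.out.println("div: " + divWithExt + " vs " + divWithoutExt); // div: 0 vs 2621
    }
}
```

The addition agrees with and without the sign extension, while the division does not, which is consistent with division being excluded from the optimizable opcodes.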
When OP is division, > we may get different result with or without demotion > before OP, because the upper 16 bits of division may have > influence on the lower 16 bits of result, which can't be > optimized. All optimizable opcodes are listed in > StoreNode::no_need_sign_extension(), whose upper 16 bits of src > operands don't influence the lower 16 bits of result for short > type and upper 24 bits of src operand don't influence the lower > 8 bits of dst operand for byte. > > After the patch, the short loop case above can be vectorized as: > ``` > movi v18.8h, #0x8 > ... > ldr q16, [x14, #32] // vector load a[i] > // vector add, a[i] + 8, no promotion or demotion > add v17.8h, v16.8h, v18.8h > str q17, [x6, #32] // vector store a[i] + 8, b[i] > ldr q17, [x0, #32] // vector load c[i] > // vector add, a[i] + c[i], no promotion or demotion > add v16.8h, v17.8h, v16.8h > // vector add, a[i] + c[i] + 8, no promotion or demotion > add v16.8h, v16.8h, v18.8h > str q16, [x11, #32] //vector store sres[i] > ... > ``` > > The patch works for byte cases as well. > > Here is the performance data for micro-benchmark before > and after this patch on both AArch64 and x64 machines. > We can observe about ~83% improvement with this patch. > > on AArch64: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 401.521 ? 0.033 ns/op > addS 523 avgt 15 401.512 ? 0.021 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 68.444 ? 0.318 ns/op > addS 523 avgt 15 69.847 ? 0.043 ns/op > > on x86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 454.102 ? 36.180 ns/op > addS 523 avgt 15 432.245 ? 22.640 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 75.812 ? 5.063 ns/op > addS 523 avgt 15 72.839 ? 
10.109 ns/op
>
> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241
> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206
> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249
> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251
>
> Change-Id: I92ce42b550ef057964a3b58716436735275d8d31

So the current implementation only works for limited scenarios, right?

I'm not sure if there is a good way to eliminate the useless sign extension in `testShort2`. But I really hope this optimization can be used in more situations. Let's see if someone has a good idea. Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7954

From ysuenaga at openjdk.java.net Wed Jun 1 14:45:32 2022
From: ysuenaga at openjdk.java.net (Yasumasa Suenaga)
Date: Wed, 1 Jun 2022 14:45:32 GMT
Subject: Integrated: 8287491: compiler/jvmci/errors/TestInvalidDebugInfo.java fails new assert: assert((uint)t < T_CONFLICT + 1) failed: invalid type #
In-Reply-To:
References:
Message-ID:

On Tue, 31 May 2022 13:00:57 GMT, Yasumasa Suenaga wrote:

> We saw a new assertion error after [JDK-8286562](https://bugs.openjdk.java.net/browse/JDK-8286562) (#8646):
>
>
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # Internal Error (/home/ysuenaga/github-forked/jdk/src/hotspot/share/utilities/globalDefinitions.hpp:735), pid=2619, tid=2635
> # assert((uint)t < T_CONFLICT + 1) failed: invalid type
>
>
> It was caused by passing `JavaKind.Illegal` as the slot kind in TestInvalidDebugInfo.java. We should pass `JavaKind.Void` instead of `JavaKind.Illegal`.
>
> See JBS for more details (I attached the hs_err log).

This pull request has now been integrated.
Changeset: e3791ecf
Author: Yasumasa Suenaga
URL: https://git.openjdk.java.net/jdk/commit/e3791ecfe42ccb34548dd23d159087a86b669a46
Stats: 5 lines in 2 files changed: 0 ins; 1 del; 4 mod

8287491: compiler/jvmci/errors/TestInvalidDebugInfo.java fails new assert: assert((uint)t < T_CONFLICT + 1) failed: invalid type #

Reviewed-by: kvn, dnsimon

-------------

PR: https://git.openjdk.java.net/jdk/pull/8954

From jbhateja at openjdk.java.net Wed Jun 1 15:29:38 2022
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Wed, 1 Jun 2022 15:29:38 GMT
Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v13]
In-Reply-To:
References:
Message-ID: <2j4Kdk5-bqddG3BPO6dUdsM3OmbancqL6CYg3yz3n18=.8483ca07-5f5f-4bd1-8183-88f1feccf183@github.com>

On Wed, 1 Jun 2022 06:42:20 GMT, Srinivas Vamsi Parasa wrote:

>> We develop optimized x86 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `isInfinite()` for the Float and Double classes. JMH benchmarks show up to ~70% improvement using the `vfpclasss(s/d)` instructions.
>>
>>
>> Benchmark (ns/op)                  Baseline  Intrinsic(vfpclasss/d)  Speedup(%)
>> FloatClassCheck.testIsFinite          0.562                   0.406         28%
>> FloatClassCheck.testIsInfinite        0.815                   0.383         53%
>> FloatClassCheck.testIsNaN             0.63                    0.382         39%
>> DoubleClassCheck.testIsFinite         0.565                   0.409         28%
>> DoubleClassCheck.testIsInfinite       0.812                   0.375         54%
>> DoubleClassCheck.testIsNaN            0.631                   0.38          40%
>> FPComparison.isFiniteDouble         332.638                 272.577         18%
>> FPComparison.isFiniteFloat          413.217                 331.825         20%
>> FPComparison.isInfiniteDouble       874.897                 240.632         72%
>> FPComparison.isInfiniteFloat        872.279                 321.269         63%
>> FPComparison.isNanDouble            286.566                 240.36          16%
>> FPComparison.isNanFloat             346.123                 316.923          8%
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
> Support only IsInfinite with vfpclasss/d instruction

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5335:

> 5333:
> 5334: void C2_MacroAssembler::double_class_check_vfp(int opcode, Register dst, XMMRegister src, KRegister tmp) {
> 5335:   uint8_t imm8;

It may be OK to move this back into the instruction encoding block, since it is only two instructions.

src/hotspot/cpu/x86/x86.ad line 10148:

> 10146: instruct FloatClassCheck_reg_reg_vfpclass(rRegI dst, regF src, kReg ktmp, rFlagsReg cr)
> 10147: %{
> 10148:   predicate(VM_Version::supports_avx512dq());

The predicate is no longer needed; you already do the check in the match_rule_supported routine.
src/hotspot/cpu/x86/x86.ad line 10161:

> 10159: instruct DoubleClassCheck_reg_reg_vfpclass(rRegI dst, regD src, kReg ktmp, rFlagsReg cr)
> 10160: %{
> 10161:   predicate(VM_Version::supports_avx512dq());

Same as above.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From kvn at openjdk.java.net Wed Jun 1 16:05:40 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Wed, 1 Jun 2022 16:05:40 GMT
Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v9]
In-Reply-To:
References: <3nPsl4E5pwB2poYUil8N39ZnfxLlKP4KE8JYumTb0Mc=.8de322fc-0074-416f-a074-6b1d2c712d9b@github.com>
Message-ID:

On Wed, 1 Jun 2022 07:05:34 GMT, Srinivas Vamsi Parasa wrote:

>> I assume `Baseline` data includes #8525 changes. Right?
>
> Hello Vladimir (@vnkozlov),
>
> After reviewing the JMH performance data, we observed that only the `isInfinite()` method sees a performance benefit from the intrinsic (`vfpclassss/d` instruction). Thus, this PR is updated to support intrinsics for **only** the `isInfinite()` method.
>
> Please see the performance data below for all three methods. Please let me know if anything else is needed.
>
>
> Benchmark (ns/op)                       Baseline  Intrinsic     Speedup (%)
>                                                   (vfpclassss)
> -------------------------------------------------------------------------
> FloatClassCheck.testIsFiniteBranch         1.558      1.535            1%
> FloatClassCheck.testIsFiniteCMov           0.335      0.428          -28%
> FloatClassCheck.testIsFiniteStore          0.417      0.253           39%
> -------------------------------------------------------------------------
> FloatClassCheck.testIsInfiniteBranch       1.294      1.046           19%
> FloatClassCheck.testIsInfiniteCMov         0.823      0.351           57%
> FloatClassCheck.testIsInfiniteStore        0.748      0.234           69%
> -------------------------------------------------------------------------
> FloatClassCheck.testIsNaNBranch            1          1.147          -15%
> FloatClassCheck.testIsNaNCMov              0.297      0.352          -19%
> FloatClassCheck.testIsNaNStore             0.362      0.234           35%
>
> *************************************************************************
>
> Benchmark (ns/op)                       Baseline  Intrinsic     Speedup (%)
>                                                   (vfpclasssd)
> -------------------------------------------------------------------------
> DoubleClassCheck.testIsFiniteBranch        1.555      1.522            2%
> DoubleClassCheck.testIsFiniteCMov          0.325      0.444          -37%
> DoubleClassCheck.testIsFiniteStore         0.322      0.266           17%
> -------------------------------------------------------------------------
> DoubleClassCheck.testIsInfiniteBranch      1.261      1.094           13%
> DoubleClassCheck.testIsInfiniteCMov        0.821      0.373           55%
> DoubleClassCheck.testIsInfiniteStore       0.858      0.235           73%
> -------------------------------------------------------------------------
> DoubleClassCheck.testIsNaNBranch           1          1.127          -13%
> DoubleClassCheck.testIsNaNCMov             0.278      0.373          -34%
> DoubleClassCheck.testIsNaNStore            0.283      0.235           17%

@vamsi-parasa can you show the difference in generated code for `DoubleClassCheck.testIsFiniteCMov`, for example? Or maybe for all of them (for DoubleClassCheck).
I am fine with intrinsifying only `isInfinite()` but would like to see the code for the record.
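For readers following the numbers, here is a minimal sketch of what an `isInfinite()` check has to decide at the IEEE 754 bit level: the exponent bits are all ones and the mantissa is zero, regardless of sign. This is only an illustration and not the code in the PR or the JDK library implementation; the class name `IsInfiniteSketch` and the helper `isInfiniteBits` are hypothetical.

```java
public class IsInfiniteSketch {
    // Hypothetical helper: clear the sign bit, then compare against the
    // bit pattern of +Infinity (exponent all ones, mantissa zero).
    static boolean isInfiniteBits(double v) {
        long bits = Double.doubleToRawLongBits(v);
        return (bits & 0x7fffffffffffffffL) == 0x7ff0000000000000L;
    }

    // Same idea for float: 8 exponent bits, 23 mantissa bits.
    static boolean isInfiniteBits(float v) {
        int bits = Float.floatToRawIntBits(v);
        return (bits & 0x7fffffff) == 0x7f800000;
    }

    public static void main(String[] args) {
        double[] doubles = {
            0.0, -0.0, 1.5, Double.MAX_VALUE, Double.MIN_VALUE,
            Double.NaN, Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY
        };
        for (double d : doubles) {
            if (isInfiniteBits(d) != Double.isInfinite(d)) {
                throw new AssertionError("mismatch for double " + d);
            }
        }
        float[] floats = {
            0.0f, -0.0f, 1.5f, Float.MAX_VALUE, Float.MIN_VALUE,
            Float.NaN, Float.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY
        };
        for (float f : floats) {
            if (isInfiniteBits(f) != Float.isInfinite(f)) {
                throw new AssertionError("mismatch for float " + f);
            }
        }
        System.out.println("bit-level checks agree with Double.isInfinite and Float.isInfinite");
    }
}
```

NaN values share the all-ones exponent but have a nonzero mantissa, so the equality test rejects them.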
------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed Jun 1 16:16:38 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 1 Jun 2022 16:16:38 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v9] In-Reply-To: References: <3nPsl4E5pwB2poYUil8N39ZnfxLlKP4KE8JYumTb0Mc=.8de322fc-0074-416f-a074-6b1d2c712d9b@github.com> Message-ID: On Wed, 1 Jun 2022 07:05:34 GMT, Srinivas Vamsi Parasa wrote: >> I assume `Baseline` data includes #8525 changes. Right? > > Hello Validmir (@vnkozlov), > > After reviewing the JMH performance data, it was observed that only the` isInfinite()` method sees performance benefit using intrinsics (`vfpclassss/d` instruction). Thus, this PR is updated to support intrinsics for **only** the `isInfinite()` method. > > Please see the performance data below for all the 3 methods. Please let me know if anything else is needed. > > > Benchmark (ns/op) Baseline Intrinsic Speedup (%) > (vfpclassss) > ------------------------------------------------------------------- > FloatClassCheck.testIsFiniteBranch 1.558 1.535 1% > FloatClassCheck.testIsFiniteCMov 0.335 0.428 -28% > FloatClassCheck.testIsFiniteStore 0.417 0.253 39% > --------------------------------------------------------------------- > FloatClassCheck.testIsInfiniteBranch 1.294 1.046 19% > FloatClassCheck.testIsInfiniteCMov 0.823 0.351 57% > FloatClassCheck.testIsInfiniteStore 0.748 0.234 69% > ---------------------------------------------------------------------- > FloatClassCheck.testIsNaNBranch 1 1.147 -15% > FloatClassCheck.testIsNaNCMov 0.297 0.352 -19% > FloatClassCheck.testIsNaNStore 0.362 0.234 35% > > ********************************************************************* > > Benchmark (ns/op) Baseline Intrinsic Speedup (%) > (vfpclasssd) > ------------------------------------------------------------------------ > DoubleClassCheck.testIsFiniteBranch 1.555 1.522 2% > DoubleClassCheck.testIsFiniteCMov 0.325 
0.444 -37% > DoubleClassCheck.testIsFiniteStore 0.322 0.266 17% > ----------------------------------------------------------------------- > DoubleClassCheck.testIsInfiniteBranch 1.261 1.094 13% > DoubleClassCheck.testIsInfiniteCMov 0.821 0.373 55% > DoubleClassCheck.testIsInfiniteStore 0.858 0.235 73% > ----------------------------------------------------------------------- > DoubleClassCheck.testIsNaNBranch 1 1.127 -13% > DoubleClassCheck.testIsNaNCMov 0.278 0.373 -34% > DoubleClassCheck.testIsNaNStore 0.283 0.235 17% > @vamsi-parasa can you show difference in generated code for `DoubleClassCheck.testIsFiniteCMov ` for example? Or may be for all of them (for DoubleClassCheck). I am fine with intrinsifying only `isInfinite()` but would like to see code for the record. Sure Vladimir, will post the generated code of all the case of DoubleClassCheck. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed Jun 1 16:16:42 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 1 Jun 2022 16:16:42 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v13] In-Reply-To: <2j4Kdk5-bqddG3BPO6dUdsM3OmbancqL6CYg3yz3n18=.8483ca07-5f5f-4bd1-8183-88f1feccf183@github.com> References: <2j4Kdk5-bqddG3BPO6dUdsM3OmbancqL6CYg3yz3n18=.8483ca07-5f5f-4bd1-8183-88f1feccf183@github.com> Message-ID: <1OTjJH1S8y5nlBON-Y6zHiLhiNNy_FIxBGtZIaUAsEE=.01a9465b-3da0-406d-a575-c14cc015aeda@github.com> On Wed, 1 Jun 2022 15:24:19 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> Support only IsInfinite with vfpclasss/d instruction > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5335: > >> 5333: >> 5334: void C2_MacroAssembler::double_class_check_vfp(int opcode, Register dst, XMMRegister src, KRegister tmp) { >> 5335: uint8_t imm8; > > May be ok to move it back to instruction encoding block , only two instructions. 
That's true. The two instructions can be put in the instruction encoding block. Will do that. > src/hotspot/cpu/x86/x86.ad line 10148: > >> 10146: instruct FloatClassCheck_reg_reg_vfpclass(rRegI dst, regF src, kReg ktmp, rFlagsReg cr) >> 10147: %{ >> 10148: predicate(VM_Version::supports_avx512dq()); > > Predicate is no longer needed you have done the check in match_rule_supported routine. That's true, the check is already done in the match rule. Will update the code. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From jrose at openjdk.java.net Wed Jun 1 16:41:45 2022 From: jrose at openjdk.java.net (John R Rose) Date: Wed, 1 Jun 2022 16:41:45 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v8] In-Reply-To: References: Message-ID: <1YSejh9qkhtZF4qX0rqx7qa0bhcg6AzoZYv5jrEHFbc=.741e508f-6907-4645-92d4-030397043e0c@github.com> On Mon, 30 May 2022 06:56:27 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> - Patch intrinsifies following newly added Java SE APIs >> - Integer.compress >> - Integer.expand >> - Long.compress >> - Long.expand >> >> - Adds C2 IR nodes and corresponding ideal transformations for new operations. >> - We see around ~10x performance speedup due to intrinsification over X86 target. >> - Adds an IR framework based test to validate newly introduced IR transformations. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 12 additional commits since the last revision: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - 8283894: Extending new IR value routines with value propagation logic. > - 8283894: Disabling sanity test as per review suggestion. 
> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - 8283894: Removing CompressExpandSanityTest from problem list. > - 8283894: Updating test tag spec. > - 8283894: Review comments resolved. > - 8283894: Add missing -XX:+UnlockDiagnosticVMOptions. > - 8283894: Review comments resolutions. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - ... and 2 more: https://git.openjdk.java.net/jdk/compare/0b6737d2...a36dba2e Yes. Thank you. Final suggestion: Factor out 2 blocks of code that implement compress and expand, for C2 constant folding (Value methods), into 2 static routines. Maybe throw in a static assert for a smoke check on those factored routines. ------------- Marked as reviewed by jrose (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8498 From duke at openjdk.java.net Wed Jun 1 17:54:43 2022 From: duke at openjdk.java.net (Cesar Soares) Date: Wed, 1 Jun 2022 17:54:43 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v15] In-Reply-To: <1U5ThvC2Kp9pmy2KP_WkP5Qsv4ZmomA093MBjNzJho0=.5efe9861-32eb-4eb2-9dc9-5a6871fe6cc7@github.com> References: <1U5ThvC2Kp9pmy2KP_WkP5Qsv4ZmomA093MBjNzJho0=.5efe9861-32eb-4eb2-9dc9-5a6871fe6cc7@github.com> Message-ID: On Mon, 23 May 2022 18:18:39 GMT, aamarsh wrote: >> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. 
Below is an example run:
>>
>>
>> No escape = 263, Arg escape = 87, Global escape = 1628
>> Objects scalar replaced = 193, Monitor objects removed = 32, GC barriers removed = 38, Memory barriers removed = 225
>
> aamarsh has updated the pull request incrementally with two additional commits since the last revision:
>
> - delete iterative EA comment
> - account for iterative EA

Hi, can someone please sponsor this PR if all looks good? TIA!

-------------

PR: https://git.openjdk.java.net/jdk/pull/8019

From kvn at openjdk.java.net Wed Jun 1 18:31:27 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Wed, 1 Jun 2022 18:31:27 GMT
Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v15]
In-Reply-To: <1U5ThvC2Kp9pmy2KP_WkP5Qsv4ZmomA093MBjNzJho0=.5efe9861-32eb-4eb2-9dc9-5a6871fe6cc7@github.com>
References: <1U5ThvC2Kp9pmy2KP_WkP5Qsv4ZmomA093MBjNzJho0=.5efe9861-32eb-4eb2-9dc9-5a6871fe6cc7@github.com>
Message-ID:

On Mon, 23 May 2022 18:18:39 GMT, aamarsh wrote:

>> Escape Analysis and Scalar Replacement statistics are now printed when the -XX:+PrintOptoStatistics flag is set. All code is placed in an `#ifndef PRODUCT` block, so it only runs in a debug build. Using the Renaissance benchmark suite, I ran a few tests to confirm that the numbers print correctly. Below is an example run:
>>
>>
>> No escape = 263, Arg escape = 87, Global escape = 1628
>> Objects scalar replaced = 193, Monitor objects removed = 32, GC barriers removed = 38, Memory barriers removed = 225
>
> aamarsh has updated the pull request incrementally with two additional commits since the last revision:
>
> - delete iterative EA comment
> - account for iterative EA

Give me time to run a quick test to make sure it builds on all our systems. I will sponsor after that.
------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From dlong at openjdk.java.net Wed Jun 1 18:32:35 2022 From: dlong at openjdk.java.net (Dean Long) Date: Wed, 1 Jun 2022 18:32:35 GMT Subject: Integrated: 8287396 LIR_Opr::vreg_number() and data() can return negative number In-Reply-To: References: Message-ID: On Fri, 27 May 2022 03:23:47 GMT, Dean Long wrote: > This PR does two things: > - reverts the incorrect change to non_data_bits that included pointer_bits > - treats the data() as an unsigned int to prevent a high bit being treated as a negative number This pull request has now been integrated. Changeset: cdb47688 Author: Dean Long URL: https://git.openjdk.java.net/jdk/commit/cdb476888a65b8ee2538f08b4b1dbb245874a262 Stats: 8 lines in 1 file changed: 2 ins; 1 del; 5 mod 8287396: LIR_Opr::vreg_number() and data() can return negative number Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.java.net/jdk/pull/8912 From duke at openjdk.java.net Wed Jun 1 18:35:29 2022 From: duke at openjdk.java.net (Cesar Soares) Date: Wed, 1 Jun 2022 18:35:29 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v15] In-Reply-To: <1U5ThvC2Kp9pmy2KP_WkP5Qsv4ZmomA093MBjNzJho0=.5efe9861-32eb-4eb2-9dc9-5a6871fe6cc7@github.com> References: <1U5ThvC2Kp9pmy2KP_WkP5Qsv4ZmomA093MBjNzJho0=.5efe9861-32eb-4eb2-9dc9-5a6871fe6cc7@github.com> Message-ID: On Mon, 23 May 2022 18:18:39 GMT, aamarsh wrote: >> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. 
Below is an example run:
>>
>>
>> No escape = 263, Arg escape = 87, Global escape = 1628
>> Objects scalar replaced = 193, Monitor objects removed = 32, GC barriers removed = 38, Memory barriers removed = 225
>
> aamarsh has updated the pull request incrementally with two additional commits since the last revision:
>
>  - delete iterative EA comment
>  - account for iterative EA

No worries. Thank you Vladimir!

-------------

PR: https://git.openjdk.java.net/jdk/pull/8019

From xliu at openjdk.java.net Wed Jun 1 19:06:18 2022
From: xliu at openjdk.java.net (Xin Liu)
Date: Wed, 1 Jun 2022 19:06:18 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v7]
In-Reply-To:
References:
Message-ID:

> I found that some phi nodes are useful because their uses are uncommon_trap nodes. In the worst case, this hinders boxing object elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows the c2 parser to collect liveness based on the next bci for unstable_if traps. In most cases, the next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps.
>
> This is not a REDO of Optimization of Box nodes in uncommon_trap (JDK-8261137). The two patches are orthogonal. I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of the credit goes to the original author. I found that scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in the original test even if it can't be removed by the late inliner. I tweaked the profiling data of Integer.valueOf() to hinder scalar replacement.
>
> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got a 9x speedup on x86_64. It's almost the same as JDK-8261137.
>
> Before:
>
> Benchmark               Mode  Cnt   Score   Error  Units
> MyBenchmark.testMethod  avgt   10  32.776 ± 0.075  us/op
>
> After:
>
> Benchmark               Mode  Cnt   Score   Error  Units
> MyBenchmark.testMethod  avgt   10   3.656 ± 0.006  us/op
>
> Testing
> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet.

Xin Liu has updated the pull request incrementally with one additional commit since the last revision:

  Remove useless flag. if jdwp is on, liveness_at_bci() marks all local variables live.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8545/files
  - new: https://git.openjdk.java.net/jdk/pull/8545/files/4fdd1c88..6e9f2670

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=06
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=05-06

  Stats: 26 lines in 7 files changed: 1 ins; 15 del; 10 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8545.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8545/head:pull/8545

PR: https://git.openjdk.java.net/jdk/pull/8545

From xliu at openjdk.java.net Wed Jun 1 19:20:33 2022
From: xliu at openjdk.java.net (Xin Liu)
Date: Wed, 1 Jun 2022 19:20:33 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v2]
In-Reply-To:
References:
Message-ID:

On Thu, 26 May 2022 20:15:05 GMT, Vladimir Kozlov wrote:

> You need to verify that calls are removed from lists (for example call valueOf() from _boxing_late_inlines list). If calls are indeed removed from lists by calling igvn.optimize() then I will be fine with your code.

Hi @vnkozlov,

from my reading, `valueOf()` in this [example](https://github.com/openjdk/jdk/pull/8545/files#diff-77b63ebfe062d7bec62f831c7a5f2c77ad9767f3d4a6be87ea22337e50b70192R47) is removed in the late inliner. It is pure and nobody consumes its return value. This is my goal in this PR.

I will explore full-fledged useless elimination after [JDK-8287385](https://bugs.openjdk.java.net/browse/JDK-8287385). I think it can reveal more chances than this. I cleaned up this patch. Could you take a look at this?
thanks,
--lx

-------------

PR: https://git.openjdk.java.net/jdk/pull/8545

From kvn at openjdk.java.net Wed Jun 1 19:34:32 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Wed, 1 Jun 2022 19:34:32 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v2]
In-Reply-To:
References:
Message-ID:

On Wed, 1 Jun 2022 19:16:47 GMT, Xin Liu wrote:

> > You need to verify that calls are removed from lists (for example call valueOf() from _boxing_late_inlines list). If calls are indeed removed from lists by calling igvn.optimize() then I will be fine with your code.
>
> Hi @vnkozlov,
>
> from my reading, `valueOf()` in this [example](https://github.com/openjdk/jdk/pull/8545/files#diff-77b63ebfe062d7bec62f831c7a5f2c77ad9767f3d4a6be87ea22337e50b70192R47) is removed in late-inliner. It is pure and nobody consumes its return value. This is my goal in this PR.

Got it.

> I will explore full-fledged useless elimination after [JDK-8287385](https://bugs.openjdk.java.net/browse/JDK-8287385). I think it can reveal more chances than this. I clean up this patch. could you take a look at this?

Sounds good. I will review your latest changes today.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8545

From duke at openjdk.java.net Wed Jun 1 19:56:02 2022
From: duke at openjdk.java.net (Cesar Soares)
Date: Wed, 1 Jun 2022 19:56:02 GMT
Subject: RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3]
In-Reply-To:
References:
Message-ID:

On Thu, 26 May 2022 06:18:33 GMT, Fei Gao wrote:

>> Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP. Here is an example:
>>
>>   short[] addShort(short[] a, short[] b, short[] c) {
>>     for (int i = 0; i < SIZE; i++) {
>>       b[i] = (short) (a[i] + 8);          // line A
>>       sres[i] = (short) (b[i] + c[i]);    // line B
>>     }
>>   }
>>
>> However, similar cases of int/float/double/long/char type can be vectorized successfully.
>> >> The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type first before it can work as one of source operands of addition on *line B*. The demotion is done by left-shifting 16 bits then right-shifting 16 bits. The ideal graph for the process is showed like below. >> ![image](https://user-images.githubusercontent.com/39403138/160074255-c751f84b-6511-4b56-927b-53fb512cf51b.png) >> >> In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it with int type rather than real short type. The int-type opearation RShiftI here can't be vectorized together with other short-type operations, like AddI(line B). The reason for byte loop cases is the same. Similar loop cases of char type could be vectorized because its demotion from int to char is done by `and` with mask rather than `lshift_rshift`. >> >> Therefore, we try to remove the patterns like `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing it in the mid-end by i-GVN is more reasonable. >> >> What we do in the mid-end is eliminating the sign extension before some subword integer operations like: >> >> >> int x, y; >> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 >> >> to >> >> short s = (short) (x OP y); >> >> >> In the patch, assuming that `x` can be any int number, we need guarantee that the optimization doesn't have any impact on result. Not all arithmetic logic OPs meet the requirements. 
For example, assuming that `Imm` equals `16`, `x` equals `131068`, >> `y` equals `50` and `OP` is division`/`, `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get different result with or without demotion before OP, because the upper 16 bits of division may have influence on the lower 16 bits of result, which can't be optimized. All optimizable opcodes are listed in StoreNode::no_need_sign_extension(), whose upper 16 bits of src operands don't influence the lower 16 bits of result for short >> type and upper 24 bits of src operand don't influence the lower 8 bits of dst operand for byte. >> >> After the patch, the short loop case above can be vectorized as: >> >> movi v18.8h, #0x8 >> ... >> ldr q16, [x14, #32] // vector load a[i] >> // vector add, a[i] + 8, no promotion or demotion >> add v17.8h, v16.8h, v18.8h >> str q17, [x6, #32] // vector store a[i] + 8, b[i] >> ldr q17, [x0, #32] // vector load c[i] >> // vector add, a[i] + c[i], no promotion or demotion >> add v16.8h, v17.8h, v16.8h >> // vector add, a[i] + c[i] + 8, no promotion or demotion >> add v16.8h, v16.8h, v18.8h >> str q16, [x11, #32] //vector store sres[i] >> ... >> >> >> The patch works for byte cases as well. >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~83% improvement with this patch. >> >> on AArch64: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 401.521 ? 0.033 ns/op >> addS 523 avgt 15 401.512 ? 0.021 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 68.444 ? 0.318 ns/op >> addS 523 avgt 15 69.847 ? 0.043 ns/op >> >> on x86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 454.102 ? 36.180 ns/op >> addS 523 avgt 15 432.245 ? 
22.640 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 75.812 ? 5.063 ns/op >> addS 523 avgt 15 72.839 ? 10.109 ns/op >> >> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 >> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 >> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 >> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into fg8282470 > > Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b > - Merge branch 'master' into fg8282470 > > Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7 > - 8282470: Eliminate useless sign extension before some subword integer operations > > Some loop cases of subword types, including byte and > short, can't be vectorized by C2's SLP. Here is an example: > ``` > short[] addShort(short[] a, short[] b, short[] c) { > for (int i = 0; i < SIZE; i++) { > b[i] = (short) (a[i] + 8); // *line A* > sres[i] = (short) (b[i] + c[i]); // *line B* > } > } > ``` > However, similar cases of int/float/double/long/char type can > be vectorized successfully. > > The reason why SLP can't vectorize the short case above is > that, as illustrated here[1], the result of the scalar add > operation on *line A* has been promoted to int type. It needs > to be narrowed to short type first before it can work as one > of source operands of addition on *line B*. 
The demotion is > done by left-shifting 16 bits then right-shifting 16 bits. > The ideal graph for the process is showed like below. > > LoadS a[i] 8 > \ / > AddI (line A) > / \ > StoreC b[i] Lshift 16bits > \ > RShiftI 16 bits LoadS c[i] > \ / > AddI (line B) > \ > StoreC sres[i] > > In SLP, for most short-type cases, we can determine the precise > type of the scalar int-type operation and finally execute it > with short-type vector operations[2], except rshift opcode and > abs in some situations[3]. But in this case, the source operand > of RShiftI is from LShiftI rather than from any LoadS[4], so we > can't determine its real type and conservatively assign it with > int type rather than real short type. The int-type opearation > RShiftI here can't be vectorized together with other short-type > operations, like AddI(line B). The reason for byte loop cases > is the same. Similar loop cases of char type could be > vectorized because its demotion from int to char is done by > `and` with mask rather than `lshift_rshift`. > > Therefore, we try to remove the patterns like > `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short > cases, to vectorize more scenarios. Optimizing it in the > mid-end by i-GVN is more reasonable. > > What we do in the mid-end is eliminating the sign extension > before some subword integer operations like: > > ``` > int x, y; > short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 > ``` > to > ``` > short s = (short) (x OP y); > ``` > > In the patch, assuming that `x` can be any int number, we need > guarantee that the optimization doesn't have any impact on > result. Not all arithmetic logic OPs meet the requirements. For > example, assuming that `Imm` equals `16`, `x` equals `131068`, > `y` equals `50` and `OP` is division`/`, > `short s = (short) (((131068 << 16) >> 16) / 50)` is not > equal to `short s = (short) (131068 / 50)`. 
When OP is division, > we may get different result with or without demotion > before OP, because the upper 16 bits of division may have > influence on the lower 16 bits of result, which can't be > optimized. All optimizable opcodes are listed in > StoreNode::no_need_sign_extension(), whose upper 16 bits of src > operands don't influence the lower 16 bits of result for short > type and upper 24 bits of src operand don't influence the lower > 8 bits of dst operand for byte. > > After the patch, the short loop case above can be vectorized as: > ``` > movi v18.8h, #0x8 > ... > ldr q16, [x14, #32] // vector load a[i] > // vector add, a[i] + 8, no promotion or demotion > add v17.8h, v16.8h, v18.8h > str q17, [x6, #32] // vector store a[i] + 8, b[i] > ldr q17, [x0, #32] // vector load c[i] > // vector add, a[i] + c[i], no promotion or demotion > add v16.8h, v17.8h, v16.8h > // vector add, a[i] + c[i] + 8, no promotion or demotion > add v16.8h, v16.8h, v18.8h > str q16, [x11, #32] //vector store sres[i] > ... > ``` > > The patch works for byte cases as well. > > Here is the performance data for micro-benchmark before > and after this patch on both AArch64 and x64 machines. > We can observe about ~83% improvement with this patch. > > on AArch64: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 401.521 ? 0.033 ns/op > addS 523 avgt 15 401.512 ? 0.021 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 68.444 ? 0.318 ns/op > addS 523 avgt 15 69.847 ? 0.043 ns/op > > on x86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 454.102 ? 36.180 ns/op > addS 523 avgt 15 432.245 ? 22.640 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 75.812 ? 5.063 ns/op > addS 523 avgt 15 72.839 ? 
10.109 ns/op
>
> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241
> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206
> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249
> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251
>
> Change-Id: I92ce42b550ef057964a3b58716436735275d8d31

Added some "NITs". I'm not a reviewer so feel free to just ignore my comments :-)

src/hotspot/share/opto/memnode.cpp line 2784:

> 2782: // the given subword integer operation. So that we can optimize it out by
> 2783: // StoreNode::Ideal_masked_or_sign_extended_input.
> 2784: bool StoreNode::no_need_sign_extension(int opc) {

NIT: suggest changing the method to receive "Node*".

src/hotspot/share/opto/memnode.cpp line 2802:

> 2800: // Check if the node can be recognized as the pattern:
> 2801: // (AndI valIn conIa) and (conIa & mask == mask)
> 2802: bool StoreNode::is_masked_input(Node* input, PhaseGVN* phase, uint mask) {

NIT: suggest changing "phase" to be the first parameter.

src/hotspot/share/opto/memnode.cpp line 2815:

> 2813: // Check if the node can be recognized as the pattern
> 2814: // (RShiftI _ (LShiftI _ valIn conIL ) conIR) and (conIL == conIR && conIR <= num_bits)
> 2815: bool StoreNode::is_sign_extended_input(Node* input, PhaseGVN* phase, int num_bits) {

NIT: suggest changing "phase" to be the first parameter.

src/hotspot/share/opto/memnode.hpp line 562:

> 560:   virtual bool depends_only_on_test() const { return false; }
> 561:
> 562:   bool used_only_for_this_opcode(Node* in);

NIT: can these methods be const and/or static?

src/hotspot/share/opto/memnode.hpp line 567:

> 565:   bool is_sign_extended_input(Node* input, PhaseGVN* phase, int num_bits);
> 566:
> 567:   Node* Ideal_masked_or_sign_extended_input(PhaseGVN* phase, uint mask, int num_bits);

NIT: Looks like this method name follows the coding style.

test/hotspot/jtreg/compiler/c2/irTests/TestIRStoreCorBAddIMask.java line 57:

> 55:
> 56:     @Test
> 57:     @Arguments({Argument.DEFAULT, Argument.DEFAULT})

NIT: Wouldn't it be better to use RANDOM_EACH in the parameters of this and other methods?
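
For readers following the quoted PR description, the division counterexample given there (`x = 131068`, `y = 50`, `Imm = 16`) can be checked with a small standalone program. This is only an illustrative sketch: the class and method names below are made up for this note and are not part of the patch. It contrasts addition, where dropping the `(x << 16) >> 16` sign extension does not change the narrowed result, with division, where it does.

```java
// Illustrative sketch only; SignExtensionDemo and its methods are hypothetical names.
final class SignExtensionDemo {
    // Sign-extends the low 16 bits of x, i.e. the (x << 16) >> 16 pattern.
    static int signExtend16(int x) {
        return (x << 16) >> 16;
    }

    static short addWithExtension(int x, int y) {
        return (short) (signExtend16(x) + y);
    }

    static short addPlain(int x, int y) {
        return (short) (x + y);
    }

    static short divWithExtension(int x, int y) {
        return (short) (signExtend16(x) / y);
    }

    static short divPlain(int x, int y) {
        return (short) (x / y);
    }

    public static void main(String[] args) {
        int x = 131068; // 0x1FFFC: the low 16 bits are 0xFFFC, which is -4 as a short
        int y = 50;
        // Addition: the upper 16 bits of x never reach the low 16 bits of the sum.
        System.out.println("add: " + addWithExtension(x, y) + " vs " + addPlain(x, y));
        // Division: the upper bits of x do influence the low 16 bits of the quotient.
        System.out.println("div: " + divWithExtension(x, y) + " vs " + divPlain(x, y));
    }
}
```

With these inputs the two addition results agree, while the two division results differ, which matches the reasoning in the description for why only some opcodes qualify.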

-------------

PR: https://git.openjdk.java.net/jdk/pull/7954

From kvn at openjdk.java.net Wed Jun 1 20:20:24 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Wed, 1 Jun 2022 20:20:24 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v7]
In-Reply-To:
References:
Message-ID:

On Wed, 1 Jun 2022 19:06:18 GMT, Xin Liu wrote:

>> I found that some phi nodes are useful because their uses are uncommon_trap nodes. In the worst case, this hinders boxing object elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows the c2 parser to collect liveness based on the next bci for unstable_if traps. In most cases, the next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps.
>>
>> This is not a REDO of Optimization of Box nodes in uncommon_trap (JDK-8261137). The two patches are orthogonal. I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of the credit goes to the original author. I found that scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in the original test even if it can't be removed by the late inliner. I tweaked the profiling data of Integer.valueOf() to hinder scalar replacement.
>>
>> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got a 9x speedup on x86_64. It's almost the same as JDK-8261137.
>>
>> Before:
>>
>> Benchmark               Mode  Cnt   Score   Error  Units
>> MyBenchmark.testMethod  avgt   10  32.776 ± 0.075  us/op
>>
>> After:
>>
>> Benchmark               Mode  Cnt   Score   Error  Units
>> MyBenchmark.testMethod  avgt   10   3.656 ± 0.006  us/op
>>
>> Testing
>> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet.
>
> Xin Liu has updated the pull request incrementally with one additional commit since the last revision:
>
>   Remove useless flag.
if jdwp is on, liveness_at_bci() marks all local
>   variables live.

Few comments.

src/hotspot/share/opto/compile.cpp line 608:

> 606:                   _expensive_nodes   (comp_arena(), 8, 0, NULL),
> 607:                   _for_post_loop_igvn(comp_arena(), 8, 0, NULL),
> 608:                   _unstable_ifs      (comp_arena(), 8, 0, NULL),

I think `unstable_ifs` should be `unstable_if_traps` here and in all related method names and variables. You record traps now.

src/hotspot/share/opto/compile.cpp line 776:

> 774:
> 775:     preprocess_unstable_ifs();
> 776:

Why call preprocessing here and not after `PhaseRemoveUseless`, which can remove some paths?

src/hotspot/share/opto/compile.cpp line 1864:

> 1862: }
> 1863:
> 1864: void Compile::invalidate_unstable_if(CallStaticJavaNode* unc) {

Add a description comment for this method: what it is used for.

src/hotspot/share/opto/compile.cpp line 1877:

> 1875: uint unstable_ifs_all = 0;
> 1876:
> 1877: void Compile::preprocess_unstable_ifs() {

I assume it will be used in the product VM by the next changes (8287385).
Otherwise the counters and the method should be under `#ifndef PRODUCT` because they are used only for statistics.

src/hotspot/share/opto/compile.cpp line 1899:

> 1897:     int next_bci = trap->next_bci();
> 1898:
> 1899:     if (next_bci != -1 && !_dead_node_list.test(unc->_idx)) {

Did you consider removing items from `_unstable_ifs` in `Compile::remove_useless_nodes()` to avoid the `_dead_node_list` check here? You would need a specialized `remove_useless__unstable_ifs(useful)`.

src/hotspot/share/opto/parse.hpp line 611:

> 609: class UnstableIfTrap {
> 610:   CallStaticJavaNode* _unc;
> 611:   Parse::Block* _path;  // the pruned path

`const` for these 2 fields?

src/hotspot/share/opto/parse.hpp line 635:

> 633:   // or if _path has more than one predecessor and has been parsed, _unc does not mask out any real code.
> 634:   bool is_trivial() const {
> 635:     return _path->is_parsed();

Should you also check `_next_bci != -1`?

src/hotspot/share/opto/parse.hpp line 648:

> 646:   inline void* operator new(size_t x) throw() {
> 647:     Compile* C = Compile::current();
> 648:     return C->node_arena()->AmallocWords(x);

Use `comp_arena`, the same place where `_unstable_ifs` is allocated.

src/hotspot/share/opto/parse1.cpp line 98:

> 96:
> 97:   if (unstable_ifs_all) {
> 98:     tty->print_cr("%u trivial unstable_ifs (%2d%%)", trivial_unstable_ifs,

"trivial unstable_ifs traps"

test/hotspot/jtreg/compiler/c2/irTests/TestAggressiveLivenessForUnstableIf.java line 2:

> 1: /*
> 2:  * Copyright Amazon.com Inc. or its affiliates. All Rights Reserved.

Missing year.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8545

From duke at openjdk.java.net Wed Jun 1 20:22:35 2022
From: duke at openjdk.java.net (Srinivas Vamsi Parasa)
Date: Wed, 1 Jun 2022 20:22:35 GMT
Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v9]
In-Reply-To:
References: <3nPsl4E5pwB2poYUil8N39ZnfxLlKP4KE8JYumTb0Mc=.8de322fc-0074-416f-a074-6b1d2c712d9b@github.com>
Message-ID:

On Wed, 1 Jun 2022 16:14:37 GMT, Srinivas Vamsi Parasa wrote:

> @vamsi-parasa can you show difference in generated code for `DoubleClassCheck.testIsFiniteCMov ` for example? Or may be for all of them (for DoubleClassCheck). I am fine with intrinsifying only `isInfinite()` but would like to see code for the record.

Hi Vladimir (@vnkozlov), please see the generated code below.
The `isInfinite()` intrinsic is faster because the baseline produces two `vucomisd` instructions to compare with positive and negative infinity, whereas the intrinsic generates only one `vfpclasssd` instruction, as seen below:

**DoubleClassCheck.testIsInfiniteBranch (baseline)**

    vucomisd -0xad(%rip),%xmm0
    jp     0x00007fe1aced1f31
    je     0x00007fe1aced1f3d
    vucomisd -0xb1(%rip),%xmm0
    jp     0x00007fe1aced1ee0
    jne    0x00007fe1aced1ee0

**DoubleClassCheck.testIsInfiniteBranch (intrinsic/vfpclasssd)**

    vfpclasssd $0x18,%xmm0,%k7
    kmovb  %k7,%edi
    test   %edi,%edi
    je     0x00007f1270c167e0

-------------------------

However, in the case of `isFinite()` and `isNaN()`, the baseline generates only one `vucomisd`, and it is faster for the `Cmov` and `branch/call` tests, as seen below:

**DoubleClassCheck.testIsFiniteCmov (baseline)**

    vucomisd %xmm0,%xmm1
    mov    $0x9,%ecx
    cmovb  %r8d,%ecx

**DoubleClassCheck.testIsFiniteCmov (intrinsic/vfpclasssd)**

    vfpclasssd $0x99,%xmm2,%k7
    kmovb  %k7,%r11d
    xor    $0x1,%r11d
    test   %r11d,%r11d
    mov    $0x7,%r11d
    cmovne %r8d,%r11d

-------------------------------------------------------------------------

**DoubleClassCheck.testIsNaNCmov (baseline)**

    vucomisd %xmm0,%xmm0
    jp     0x00007f0a4d1425fb
    movl   $0x7,0x10(%rcx,%rax,4)

**DoubleClassCheck.testIsNaNCmov (intrinsic/vfpclasssd)**

    vfpclasssd $0x81,%xmm0,%k7
    kmovb  %k7,%r8d
    test   %r8d,%r8d
    jne    0x00007f7fa0c50f43
    movl   $0x7,0x10(%r10,%rdx,4)

Please let me know if further data is needed.

Thanks,
Vamsi

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From duke at openjdk.java.net Wed Jun 1 20:23:29 2022
From: duke at openjdk.java.net (aamarsh)
Date: Wed, 1 Jun 2022 20:23:29 GMT
Subject: Integrated: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics
In-Reply-To:
References:
Message-ID:

On Tue, 29 Mar 2022 17:31:29 GMT, aamarsh wrote:

> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build.
Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. Below is an example run:
>
>
> No escape = 263, Arg escape = 87, Global escape = 1628
> Objects scalar replaced = 193, Monitor objects removed = 32, GC barriers removed = 38, Memory barriers removed = 225

This pull request has now been integrated.

Changeset: 2f191442
Author:    Ana Marsh
Committer: Vladimir Kozlov
URL:       https://git.openjdk.java.net/jdk/commit/2f1914424936eebd2478ca9d3100f88abb2d199c
Stats:     112 lines in 5 files changed: 110 ins; 0 del; 2 mod

8282024: add EscapeAnalysis statistics under PrintOptoStatistics

Reviewed-by: xliu, kvn

-------------

PR: https://git.openjdk.java.net/jdk/pull/8019

From duke at openjdk.java.net Wed Jun 1 20:26:38 2022
From: duke at openjdk.java.net (Srinivas Vamsi Parasa)
Date: Wed, 1 Jun 2022 20:26:38 GMT
Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v13]
In-Reply-To: <1OTjJH1S8y5nlBON-Y6zHiLhiNNy_FIxBGtZIaUAsEE=.01a9465b-3da0-406d-a575-c14cc015aeda@github.com>
References: <2j4Kdk5-bqddG3BPO6dUdsM3OmbancqL6CYg3yz3n18=.8483ca07-5f5f-4bd1-8183-88f1feccf183@github.com> <1OTjJH1S8y5nlBON-Y6zHiLhiNNy_FIxBGtZIaUAsEE=.01a9465b-3da0-406d-a575-c14cc015aeda@github.com>
Message-ID:

On Wed, 1 Jun 2022 16:11:37 GMT, Srinivas Vamsi Parasa wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5335:
>>
>>> 5333:
>>> 5334: void C2_MacroAssembler::double_class_check_vfp(int opcode, Register dst, XMMRegister src, KRegister tmp) {
>>> 5335:   uint8_t imm8;
>>
>> May be ok to move it back to the instruction encoding block, only two instructions.
>
> That's true. The two instructions can be put in the instruction encoding block. Will do that.

Also, in the future, if we want to add support for isFinite() and isNaN(), wouldn't it be better to have separate macros?
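
As background for the immediates that appear in the disassembly earlier in this thread (`$0x18`, `$0x81`, `$0x99`), here is a rough Java model of how a `vfpclasssd`-style class mask could express the three predicates. It is a sketch under stated assumptions: the bit assignments (bit 0 QNaN, bit 3 +Inf, bit 4 -Inf, bit 7 SNaN) follow my reading of the instruction's documented imm8 layout, only those four classes are modelled, and the class and method names are invented for this note rather than taken from the patch.

```java
// Illustrative model only; FpClassMaskDemo and its members are hypothetical names.
final class FpClassMaskDemo {
    // Assumed imm8 class bits (only the NaN and infinity classes are modelled here).
    static final int QNAN = 0x01;
    static final int POS_INF = 0x08;
    static final int NEG_INF = 0x10;
    static final int SNAN = 0x80;

    // Returns the NaN/infinity class bit that v falls into, or 0 for any finite value.
    static int specialClassBits(double v) {
        long bits = Double.doubleToRawLongBits(v);
        long exponent = (bits >>> 52) & 0x7FFL;
        long fraction = bits & 0x000FFFFFFFFFFFFFL;
        if (exponent != 0x7FFL) {
            return 0; // finite: zero, denormal, or normal
        }
        if (fraction == 0) {
            return bits < 0 ? NEG_INF : POS_INF;
        }
        // The top fraction bit distinguishes quiet from signaling NaNs.
        return (fraction >>> 51) != 0 ? QNAN : SNAN;
    }

    // True if v belongs to any class selected by the mask.
    static boolean matches(double v, int imm8) {
        return (specialClassBits(v) & imm8) != 0;
    }

    static boolean isInfinite(double v) { return matches(v, 0x18); } // +Inf or -Inf
    static boolean isNaN(double v)      { return matches(v, 0x81); } // QNaN or SNaN
    static boolean isFinite(double v)   { return !matches(v, 0x99); } // neither NaN nor Inf

    public static void main(String[] args) {
        double[] samples = {0.0, -0.0, 1.5, -1e300, Double.MIN_VALUE,
                            Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NaN};
        for (double v : samples) {
            System.out.println(v + ": isInfinite=" + isInfinite(v)
                    + " isNaN=" + isNaN(v) + " isFinite=" + isFinite(v));
        }
    }
}
```

Under this model the `isFinite()` case needs the extra negation of the mask result, which lines up with the `xor $0x1` seen in the intrinsic version of `testIsFiniteCmov` above.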

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From xliu at openjdk.java.net Wed Jun 1 20:35:37 2022
From: xliu at openjdk.java.net (Xin Liu)
Date: Wed, 1 Jun 2022 20:35:37 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v7]
In-Reply-To:
References:
Message-ID:

On Wed, 1 Jun 2022 19:58:51 GMT, Vladimir Kozlov wrote:

>> Xin Liu has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Remove useless flag. if jdwp is on, liveness_at_bci() marks all local
>>   variables live.
>
> src/hotspot/share/opto/compile.cpp line 1877:
>
>> 1875: uint unstable_ifs_all = 0;
>> 1876:
>> 1877: void Compile::preprocess_unstable_ifs() {
>
> I assume it will be used in product VM by next changes 8287385.
> Otherwise counters and method should be under `#ifndef PRODUCT` because they used only for statistics.

Yes, this is for 8287385. I will remove preprocessing logic from this PR.

Because we can determine an unstable_if trap is trivial after parsing. My idea is to do the leftover parsing job in this preprocess function. That's why I think preprocessing should happen before `PhaseRemoveUseless`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8545

From xliu at openjdk.java.net Wed Jun 1 20:45:35 2022
From: xliu at openjdk.java.net (Xin Liu)
Date: Wed, 1 Jun 2022 20:45:35 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v7]
In-Reply-To:
References:
Message-ID:

On Wed, 1 Jun 2022 19:32:04 GMT, Vladimir Kozlov wrote:

>> Xin Liu has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Remove useless flag. if jdwp is on, liveness_at_bci() marks all local
>>   variables live.
>
> test/hotspot/jtreg/compiler/c2/irTests/TestAggressiveLivenessForUnstableIf.java line 2:
>
>> 1: /*
>> 2:  * Copyright Amazon.com Inc. or its affiliates. All Rights Reserved.
>
> Missing year.

I was told (by the company's open-source policy) to write a header like this.
I guess they just want to reduce chore of maintaining years. This also consistent with other code we contributed. ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From jrose at openjdk.java.net Wed Jun 1 20:58:31 2022 From: jrose at openjdk.java.net (John R Rose) Date: Wed, 1 Jun 2022 20:58:31 GMT Subject: RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3] In-Reply-To: References: Message-ID: <-YSBW13QmJqQ2mo-oXxRQWYmUepV6jVrZqXe4mZs7Ew=.77dc391a-fa8f-49a1-9cab-d809efa19730@github.com> On Thu, 26 May 2022 06:18:33 GMT, Fei Gao wrote: >> Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP. Here is an example: >> >> short[] addShort(short[] a, short[] b, short[] c) { >> for (int i = 0; i < SIZE; i++) { >> b[i] = (short) (a[i] + 8); // line A >> sres[i] = (short) (b[i] + c[i]); // line B >> } >> } >> >> However, similar cases of int/float/double/long/char type can be vectorized successfully. >> >> The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type first before it can work as one of source operands of addition on *line B*. The demotion is done by left-shifting 16 bits then right-shifting 16 bits. The ideal graph for the process is showed like below. >> ![image](https://user-images.githubusercontent.com/39403138/160074255-c751f84b-6511-4b56-927b-53fb512cf51b.png) >> >> In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it with int type rather than real short type. 
The int-type operation RShiftI here can't be vectorized together with other short-type operations, like AddI(line B). The reason for byte loop cases is the same. Similar loop cases of char type could be vectorized because its demotion from int to char is done by `and` with mask rather than `lshift_rshift`. >> >> Therefore, we try to remove the patterns like `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing it in the mid-end by i-GVN is more reasonable. >> >> What we do in the mid-end is eliminating the sign extension before some subword integer operations like: >> >> >> int x, y; >> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 >> >> to >> >> short s = (short) (x OP y); >> >> >> In the patch, assuming that `x` can be any int number, we need to guarantee that the optimization doesn't have any impact on the result. Not all arithmetic logic OPs meet the requirements. For example, assuming that `Imm` equals `16`, `x` equals `131068`, >> `y` equals `50` and `OP` is division `/`, `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get a different result with or without demotion before OP, because the upper 16 bits of the division operands may influence the lower 16 bits of the result, so it can't be optimized. All optimizable opcodes are listed in StoreNode::no_need_sign_extension(), whose upper 16 bits of src operands don't influence the lower 16 bits of result for short >> type and upper 24 bits of src operand don't influence the lower 8 bits of dst operand for byte. >> >> After the patch, the short loop case above can be vectorized as: >> >> movi v18.8h, #0x8 >> ...
>> ldr q16, [x14, #32] // vector load a[i] >> // vector add, a[i] + 8, no promotion or demotion >> add v17.8h, v16.8h, v18.8h >> str q17, [x6, #32] // vector store a[i] + 8, b[i] >> ldr q17, [x0, #32] // vector load c[i] >> // vector add, a[i] + c[i], no promotion or demotion >> add v16.8h, v17.8h, v16.8h >> // vector add, a[i] + c[i] + 8, no promotion or demotion >> add v16.8h, v16.8h, v18.8h >> str q16, [x11, #32] //vector store sres[i] >> ... >> >> >> The patch works for byte cases as well. >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~83% improvement with this patch. >> >> on AArch64: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 401.521 ± 0.033 ns/op >> addS 523 avgt 15 401.512 ± 0.021 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 68.444 ± 0.318 ns/op >> addS 523 avgt 15 69.847 ± 0.043 ns/op >> >> on x86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 454.102 ± 36.180 ns/op >> addS 523 avgt 15 432.245 ± 22.640 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 75.812 ± 5.063 ns/op >> addS 523 avgt 15 72.839 ± 10.109 ns/op >> >> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 >> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 >> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 >> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase.
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into fg8282470 > > Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b > - Merge branch 'master' into fg8282470 > > Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7 > - 8282470: Eliminate useless sign extension before some subword integer operations > > Some loop cases of subword types, including byte and > short, can't be vectorized by C2's SLP. Here is an example: > ``` > short[] addShort(short[] a, short[] b, short[] c) { > for (int i = 0; i < SIZE; i++) { > b[i] = (short) (a[i] + 8); // *line A* > sres[i] = (short) (b[i] + c[i]); // *line B* > } > } > ``` > However, similar cases of int/float/double/long/char type can > be vectorized successfully. > > The reason why SLP can't vectorize the short case above is > that, as illustrated here[1], the result of the scalar add > operation on *line A* has been promoted to int type. It needs > to be narrowed to short type first before it can work as one > of source operands of addition on *line B*. The demotion is > done by left-shifting 16 bits then right-shifting 16 bits. > The ideal graph for the process is showed like below. > > LoadS a[i] 8 > \ / > AddI (line A) > / \ > StoreC b[i] Lshift 16bits > \ > RShiftI 16 bits LoadS c[i] > \ / > AddI (line B) > \ > StoreC sres[i] > > In SLP, for most short-type cases, we can determine the precise > type of the scalar int-type operation and finally execute it > with short-type vector operations[2], except rshift opcode and > abs in some situations[3]. But in this case, the source operand > of RShiftI is from LShiftI rather than from any LoadS[4], so we > can't determine its real type and conservatively assign it with > int type rather than real short type. The int-type opearation > RShiftI here can't be vectorized together with other short-type > operations, like AddI(line B). 
The reason for byte loop cases > is the same. Similar loop cases of char type could be > vectorized because its demotion from int to char is done by > `and` with mask rather than `lshift_rshift`. > > Therefore, we try to remove the patterns like > `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short > cases, to vectorize more scenarios. Optimizing it in the > mid-end by i-GVN is more reasonable. > > What we do in the mid-end is eliminating the sign extension > before some subword integer operations like: > > ``` > int x, y; > short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 > ``` > to > ``` > short s = (short) (x OP y); > ``` > > In the patch, assuming that `x` can be any int number, we need > guarantee that the optimization doesn't have any impact on > result. Not all arithmetic logic OPs meet the requirements. For > example, assuming that `Imm` equals `16`, `x` equals `131068`, > `y` equals `50` and `OP` is division`/`, > `short s = (short) (((131068 << 16) >> 16) / 50)` is not > equal to `short s = (short) (131068 / 50)`. When OP is division, > we may get different result with or without demotion > before OP, because the upper 16 bits of division may have > influence on the lower 16 bits of result, which can't be > optimized. All optimizable opcodes are listed in > StoreNode::no_need_sign_extension(), whose upper 16 bits of src > operands don't influence the lower 16 bits of result for short > type and upper 24 bits of src operand don't influence the lower > 8 bits of dst operand for byte. > > After the patch, the short loop case above can be vectorized as: > ``` > movi v18.8h, #0x8 > ... 
> ldr q16, [x14, #32] // vector load a[i] > // vector add, a[i] + 8, no promotion or demotion > add v17.8h, v16.8h, v18.8h > str q17, [x6, #32] // vector store a[i] + 8, b[i] > ldr q17, [x0, #32] // vector load c[i] > // vector add, a[i] + c[i], no promotion or demotion > add v16.8h, v17.8h, v16.8h > // vector add, a[i] + c[i] + 8, no promotion or demotion > add v16.8h, v16.8h, v18.8h > str q16, [x11, #32] //vector store sres[i] > ... > ``` > > The patch works for byte cases as well. > > Here is the performance data for micro-benchmark before > and after this patch on both AArch64 and x64 machines. > We can observe about ~83% improvement with this patch. > > on AArch64: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 401.521 ± 0.033 ns/op > addS 523 avgt 15 401.512 ± 0.021 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 68.444 ± 0.318 ns/op > addS 523 avgt 15 69.847 ± 0.043 ns/op > > on x86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 454.102 ± 36.180 ns/op > addS 523 avgt 15 432.245 ± 22.640 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 75.812 ± 5.063 ns/op > addS 523 avgt 15 72.839 ±
10.109 ns/op > > [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 > [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 > [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 > [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Change-Id: I92ce42b550ef057964a3b58716436735275d8d31 I will believe more in this transformation if it covers a wide range of expressions, not just a few specially-matched patterns `r = (short)(x + (int)y)`. Searching to an unbounded depth in one transform is usually a sign that something is formulated wrong. Often it's possible to break the search up into a series of incremental transformations that, taken together, get to the desired end. In this case, I guess you want to be able to transform `r = (short)XYZ` such that all needless short-to-int conversions inside of `XYZ` are removed. A short-to-int conversion `S2I(x)` is needless if (a) it is an operand of a carryless or leftward-carrying ALU operation `S2I(x) op y`, and (b) all consumers of the operation only care about the low 16 bits of the result. I think that, whether or not this patch is accepted, and certainly before the next patch which handles ternary adds or SubI or MulI in a similar manner is proposed, we should tackle that more general problem. (And not just for 16-bit ints. If I do a mask `(x + y) & 0x7F` or sign extension `(x + y) <<7 >>7` I should be able to benefit from the same kinds of reasoning.)
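The "needless `S2I(x)`" condition above can be checked by hand on the values quoted from the PR description (`x = 131068`, `y = 50`). The following is only a minimal Java sketch, not code from the patch: the class name `NeedlessSignExtension` and the helpers `s2i`, `addAgrees`, `mulAgrees` and `divAgrees` are hypothetical names introduced for illustration.

```java
// Hypothetical illustration; none of these names exist in HotSpot or in the patch.
class NeedlessSignExtension {
    // Sign-extend the low 16 bits of x, i.e. the LShiftI/RShiftI pair discussed above.
    public static int s2i(int x) {
        return (x << 16) >> 16;
    }

    // Addition carries only leftward, so the low 16 bits of the sum
    // do not depend on whether x was sign-extended first.
    public static boolean addAgrees(int x, int y) {
        return (short) (s2i(x) + y) == (short) (x + y);
    }

    // Multiplication also propagates information only toward higher bits.
    public static boolean mulAgrees(int x, int y) {
        return (short) (s2i(x) * y) == (short) (x * y);
    }

    // Division lets the upper bits influence the lower bits of the quotient,
    // so removing the sign extension can change the narrowed result.
    public static boolean divAgrees(int x, int y) {
        return (short) (s2i(x) / y) == (short) (x / y);
    }

    public static void main(String[] args) {
        int x = 131068; // 0x1FFFC, low 16 bits are 0xFFFC, i.e. -4 as a short
        int y = 50;
        System.out.println("s2i(x)     = " + s2i(x));          // -4
        System.out.println("add agrees = " + addAgrees(x, y)); // true
        System.out.println("mul agrees = " + mulAgrees(x, y)); // true
        System.out.println("div agrees = " + divAgrees(x, y)); // false
    }
}
```

For these inputs the sign-extended division yields `-4 / 50 = 0`, while the plain division yields `131068 / 50 = 2621`, which is why division is not among the optimizable opcodes.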
------------- PR: https://git.openjdk.java.net/jdk/pull/7954 From kvn at openjdk.java.net Wed Jun 1 21:04:38 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 1 Jun 2022 21:04:38 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v7] In-Reply-To: References: Message-ID: On Wed, 1 Jun 2022 20:31:34 GMT, Xin Liu wrote: > Yes, this is for 8287385. I will remove preprocessing logic from this PR. Okay. > Because we can determine a unstable_if trap is trivial after parsing. My idea is to do the leftover parsing job in this preprocess function. That's why I think preprocess should before `PhaseRemoveUseless`. Then you should consider my proposal about cleaning the list and updating counters in `Compile::remove_useless_nodes()`. Let's discuss it in the next changes. >> test/hotspot/jtreg/compiler/c2/irTests/TestAggressiveLivenessForUnstableIf.java line 2: >> >>> 1: /* >>> 2: * Copyright Amazon.com Inc. or its affiliates. All Rights Reserved. >> >> Missing year. > > I was told (by the company's open-source policy) to write a header like this. I guess they just want to reduce chore of maintaining years. This also consistent with other code we contributed. Yes, you are right. ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From kvn at openjdk.java.net Wed Jun 1 21:21:29 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 1 Jun 2022 21:21:29 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v13] In-Reply-To: References: <2j4Kdk5-bqddG3BPO6dUdsM3OmbancqL6CYg3yz3n18=.8483ca07-5f5f-4bd1-8183-88f1feccf183@github.com> <1OTjJH1S8y5nlBON-Y6zHiLhiNNy_FIxBGtZIaUAsEE=.01a9465b-3da0-406d-a575-c14cc015aeda@github.com> Message-ID: On Wed, 1 Jun 2022 20:23:11 GMT, Srinivas Vamsi Parasa wrote: >> That's true. The two instructions can be put in the instruction encoding block. Will do that.
> > Also, in the future, if we want to add support for isFinite() and isNaN(), wouldn't it be better to have separate macros? I agree with Jatin's suggestion. "If" we add support in the future, we will separate them. Avoid over-complicating the current code for some future development. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed Jun 1 21:34:39 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 1 Jun 2022 21:34:39 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v13] In-Reply-To: References: <2j4Kdk5-bqddG3BPO6dUdsM3OmbancqL6CYg3yz3n18=.8483ca07-5f5f-4bd1-8183-88f1feccf183@github.com> <1OTjJH1S8y5nlBON-Y6zHiLhiNNy_FIxBGtZIaUAsEE=.01a9465b-3da0-406d-a575-c14cc015aeda@github.com> Message-ID: On Wed, 1 Jun 2022 21:16:59 GMT, Vladimir Kozlov wrote: >> Also, in the future, if we want to add support for isFinite() and isNaN(), wouldn't it be better to have separate macros? > > I agree with Jatin's suggestion. "If" we add support in the future, we will separate them. Avoid over-complicating the current code for some future development. Sure, will implement the suggestion made and update the code. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From sviswanathan at openjdk.java.net Wed Jun 1 22:34:43 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 1 Jun 2022 22:34:43 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: <_c_QPZQIL-ZxBs9TaKmrh7_1WcbEDH1pUwhTpOc6PD8=.75e4a61b-ebb6-491c-9c5b-9a035f0b9eaf@github.com> References: <_c_QPZQIL-ZxBs9TaKmrh7_1WcbEDH1pUwhTpOc6PD8=.75e4a61b-ebb6-491c-9c5b-9a035f0b9eaf@github.com> Message-ID: On Fri, 13 May 2022 08:58:12 GMT, Xiaohong Gong wrote: >> Yes, the tests were run in debug mode.
The reporting of the missing constant occurs for the compiled method that is called from the method where the constants are declared e.g.: >> >> 719 240 b jdk.incubator.vector.Int256Vector::fromArray0 (15 bytes) >> ** Rejected vector op (LoadVectorMasked,int,8) because architecture does not support it >> ** missing constant: offsetInRange=Parm >> @ 11 jdk.incubator.vector.IntVector::fromArray0Template (22 bytes) force inline by annotation >> >> >> So it appears to be working as expected. A similar pattern occurs at a lower-level for the passing of the mask class. `Int256Vector::fromArray0` passes a constant class to `IntVector::fromArray0Template` (the compilation of which bails out before checking that the `offsetInRange` is constant). > > You are right @PaulSandoz ! I ran the tests and benchmarks with your patch, and no failures or performance regressions were found. I will update the patch soon. Thanks for the help! @XiaohongGong Could you please rebase the branch and resolve conflicts? ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From duke at openjdk.java.net Wed Jun 1 23:16:35 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 1 Jun 2022 23:16:35 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v14] In-Reply-To: References: Message-ID: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> > We develop optimized x86 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `isInfinite()` for Float and Double classes. JMH benchmarks show up to ~70% improvement using `vfpclasss(s/d)` instructions.
> > > Benchmark (ns/op) Baseline Intrinsic(vfpclasss/d) Speedup(%) > FloatClassCheck.testIsFinite 0.562 0.406 28% > FloatClassCheck.testIsInfinite 0.815 0.383 53% > FloatClassCheck.testIsNaN 0.63 0.382 39% > DoubleClassCheck.testIsFinite 0.565 0.409 28% > DoubleClassCheck.testIsInfinite 0.812 0.375 54% > DoubleClassCheck.testIsNaN 0.631 0.38 40% > FPComparison.isFiniteDouble 332.638 272.577 18% > FPComparison.isFiniteFloat 413.217 331.825 20% > FPComparison.isInfiniteDouble 874.897 240.632 72% > FPComparison.isInfiniteFloat 872.279 321.269 63% > FPComparison.isNanDouble 286.566 240.36 16% > FPComparison.isNanFloat 346.123 316.923 8% Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: eliminate redundate macros ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8459/files - new: https://git.openjdk.java.net/jdk/pull/8459/files/8244f25d..a1086afd Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=13 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=12-13 Stats: 35 lines in 3 files changed: 0 ins; 31 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/8459.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8459/head:pull/8459 PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed Jun 1 23:16:36 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 1 Jun 2022 23:16:36 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v13] In-Reply-To: References: <2j4Kdk5-bqddG3BPO6dUdsM3OmbancqL6CYg3yz3n18=.8483ca07-5f5f-4bd1-8183-88f1feccf183@github.com> <1OTjJH1S8y5nlBON-Y6zHiLhiNNy_FIxBGtZIaUAsEE=.01a9465b-3da0-406d-a575-c14cc015aeda@github.com> Message-ID: On Wed, 1 Jun 2022 21:30:32 GMT, Srinivas Vamsi Parasa wrote: >> I agree with Jatin's suggestion. "if" we add support in future we will separate them. Avoid over-complicating current code for some future development. 
> > Sure, will implement the suggestion made and update the code. Updated the code with the suggested changes. Thanks Jatin and Vladimir! ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From sviswanathan at openjdk.java.net Wed Jun 1 23:36:02 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 1 Jun 2022 23:36:02 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake Message-ID: We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2. The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%. The performance regression is due to auto-vectorization of small loops. We don't have AVX3Threshold consideration in auto-vectorization. The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors. This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line. Please review.
Best Regards, Sandhya ------------- Commit messages: - x86 build fix - Fix 32-bit build - review comment resolution - Change option name and add checks - Limit auto vectorization to 32 byte vector on Cascade Lake Changes: https://git.openjdk.java.net/jdk/pull/8877/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8877&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287697 Stats: 53 lines in 6 files changed: 45 ins; 0 del; 8 mod Patch: https://git.openjdk.java.net/jdk/pull/8877.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8877/head:pull/8877 PR: https://git.openjdk.java.net/jdk/pull/8877 From kvn at openjdk.java.net Wed Jun 1 23:36:03 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 1 Jun 2022 23:36:03 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake In-Reply-To: References: Message-ID: On Wed, 25 May 2022 01:48:16 GMT, Sandhya Viswanathan wrote: > We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2. > The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%. > The performance regression is due to auto-vectorization of small loops. > We don't have AVX3Threshold consideration in auto-vectorization. > The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors. > > This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line. > > Please review. > > Best Regards, > Sandhya You have trailing white spaces.
src/hotspot/share/opto/vectornode.cpp line 1280: > 1278: (vlen > 1) && is_power_of_2(vlen) && > 1279: Matcher::vector_size_supported(bt, vlen) && > 1280: (vlen * type2aelembytes(bt) <= SuperWordMaxVectorSize)) { Can you put this whole condition into a separate `static bool VectorNode::vector_size_supported(vlen, bt)` and use it in both cases? ------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From jbhateja at openjdk.java.net Wed Jun 1 23:36:03 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 1 Jun 2022 23:36:03 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake In-Reply-To: References: Message-ID: <7QY-uFABXZhtfKn103X1lZU8sJT3KayRAbeyQ21xfK4=.cadebd12-a765-4371-abad-3549050bfb2c@github.com> On Wed, 25 May 2022 01:48:16 GMT, Sandhya Viswanathan wrote: > We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2. > The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%. > The performance regression is due to auto-vectorization of small loops. > We don't have AVX3Threshold consideration in auto-vectorization. > The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors. > > This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line. > > Please review. > > Best Regards, > Sandhya Vectorization through SLP can be controlled by constraining MaxVectorSize, and through the Vector API by using a narrower SPECIES. Can you kindly share more details on the need for a separate SuperWordMaxVectorSize here? The user already has all the necessary controls to limit the C2 vector length; it will rarely happen that one wants to emit 512-bit vector code using the Vector API and still limit the auto-vectorizer to infer 256-bit vector operations, and vice-versa.
Maybe we should pessimistically just constrain the vector size of those loops which may result in AVX512-heavy instructions, through a target-specific analysis pass. ------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From sviswanathan at openjdk.java.net Wed Jun 1 23:39:13 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 1 Jun 2022 23:39:13 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake In-Reply-To: References: Message-ID: <37EIgoOQtTvySNmd1Q6hDs7JZku4UP2DAioShKwmPKs=.390e011b-3fda-48d7-bfe8-b9d260ca0822@github.com> On Fri, 27 May 2022 04:05:47 GMT, Vladimir Kozlov wrote: >> We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2. >> The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%. >> The performance regression is due to auto-vectorization of small loops. >> We don't have AVX3Threshold consideration in auto-vectorization. >> The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors. >> >> This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line. >> >> Please review. >> >> Best Regards, >> Sandhya > > You have trailing white spaces. @vnkozlov Your review comments are resolved. @jatin-bhateja This is a simple fix for the problem in the short time frame that we have for the upcoming feature freeze. A more complex fix to enhance auto-vectorizer is a good thought.
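For readers following the size limit being discussed, the byte-size part of the condition quoted from `vectornode.cpp` (`vlen * type2aelembytes(bt) <= SuperWordMaxVectorSize`) can be sketched as below. This is an assumption-laden Java illustration, not HotSpot code: `VectorSizeCheck`, `isPowerOfTwo` and `vectorSizeSupported` are hypothetical names, `maxVectorBytes` stands in for `SuperWordMaxVectorSize`, and the platform query `Matcher::vector_size_supported` is deliberately left out.

```java
// Hypothetical illustration; the real check lives in C++ inside HotSpot.
class VectorSizeCheck {
    public static boolean isPowerOfTwo(int n) {
        return n > 0 && (n & (n - 1)) == 0;
    }

    // vlen: number of lanes; elemBytes: size of one element in bytes;
    // maxVectorBytes: assumed stand-in for SuperWordMaxVectorSize.
    public static boolean vectorSizeSupported(int vlen, int elemBytes, int maxVectorBytes) {
        return vlen > 1
            && isPowerOfTwo(vlen)
            && vlen * elemBytes <= maxVectorBytes;
    }

    public static void main(String[] args) {
        // With a 32-byte limit, 8 ints (32 bytes) pass but 16 ints (64 bytes) do not.
        System.out.println(vectorSizeSupported(8, 4, 32));  // true
        System.out.println(vectorSizeSupported(16, 4, 32)); // false
        // Raising the limit to 64 bytes admits the 16-lane int vector again.
        System.out.println(vectorSizeSupported(16, 4, 64)); // true
    }
}
```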
------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From sviswanathan at openjdk.java.net Wed Jun 1 23:57:26 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 1 Jun 2022 23:57:26 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v2] In-Reply-To: References: Message-ID: <54Cx68cjFE-RfvwVJB92DhENPyRIwzhi3jfyG5ZGPSg=.563519e8-7880-4754-933c-78d66affabef@github.com> > We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2. > The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%. > The performance regression is due to auto-vectorization of small loops. > We don?t have AVX3Threshold consideration in auto-vectorization. > The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors. > > This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line. > > Please review. > > Best Regard, > Sandhya Sandhya Viswanathan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision: - Merge branch 'master' into maxvector - x86 build fix - Fix 32-bit build - review comment resolution - Change option name and add checks - Limit auto vectorization to 32 byte vector on Cascade Lake ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8877/files - new: https://git.openjdk.java.net/jdk/pull/8877/files/d677fd9a..7f4c41e2 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8877&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8877&range=00-01 Stats: 74281 lines in 805 files changed: 25008 ins; 42847 del; 6426 mod Patch: https://git.openjdk.java.net/jdk/pull/8877.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8877/head:pull/8877 PR: https://git.openjdk.java.net/jdk/pull/8877 From kvn at openjdk.java.net Thu Jun 2 00:37:39 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 00:37:39 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v14] In-Reply-To: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> References: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> Message-ID: On Wed, 1 Jun 2022 23:16:35 GMT, Srinivas Vamsi Parasa wrote: >> We develop optimized x86 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `IsInfinite()` for Float and Double classes. JMH benchmarks show upto `~70% `improvement using` vfpclasss(s/d)` instructions. 
>> >> >> Benchmark (ns/op) Baseline Intrinsic(vfpclasss/d) Speedup(%) >> FloatClassCheck.testIsFinite 0.562 0.406 28% >> FloatClassCheck.testIsInfinite 0.815 0.383 53% >> FloatClassCheck.testIsNaN 0.63 0.382 39% >> DoubleClassCheck.testIsFinite 0.565 0.409 28% >> DoubleClassCheck.testIsInfinite 0.812 0.375 54% >> DoubleClassCheck.testIsNaN 0.631 0.38 40% >> FPComparison.isFiniteDouble 332.638 272.577 18% >> FPComparison.isFiniteFloat 413.217 331.825 20% >> FPComparison.isInfiniteDouble 874.897 240.632 72% >> FPComparison.isInfiniteFloat 872.279 321.269 63% >> FPComparison.isNanDouble 286.566 240.36 16% >> FPComparison.isNanFloat 346.123 316.923 8% > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > eliminate redundate macros Looks good. Let me test it before you push. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8459 From sviswanathan at openjdk.java.net Thu Jun 2 00:53:21 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 2 Jun 2022 00:53:21 GMT Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 In-Reply-To: <7vFZ9ccGv7dGFqSzNw-3OA5SOML3kKH_wIyd-xOPSzE=.910106b9-e0b3-4b74-ab60-4b5bd74d5427@github.com> References: <7vFZ9ccGv7dGFqSzNw-3OA5SOML3kKH_wIyd-xOPSzE=.910106b9-e0b3-4b74-ab60-4b5bd74d5427@github.com> Message-ID: On Wed, 1 Jun 2022 09:20:23 GMT, Christian Hagedorn wrote: >> Shall we make a jtreg test for this fix? >> Thanks. > >> Shall we make a jtreg test for this fix? Thanks. > > That would be helpful. @sviswa7 I've attached a simpler reproducer to the JBS bug extracted from the full fuzzer test. @chhagedorn Thanks a lot. I will look into creating a jtreg based on the simpler reproducer. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8961 From kvn at openjdk.java.net Thu Jun 2 01:20:28 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 01:20:28 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v2] In-Reply-To: <54Cx68cjFE-RfvwVJB92DhENPyRIwzhi3jfyG5ZGPSg=.563519e8-7880-4754-933c-78d66affabef@github.com> References: <54Cx68cjFE-RfvwVJB92DhENPyRIwzhi3jfyG5ZGPSg=.563519e8-7880-4754-933c-78d66affabef@github.com> Message-ID: On Wed, 1 Jun 2022 23:57:26 GMT, Sandhya Viswanathan wrote: >> We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2. >> The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%. >> The performance regression is due to auto-vectorization of small loops. >> We don't have AVX3Threshold consideration in auto-vectorization. >> The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors. >> >> This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line. >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Merge branch 'master' into maxvector > - x86 build fix > - Fix 32-bit build > - review comment resolution > - Change option name and add checks > - Limit auto vectorization to 32 byte vector on Cascade Lake I think we missed the test with setting `MaxVectorSize` to 32 (vs 64) on Cascade Lake CPU. We should do that. That may be a preferable "simple fix" vs. the suggested changes for a "short term solution".
The objection was that user may still want to use wide 64 bytes vectors for Vector API. But I agree with Jatin argument about that. Limiting `MaxVectorSize` **will** affect our intrinsics/stubs code and may affect performance. That is why we need to test it. I will ask Eric. BTW, `SuperWordMaxVectorSize` should be diagnostic or experimental since it is temporary solution. ------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From duke at openjdk.java.net Thu Jun 2 01:24:36 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Thu, 2 Jun 2022 01:24:36 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v14] In-Reply-To: References: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> Message-ID: <_usq-_4l4l8j99gWAR6_oKwk_lt-ORd697lQXzYA1P4=.b5daa114-1616-4c3c-a8b3-cc2add76c0c1@github.com> On Thu, 2 Jun 2022 00:35:35 GMT, Vladimir Kozlov wrote: > Looks good. Let me test it before you push. Thanks Vladimir! Will wait until the tests pass... ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From kvn at openjdk.java.net Thu Jun 2 01:33:39 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 01:33:39 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v14] In-Reply-To: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> References: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> Message-ID: On Wed, 1 Jun 2022 23:16:35 GMT, Srinivas Vamsi Parasa wrote: >> We develop optimized x86 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `IsInfinite()` for Float and Double classes. JMH benchmarks show upto `~70% `improvement using` vfpclasss(s/d)` instructions. 
>>
>>
>> Benchmark (ns/op)                Baseline  Intrinsic(vfpclasss/d)  Speedup(%)
>> FloatClassCheck.testIsFinite        0.562                   0.406         28%
>> FloatClassCheck.testIsInfinite      0.815                   0.383         53%
>> FloatClassCheck.testIsNaN           0.63                    0.382         39%
>> DoubleClassCheck.testIsFinite       0.565                   0.409         28%
>> DoubleClassCheck.testIsInfinite     0.812                   0.375         54%
>> DoubleClassCheck.testIsNaN          0.631                   0.38          40%
>> FPComparison.isFiniteDouble       332.638                 272.577         18%
>> FPComparison.isFiniteFloat        413.217                 331.825         20%
>> FPComparison.isInfiniteDouble     874.897                 240.632         72%
>> FPComparison.isInfiniteFloat      872.279                 321.269         63%
>> FPComparison.isNanDouble          286.566                 240.36          16%
>> FPComparison.isNanFloat           346.123                 316.923          8%
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   eliminate redundate macros

compiler/c2/irTests/TestScheduleSmallMethod.java failed in Tier1 (case #1 is run with "-XX:-OptoScheduling"):

    compiler.lib.ir_framework.shared.TestRunException: The following scenarios have failed: #1. Please check stderr for more information.
        at compiler.lib.ir_framework.TestFramework.reportScenarioFailures(TestFramework.java:617)
        at compiler.lib.ir_framework.TestFramework.startWithScenarios(TestFramework.java:578)
        at compiler.lib.ir_framework.TestFramework.start(TestFramework.java:335)
        at compiler.c2.irTests.TestScheduleSmallMethod.main(TestScheduleSmallMethod.java:44)

The test uses Double arithmetic.
------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Thu Jun 2 01:36:34 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Thu, 2 Jun 2022 01:36:34 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v14] In-Reply-To: References: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> Message-ID: On Thu, 2 Jun 2022 01:31:34 GMT, Vladimir Kozlov wrote: > compiler/c2/irTests/TestScheduleSmallMethod.java failed in Tier1 (case #1 is run with "-XX:-OptoScheduling"): > > ``` > compiler.lib.ir_framework.shared.TestRunException: The following scenarios have failed: #1. Please check stderr for more information. > at compiler.lib.ir_framework.TestFramework.reportScenarioFailures(TestFramework.java:617) > at compiler.lib.ir_framework.TestFramework.startWithScenarios(TestFramework.java:578) > at compiler.lib.ir_framework.TestFramework.start(TestFramework.java:335) > at compiler.c2.irTests.TestScheduleSmallMethod.main(TestScheduleSmallMethod.java:44) > ``` > > The test use Doube arithmetic. Will look into it and debug it. Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From kvn at openjdk.java.net Thu Jun 2 01:43:38 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 01:43:38 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v14] In-Reply-To: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> References: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> Message-ID: On Wed, 1 Jun 2022 23:16:35 GMT, Srinivas Vamsi Parasa wrote: >> We develop optimized x86 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `IsInfinite()` for Float and Double classes. JMH benchmarks show upto `~70% `improvement using` vfpclasss(s/d)` instructions. 
>>
>>
>> Benchmark (ns/op)                Baseline  Intrinsic(vfpclasss/d)  Speedup(%)
>> FloatClassCheck.testIsFinite        0.562                   0.406         28%
>> FloatClassCheck.testIsInfinite      0.815                   0.383         53%
>> FloatClassCheck.testIsNaN           0.63                    0.382         39%
>> DoubleClassCheck.testIsFinite       0.565                   0.409         28%
>> DoubleClassCheck.testIsInfinite     0.812                   0.375         54%
>> DoubleClassCheck.testIsNaN          0.631                   0.38          40%
>> FPComparison.isFiniteDouble       332.638                 272.577         18%
>> FPComparison.isFiniteFloat        413.217                 331.825         20%
>> FPComparison.isInfiniteDouble     874.897                 240.632         72%
>> FPComparison.isInfiniteFloat      872.279                 321.269         63%
>> FPComparison.isNanDouble          286.566                 240.36          16%
>> FPComparison.isNanFloat           346.123                 316.923          8%
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   eliminate redundate macros

On other hand it failed on Windows version we have issue with recently. Several IR framework tests failed with "Did not find IR encoding": [8286979](https://bugs.openjdk.java.net/browse/JDK-8286979)

Also it run on AMD CPU which does not have EVEX and AVX512. Your code should not be executed.

False alarm it seems. I will run other tiers.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From xliu at openjdk.java.net Thu Jun 2 01:46:32 2022
From: xliu at openjdk.java.net (Xin Liu)
Date: Thu, 2 Jun 2022 01:46:32 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v8]
In-Reply-To:
References:
Message-ID:

> I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps.
>
> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal.
I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement.
>
> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137.
>
> Before:
>
> Benchmark               Mode  Cnt   Score   Error  Units
> MyBenchmark.testMethod  avgt   10  32.776 ± 0.075  us/op
>
> After:
>
> Benchmark               Mode  Cnt   Score   Error  Units
> MyBenchmark.testMethod  avgt   10   3.656 ± 0.006  us/op
>
> Testing
> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet.

Xin Liu has updated the pull request incrementally with one additional commit since the last revision:

  Refactor per reviewer's feedback.
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8545/files - new: https://git.openjdk.java.net/jdk/pull/8545/files/6e9f2670..d7e5f062 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=07 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=06-07 Stats: 64 lines in 4 files changed: 26 ins; 11 del; 27 mod Patch: https://git.openjdk.java.net/jdk/pull/8545.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8545/head:pull/8545 PR: https://git.openjdk.java.net/jdk/pull/8545 From duke at openjdk.java.net Thu Jun 2 01:47:27 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Thu, 2 Jun 2022 01:47:27 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v14] In-Reply-To: References: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> Message-ID: <9Lg8k5G4JpALhsnH_17dlZFOkg_ZJmNqZMlrUJOvVvE=.8340706a-71b0-41fc-9a1c-be8de5209e44@github.com> On Thu, 2 Jun 2022 01:40:26 GMT, Vladimir Kozlov wrote: > On other hand it failed on Windows version we have issue with recently. Several IR framework tests failed with "Did not find IR encoding": [8286979](https://bugs.openjdk.java.net/browse/JDK-8286979) > > Also it run on AMD CPU which does not have EVEX and AVX512. Your code should not be executed. > > False alarm it seems. I will run other tiers. Will update the tests not to run on AMD CPUs which don't EVEX and AVX512. Thanks! 
------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From xgong at openjdk.java.net Thu Jun 2 01:52:35 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 2 Jun 2022 01:52:35 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: <_c_QPZQIL-ZxBs9TaKmrh7_1WcbEDH1pUwhTpOc6PD8=.75e4a61b-ebb6-491c-9c5b-9a035f0b9eaf@github.com> References: <_c_QPZQIL-ZxBs9TaKmrh7_1WcbEDH1pUwhTpOc6PD8=.75e4a61b-ebb6-491c-9c5b-9a035f0b9eaf@github.com> Message-ID: On Fri, 13 May 2022 08:58:12 GMT, Xiaohong Gong wrote: >> Yes, the tests were run in debug mode. The reporting of the missing constant occurs for the compiled method that is called from the method where the constants are declared e.g.: >> >> 719 240 b jdk.incubator.vector.Int256Vector::fromArray0 (15 bytes) >> ** Rejected vector op (LoadVectorMasked,int,8) because architecture does not support it >> ** missing constant: offsetInRange=Parm >> @ 11 jdk.incubator.vector.IntVector::fromArray0Template (22 bytes) force inline by annotation >> >> >> So it appears to be working as expected. A similar pattern occurs at a lower-level for the passing of the mask class. `Int256Vector::fromArray0` passes a constant class to `IntVector::fromArray0Template` (the compilation of which bails out before checking that the `offsetInRange` is constant). > > You are right @PaulSandoz ! I ran the tests and benchmarks with your patch, and no failure and performance regression are found. I will update the patch soon. Thanks for the help! > @XiaohongGong Could you please rebase the branch and resolve conflicts? Sure, I'm working on this now. The patch will be updated soon. Thanks. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From kvn at openjdk.java.net Thu Jun 2 01:52:41 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 01:52:41 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v14] In-Reply-To: <9Lg8k5G4JpALhsnH_17dlZFOkg_ZJmNqZMlrUJOvVvE=.8340706a-71b0-41fc-9a1c-be8de5209e44@github.com> References: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> <9Lg8k5G4JpALhsnH_17dlZFOkg_ZJmNqZMlrUJOvVvE=.8340706a-71b0-41fc-9a1c-be8de5209e44@github.com> Message-ID: On Thu, 2 Jun 2022 01:43:42 GMT, Srinivas Vamsi Parasa wrote: > > On other hand it failed on Windows version we have issue with recently. Several IR framework tests failed with "Did not find IR encoding": [8286979](https://bugs.openjdk.java.net/browse/JDK-8286979) > > Also it run on AMD CPU which does not have EVEX and AVX512. Your code should not be executed. > > False alarm it seems. I will run other tiers. > > Will update the tests not to run on AMD CPUs which don't EVEX and AVX512. Thanks! I don't think you need update any tests. Your new tests already have: `@requires vm.cpu.features ~= ".*avx512dq.*"` ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From xgong at openjdk.java.net Thu Jun 2 01:53:49 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 2 Jun 2022 01:53:49 GMT Subject: RFR: 8287028: AArch64: [vectorapi] Backend implementation of VectorMask.fromLong with SVE2 In-Reply-To: <9f4FuUVXKxeO6tC6so96ydn3nss81T7s0KvV03XlnCc=.75152f52-5b9f-4a84-bd36-0547899fa061@github.com> References: <9f4FuUVXKxeO6tC6so96ydn3nss81T7s0KvV03XlnCc=.75152f52-5b9f-4a84-bd36-0547899fa061@github.com> Message-ID: On Thu, 19 May 2022 14:08:05 GMT, Eric Liu wrote: > This patch implements AArch64 codegen for VectorLongToMask using the > SVE2 BitPerm feature. 
With this patch, the final code (generated on an
> SVE vector reg size of 512-bit QEMU emulator) is shown as below:
>
>         mov     z17.b, #0
>         mov     v17.d[0], x13
>         sunpklo z17.h, z17.b
>         sunpklo z17.s, z17.h
>         sunpklo z17.d, z17.s
>         mov     z16.b, #1
>         bdep    z17.d, z17.d, z16.d
>         cmpne   p0.b, p7/z, z17.b, #0

@theRealELiu , could you please rebase this patch and resolve the conflict? Thanks!

-------------

PR: https://git.openjdk.java.net/jdk/pull/8789

From xgong at openjdk.java.net Thu Jun 2 03:27:59 2022
From: xgong at openjdk.java.net (Xiaohong Gong)
Date: Thu, 2 Jun 2022 03:27:59 GMT
Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]
In-Reply-To:
References:
Message-ID:

> Currently the vector load with mask when the given index happens out of the array boundary is implemented with pure java scalar code to avoid the IOOBE (IndexOutOfBoundsException). This is necessary for architectures that do not support the predicate feature. Because the masked load is implemented with a full vector load and a vector blend applied on it. And a full vector load will definitely cause the IOOBE which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise exception.
>
> This patch adds the vectorization support for the masked load with IOOBE part.
Please see the original java implementation (FIXME: optimize):
>
>
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies<Byte> species,
>                        byte[] a, int offset,
>                        VectorMask<Byte> m) {
>       ByteSpecies vsp = (ByteSpecies) species;
>       if (offset >= 0 && offset <= (a.length - species.length())) {
>           return vsp.dummyVector().fromArray0(a, offset, m);
>       }
>
>       // FIXME: optimize
>       checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>       return vsp.vOp(m, i -> a[offset + i]);
>   }
>
> Since it can only be vectorized with the predicate load, the hotspot must check whether the current backend supports it and falls back to the java scalar version if not. This is different from the normal masked vector load that the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler make the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part while "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that normal java path will be executed.
>
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
>
> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` on the x86 AVX-512 system:
>
> Benchmark                                          before    After     Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE    737.542   1387.069  ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE  118.366   330.776   ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE   233.832   6125.026  ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE     233.816   7075.923  ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE    119.771   330.587   ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE   431.961   939.301   ops/ms
>
> Similar performance gain can also be observed on 512-bit SVE system.
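
The semantics that the scalar fallback has to preserve can be modelled in plain Java. The sketch below is illustrative only and is not taken from the patch: `MaskedLoadSketch`, `maskedLoad` and the `boolean[]` mask are hypothetical stand-ins for the Vector API types. It assumes that lanes whose mask bit is unset are never read and are left as zero, and that only lanes whose mask bit is set are bounds-checked.

```java
import java.util.Arrays;

// Hypothetical scalar model of a masked load; not the Vector API implementation.
class MaskedLoadSketch {

    // Loads mask.length lanes starting at a[offset].
    // A lane is read only when its mask bit is set, so an out-of-bounds
    // index is an error only for set lanes; unset lanes stay zero.
    static byte[] maskedLoad(byte[] a, int offset, boolean[] mask) {
        byte[] result = new byte[mask.length];
        for (int i = 0; i < mask.length; i++) {
            if (!mask[i]) {
                continue; // unset lane: no memory access, result lane stays 0
            }
            int idx = offset + i;
            if (idx < 0 || idx >= a.length) {
                throw new IndexOutOfBoundsException("lane " + i + " reads index " + idx);
            }
            result[i] = a[idx];
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4, 5, 6};
        // Four lanes at offset 4: lanes 2 and 3 would fall outside the array,
        // but they are masked off, so the load is legal.
        boolean[] tailMask = {true, true, false, false};
        System.out.println(Arrays.toString(maskedLoad(a, 4, tailMask))); // [5, 6, 0, 0]
    }
}
```

A full-width load followed by a blend would touch indices 6 and 7 in this example, which is why a backend without predicated loads cannot use it for this path.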
Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: - Merge branch 'jdk:master' into JDK-8283667 - Use integer constant for offsetInRange all the way through - Rename "use_predicate" to "needs_predicate" - Rename the "usePred" to "offsetInRange" - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature ------------- Changes: https://git.openjdk.java.net/jdk/pull/8035/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=04 Stats: 447 lines in 43 files changed: 168 ins; 21 del; 258 mod Patch: https://git.openjdk.java.net/jdk/pull/8035.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035 PR: https://git.openjdk.java.net/jdk/pull/8035 From xgong at openjdk.java.net Thu Jun 2 03:28:00 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 2 Jun 2022 03:28:00 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: <_c_QPZQIL-ZxBs9TaKmrh7_1WcbEDH1pUwhTpOc6PD8=.75e4a61b-ebb6-491c-9c5b-9a035f0b9eaf@github.com> Message-ID: <7BACvqeUZFJbVq36mElnVBWg2vXyN6kVUXYNKvJ7cuA=.a04e6924-006b-43f3-adec-97132d5a719d@github.com> On Thu, 2 Jun 2022 01:49:10 GMT, Xiaohong Gong wrote: > > @XiaohongGong Could you please rebase the branch and resolve conflicts? > > Sure, I'm working on this now. The patch will be updated soon. Thanks. Resolved the conflicts. Thanks! 
-------------

PR: https://git.openjdk.java.net/jdk/pull/8035

From duke at openjdk.java.net Thu Jun 2 03:58:46 2022
From: duke at openjdk.java.net (Srinivas Vamsi Parasa)
Date: Thu, 2 Jun 2022 03:58:46 GMT
Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v14]
In-Reply-To:
References: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> <9Lg8k5G4JpALhsnH_17dlZFOkg_ZJmNqZMlrUJOvVvE=.8340706a-71b0-41fc-9a1c-be8de5209e44@github.com>
Message-ID:

On Thu, 2 Jun 2022 01:49:08 GMT, Vladimir Kozlov wrote:

> > > On other hand it failed on Windows version we have issue with recently. Several IR framework tests failed with "Did not find IR encoding": [8286979](https://bugs.openjdk.java.net/browse/JDK-8286979)
> > > Also it run on AMD CPU which does not have EVEX and AVX512. Your code should not be executed.
> > > False alarm it seems. I will run other tiers.
> >
> >
> > Will update the tests not to run on AMD CPUs which don't EVEX and AVX512. Thanks!
>
> I don't think you need update any tests. Your new tests already have: `@requires vm.cpu.features ~= ".*avx512dq.*"`

Got it!

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From sviswanathan at openjdk.java.net Thu Jun 2 04:34:27 2022
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Thu, 2 Jun 2022 04:34:27 GMT
Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v3]
In-Reply-To:
References:
Message-ID:

> We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2.
> The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%.
> The performance regression is due to auto-vectorization of small loops.
> We don't have AVX3Threshold consideration in auto-vectorization.
> The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors.
> > This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line. > > Please review. > > Best Regard, > Sandhya Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: Change SuperWordMaxVectorSize to develop option ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8877/files - new: https://git.openjdk.java.net/jdk/pull/8877/files/7f4c41e2..e8ea837a Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8877&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8877&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8877.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8877/head:pull/8877 PR: https://git.openjdk.java.net/jdk/pull/8877 From sviswanathan at openjdk.java.net Thu Jun 2 04:41:35 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 2 Jun 2022 04:41:35 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v2] In-Reply-To: References: <54Cx68cjFE-RfvwVJB92DhENPyRIwzhi3jfyG5ZGPSg=.563519e8-7880-4754-933c-78d66affabef@github.com> Message-ID: <2ZEEJJQuJDrG1UuL6IOMr5nvCm1DCs2PLPp4y0Dpqag=.0d3588d9-6f49-43c6-bfbf-62cdb239450f@github.com> On Thu, 2 Jun 2022 01:16:33 GMT, Vladimir Kozlov wrote: >> Sandhya Viswanathan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision:
>>
>> - Merge branch 'master' into maxvector
>> - x86 build fix
>> - Fix 32-bit build
>> - review comment resolution
>> - Change option name and add checks
>> - Limit auto vectorization to 32 byte vector on Cascade Lake
>
> I think we missed the test with setting `MaxVectorSize` to 32 (vs 64) on Cascade Lake CPU. We should do that.
>
> That may be preferable "simple fix" vs suggested changes for "short term solution".
>
> The objection was that user may still want to use wide 64 bytes vectors for Vector API. But I agree with Jatin argument about that.
> Limiting `MaxVectorSize` **will** affect our intrinsics/stubs code and may affect performance. That is why we need to test it. I will ask Eric.
>
> BTW, `SuperWordMaxVectorSize` should be diagnostic or experimental since it is temporary solution.

@vnkozlov I have made SuperWordMaxVectorSize as a develop option as you suggested.
As far as I know, the only intrinsics/stubs that uses MaxVectorSize are for clear/copy.
This is done in conjunction with AVX3Threshold so we are ok there for Cascade Lake.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8877

From kvn at openjdk.java.net Thu Jun 2 05:28:34 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Thu, 2 Jun 2022 05:28:34 GMT
Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v3]
In-Reply-To:
References:
Message-ID:

On Thu, 2 Jun 2022 04:34:27 GMT, Sandhya Viswanathan wrote:

>> We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2.
>> The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%.
>> The performance regression is due to auto-vectorization of small loops.
>> We don't have AVX3Threshold consideration in auto-vectorization.
>> The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors.
>> >> This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line. >> >> Please review. >> >> Best Regard, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Change SuperWordMaxVectorSize to develop option Changes look good. I will start testing it. ------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From kvn at openjdk.java.net Thu Jun 2 05:28:39 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 05:28:39 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v2] In-Reply-To: References: <54Cx68cjFE-RfvwVJB92DhENPyRIwzhi3jfyG5ZGPSg=.563519e8-7880-4754-933c-78d66affabef@github.com> Message-ID: On Thu, 2 Jun 2022 01:16:33 GMT, Vladimir Kozlov wrote: >> Sandhya Viswanathan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: >> >> - Merge branch 'master' into maxvector >> - x86 build fix >> - Fix 32-bit build >> - review comment resolution >> - Change option name and add checks >> - Limit auto vectorization to 32 byte vector on Cascade Lake > > I think we missed the test with setting `MaxVectorSize` to 32 (vs 64) on Cascade Lake CPU. We should do that. > > That may be preferable "simple fix" vs suggested changes for "short term solution". > > The objection was that user may still want to use wide 64 bytes vectors for Vector API. But I agree with Jatin argument about that. > Limiting `MaxVectorSize` **will** affect our intrinsics/stubs code and may affect performance. That is why we need to test it. I will ask Eric. 
> > BTW, `SuperWordMaxVectorSize` should be diagnostic or experimental since it is temporary solution. > @vnkozlov I have made SuperWordMaxVectorSize as a develop option as you suggested. As far as I know, the only intrinsics/stubs that uses MaxVectorSize are for clear/copy. This is done in conjunction with AVX3Threshold so we are ok there for Cascade Lake. Thank you for checking stubs code. We still have to run performance testing with this patch. We need only additional run with `MaxVectorSize=32` to compare results. And I want @jatin-bhateja to approve this change too. Or give better suggestion. ------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From kvn at openjdk.java.net Thu Jun 2 05:35:40 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 05:35:40 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v14] In-Reply-To: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> References: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> Message-ID: On Wed, 1 Jun 2022 23:16:35 GMT, Srinivas Vamsi Parasa wrote: >> We develop optimized x86 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `IsInfinite()` for Float and Double classes. JMH benchmarks show upto `~70% `improvement using` vfpclasss(s/d)` instructions. 
>>
>>
>> Benchmark (ns/op)                Baseline  Intrinsic(vfpclasss/d)  Speedup(%)
>> FloatClassCheck.testIsFinite        0.562                   0.406         28%
>> FloatClassCheck.testIsInfinite      0.815                   0.383         53%
>> FloatClassCheck.testIsNaN           0.63                    0.382         39%
>> DoubleClassCheck.testIsFinite       0.565                   0.409         28%
>> DoubleClassCheck.testIsInfinite     0.812                   0.375         54%
>> DoubleClassCheck.testIsNaN          0.631                   0.38          40%
>> FPComparison.isFiniteDouble       332.638                 272.577         18%
>> FPComparison.isFiniteFloat        413.217                 331.825         20%
>> FPComparison.isInfiniteDouble     874.897                 240.632         72%
>> FPComparison.isInfiniteFloat      872.279                 321.269         63%
>> FPComparison.isNanDouble          286.566                 240.36          16%
>> FPComparison.isNanFloat           346.123                 316.923          8%
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   eliminate redundate macros

Testing results are good.

You need second review approval.

-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8459

From jbhateja at openjdk.java.net Thu Jun 2 05:42:36 2022
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Thu, 2 Jun 2022 05:42:36 GMT
Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v3]
In-Reply-To:
References:
Message-ID:

On Thu, 2 Jun 2022 04:34:27 GMT, Sandhya Viswanathan wrote:

>> We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2.
>> The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%.
>> The performance regression is due to auto-vectorization of small loops.
>> We don't have AVX3Threshold consideration in auto-vectorization.
>> The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors.
>>
>> This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line.
>>
>> Please review.
>> >> Best Regard, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Change SuperWordMaxVectorSize to develop option src/hotspot/cpu/x86/vm_version_x86.cpp line 1306: > 1304: // Limit auto vectorization to 256 bit (32 byte) by default on Cascade Lake > 1305: FLAG_SET_DEFAULT(SuperWordMaxVectorSize, 32); > 1306: } else { SuperWordMaxVectorSize is set to 32 bytes by default, it should still be capped by MaxVectorSize, in case user sets MaxVectorSize to 16 bytes. ------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From duke at openjdk.java.net Thu Jun 2 05:42:37 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Thu, 2 Jun 2022 05:42:37 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v14] In-Reply-To: References: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> Message-ID: On Thu, 2 Jun 2022 05:33:29 GMT, Vladimir Kozlov wrote: > Testing results are good. > > You need second review approval. Thanks Vladimir! Will ask for second review/approval. For the github pre-submit tests, on Windows x86 (Tier 1), HighlightUITest.java is failing. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From jbhateja at openjdk.java.net Thu Jun 2 05:50:31 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 2 Jun 2022 05:50:31 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v2] In-Reply-To: References: <54Cx68cjFE-RfvwVJB92DhENPyRIwzhi3jfyG5ZGPSg=.563519e8-7880-4754-933c-78d66affabef@github.com> Message-ID: On Thu, 2 Jun 2022 05:24:51 GMT, Vladimir Kozlov wrote: >> I think we missed the test with setting `MaxVectorSize` to 32 (vs 64) on Cascade Lake CPU. We should do that. >> >> That may be preferable "simple fix" vs suggested changes for "short term solution". 
>> >> The objection was that user may still want to use wide 64 bytes vectors for Vector API. But I agree with Jatin argument about that. >> Limiting `MaxVectorSize` **will** affect our intrinsics/stubs code and may affect performance. That is why we need to test it. I will ask Eric. >> >> BTW, `SuperWordMaxVectorSize` should be diagnostic or experimental since it is temporary solution. > >> @vnkozlov I have made SuperWordMaxVectorSize as a develop option as you suggested. As far as I know, the only intrinsics/stubs that uses MaxVectorSize are for clear/copy. This is done in conjunction with AVX3Threshold so we are ok there for Cascade Lake. > > Thank you for checking stubs code. > > We still have to run performance testing with this patch. We need only additional run with `MaxVectorSize=32` to compare results. > > And I want @jatin-bhateja to approve this change too. Or give better suggestion. > @vnkozlov Your review comments are resolved. @jatin-bhateja This is a simple fix for the problem in the short time frame that we have for the upcoming feature freeze. A more complex fix to enhance auto-vectorizer is a good thought. Hi @sviswa7 . This looks reasonable since stubs and some macro assembly routines anyways operate under thresholds and does not strictly comply with max vector size. ------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From kvn at openjdk.java.net Thu Jun 2 05:50:35 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 05:50:35 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v3] In-Reply-To: References: Message-ID: <6_2nfX-6yq4hxYp5OTnmmfO0WqSTJQcilG4N7tVN58o=.e5287dfa-2d07-4437-807e-71fd8fc6cffd@github.com> On Thu, 2 Jun 2022 04:34:27 GMT, Sandhya Viswanathan wrote: >> We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2. >> The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%. 
>> The performance regression is due to auto-vectorization of small loops.
>> We don't have AVX3Threshold consideration in auto-vectorization.
>> The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors.
>>
>> This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line.
>>
>> Please review.
>>
>> Best Regard,
>> Sandhya
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   Change SuperWordMaxVectorSize to develop option

src/hotspot/share/opto/c2_globals.hpp line 85:

> 83:           range(0, max_jint)                                                \
> 84:                                                                             \
> 85:   develop(intx, SuperWordMaxVectorSize, 64,                                 \

The flag can't be develop because it is used in product code. It should be `diagnostic`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8877

From kvn at openjdk.java.net Thu Jun 2 06:06:41 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Thu, 2 Jun 2022 06:06:41 GMT
Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v14]
In-Reply-To:
References: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com>
Message-ID: <8hRIYi8cBFptlhj-0bkKQcoZHHsT09vkw7pA7RkjdMw=.04923755-27d6-4a37-b6f6-9b06f6615ee6@github.com>

On Thu, 2 Jun 2022 05:38:55 GMT, Srinivas Vamsi Parasa wrote:

> > Testing results are good.
> > You need second review approval.
>
> Thanks Vladimir! Will ask for second review/approval. For the github pre-submit tests, on Windows x86 (Tier 1), HighlightUITest.java is failing.
known issue: [JDK-8284144](https://bugs.openjdk.java.net/browse/JDK-8284144) ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From fgao at openjdk.java.net Thu Jun 2 06:21:42 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 2 Jun 2022 06:21:42 GMT Subject: RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3] In-Reply-To: <-YSBW13QmJqQ2mo-oXxRQWYmUepV6jVrZqXe4mZs7Ew=.77dc391a-fa8f-49a1-9cab-d809efa19730@github.com> References: <-YSBW13QmJqQ2mo-oXxRQWYmUepV6jVrZqXe4mZs7Ew=.77dc391a-fa8f-49a1-9cab-d809efa19730@github.com> Message-ID: On Wed, 1 Jun 2022 20:55:16 GMT, John R Rose wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Merge branch 'master' into fg8282470 >> >> Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b >> - Merge branch 'master' into fg8282470 >> >> Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7 >> - 8282470: Eliminate useless sign extension before some subword integer operations >> >> Some loop cases of subword types, including byte and >> short, can't be vectorized by C2's SLP. Here is an example: >> ``` >> short[] addShort(short[] a, short[] b, short[] c) { >> for (int i = 0; i < SIZE; i++) { >> b[i] = (short) (a[i] + 8); // *line A* >> sres[i] = (short) (b[i] + c[i]); // *line B* >> } >> } >> ``` >> However, similar cases of int/float/double/long/char type can >> be vectorized successfully. >> >> The reason why SLP can't vectorize the short case above is >> that, as illustrated here[1], the result of the scalar add >> operation on *line A* has been promoted to int type. It needs >> to be narrowed to short type first before it can work as one >> of source operands of addition on *line B*. The demotion is >> done by left-shifting 16 bits then right-shifting 16 bits. 
>> The ideal graph for the process is showed like below. >> >> LoadS a[i] 8 >> \ / >> AddI (line A) >> / \ >> StoreC b[i] Lshift 16bits >> \ >> RShiftI 16 bits LoadS c[i] >> \ / >> AddI (line B) >> \ >> StoreC sres[i] >> >> In SLP, for most short-type cases, we can determine the precise >> type of the scalar int-type operation and finally execute it >> with short-type vector operations[2], except rshift opcode and >> abs in some situations[3]. But in this case, the source operand >> of RShiftI is from LShiftI rather than from any LoadS[4], so we >> can't determine its real type and conservatively assign it with >> int type rather than real short type. The int-type opearation >> RShiftI here can't be vectorized together with other short-type >> operations, like AddI(line B). The reason for byte loop cases >> is the same. Similar loop cases of char type could be >> vectorized because its demotion from int to char is done by >> `and` with mask rather than `lshift_rshift`. >> >> Therefore, we try to remove the patterns like >> `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short >> cases, to vectorize more scenarios. Optimizing it in the >> mid-end by i-GVN is more reasonable. >> >> What we do in the mid-end is eliminating the sign extension >> before some subword integer operations like: >> >> ``` >> int x, y; >> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 >> ``` >> to >> ``` >> short s = (short) (x OP y); >> ``` >> >> In the patch, assuming that `x` can be any int number, we need >> guarantee that the optimization doesn't have any impact on >> result. Not all arithmetic logic OPs meet the requirements. For >> example, assuming that `Imm` equals `16`, `x` equals `131068`, >> `y` equals `50` and `OP` is division`/`, >> `short s = (short) (((131068 << 16) >> 16) / 50)` is not >> equal to `short s = (short) (131068 / 50)`. 
When OP is division,
>> we may get a different result with or without demotion
>> before OP, because the upper 16 bits of the division's operands may
>> influence the lower 16 bits of the result, so division can't be
>> optimized. All optimizable opcodes are listed in
>> StoreNode::no_need_sign_extension(): for these, the upper 16 bits of
>> the src operands don't influence the lower 16 bits of the result for
>> short, and the upper 24 bits of the src operands don't influence the
>> lower 8 bits of the result for byte.
>>
>> After the patch, the short loop case above can be vectorized as:
>> ```
>> movi v18.8h, #0x8
>> ...
>> ldr q16, [x14, #32] // vector load a[i]
>> // vector add, a[i] + 8, no promotion or demotion
>> add v17.8h, v16.8h, v18.8h
>> str q17, [x6, #32] // vector store a[i] + 8, b[i]
>> ldr q17, [x0, #32] // vector load c[i]
>> // vector add, a[i] + c[i], no promotion or demotion
>> add v16.8h, v17.8h, v16.8h
>> // vector add, a[i] + c[i] + 8, no promotion or demotion
>> add v16.8h, v16.8h, v18.8h
>> str q16, [x11, #32] // vector store sres[i]
>> ...
>> ```
>>
>> The patch works for byte cases as well.
>>
>> Here is the performance data for micro-benchmark before
>> and after this patch on both AArch64 and x64 machines.
>> We can observe about 83% improvement with this patch.
>>
>> on AArch64:
>> Before the patch:
>> Benchmark  (length)  Mode  Cnt    Score    Error  Units
>> addB            523  avgt   15  401.521 ±  0.033  ns/op
>> addS            523  avgt   15  401.512 ±  0.021  ns/op
>>
>> After the patch:
>> Benchmark  (length)  Mode  Cnt    Score    Error  Units
>> addB            523  avgt   15   68.444 ±  0.318  ns/op
>> addS            523  avgt   15   69.847 ±  0.043  ns/op
>>
>> on x86:
>> Before the patch:
>> Benchmark  (length)  Mode  Cnt    Score    Error  Units
>> addB            523  avgt   15  454.102 ± 36.180  ns/op
>> addS            523  avgt   15  432.245 ± 22.640  ns/op
>>
>> After the patch:
>> Benchmark  (length)  Mode  Cnt    Score    Error  Units
>> addB            523  avgt   15   75.812 ±  5.063  ns/op
>> addS            523  avgt   15   72.839 ±
10.109 ns/op >> >> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 >> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 >> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 >> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 >> >> Change-Id: I92ce42b550ef057964a3b58716436735275d8d31 > > I will believe more in this transformation if it covers a wide range of expressions, not just a few specially-matched patterns `r = (short)(x + (int)y)`. > > Searching to an unbounded depth in one transform is usually a sign something is formulated wrong. Often it's possible to break the search up into a series of incremental transformations that, taken together, get to the desired end. > > In this case, I guess you want to be able to transform `r = (short)XYZ` such that all needless short-to-int conversions inside of `XYZ` are removed. A short-to-int conversion `S2I(x)` is needless if (a) it is an operand of a carryless or leftward-carrying ALU operation `S2I(x) op y`, and (b) all consumers of the operation only care about the low 16 bits of the result. > > I think that, whether or not this patch is accepted, and certainly before the next patch which handles ternary adds or SubI or MulI in a similar manner is proposed, we should tackle that more general problem. (And not just for 16-bit ints. If I do a mask `(x + y) & 0x7F` or sign extension `(x + y) <<7 >>7` I should be able to benefit from the same kinds of reasoning. Thanks for your review and kind suggestions, @rose00 @DamonFool @JohnTortugo I really understand your concern and agree. Frankly, I attempted to seek for a common solution to cover a wider range but failed. 
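The low-bits argument from the patch description and the review comment in this message can be checked with a few lines of Java. The following is only a minimal sketch: the class and method names (`LowBitsDemo`, `signExtend16`, `addNarrowed`, `divNarrowed`) are hypothetical and are not part of the patch. It reuses the values `x = 131068` and `y = 50` from the description.

```java
class LowBitsDemo {
    // Sign-extend the low 16 bits of x, as the (x << 16) >> 16 pattern does.
    static int signExtend16(int x) {
        return (x << 16) >> 16;
    }

    // Narrowed sum, optionally sign-extending x first.
    static short addNarrowed(int x, int y, boolean extendFirst) {
        int lhs = extendFirst ? signExtend16(x) : x;
        return (short) (lhs + y);
    }

    // Narrowed quotient, optionally sign-extending x first.
    static short divNarrowed(int x, int y, boolean extendFirst) {
        int lhs = extendFirst ? signExtend16(x) : x;
        return (short) (lhs / y);
    }

    public static void main(String[] args) {
        int x = 131068; // 0x1FFFC: the low 16 bits are 0xFFFC, i.e. -4 as a short
        int y = 50;

        // Addition only carries leftward, so the low 16 bits of the sum depend
        // only on the low 16 bits of the operands: both results are 46.
        System.out.println(addNarrowed(x, y, true) + " vs " + addNarrowed(x, y, false));

        // Division lets the upper bits influence the lower bits: 0 vs 2621.
        System.out.println(divNarrowed(x, y, true) + " vs " + divNarrowed(x, y, false));
    }
}
```

The same check would apply to the byte case with a shift amount of 24 instead of 16.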
But I will give a more try to figure it out if we can make this transformation not that specific and more general, before continuing proceeding with this patch. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/7954 From kvn at openjdk.java.net Thu Jun 2 06:25:27 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 06:25:27 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v3] In-Reply-To: References: Message-ID: On Thu, 2 Jun 2022 05:38:57 GMT, Jatin Bhateja wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> Change SuperWordMaxVectorSize to develop option > > src/hotspot/cpu/x86/vm_version_x86.cpp line 1306: > >> 1304: // Limit auto vectorization to 256 bit (32 byte) by default on Cascade Lake >> 1305: FLAG_SET_DEFAULT(SuperWordMaxVectorSize, 32); >> 1306: } else { > > SuperWordMaxVectorSize is set to 32 bytes by default, it should still be capped by MaxVectorSize, in case user sets MaxVectorSize to 16 bytes. Yes. I submitted testing with `FLAG_SET_DEFAULT(SuperWordMaxVectorSize, MIN2(MaxVectorSize, (intx)32));` And the flag declared as `DIAGNOSTIC` - product build fail otherwise. ------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From epeter at openjdk.java.net Thu Jun 2 06:53:34 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Thu, 2 Jun 2022 06:53:34 GMT Subject: Integrated: 8283466: C2: missing skeleton predicates in peeled loop In-Reply-To: <_DPfzm_6zsaUDuRyXXHK_rZYVTMYmJ_JMcUpoBfb6kA=.13ac6f8b-5aba-42b9-805e-174c83e45816@github.com> References: <_DPfzm_6zsaUDuRyXXHK_rZYVTMYmJ_JMcUpoBfb6kA=.13ac6f8b-5aba-42b9-805e-174c83e45816@github.com> Message-ID: On Thu, 19 May 2022 08:56:22 GMT, Emanuel Peter wrote: > Implemented initializing skeleton predicates for the peeled loop. 
> > We have some predicates / checks before loops that are dependent on the range of the loop, they are checked at runtime. When we split a loop (eg. peeling, pre/main/post, unswitching) one of the sub-loops may get impossible data types and remove the data flow. For static type analysis for the control flow, we need so called skeleton predicates that implement these loop checks before each split off loop. If we do not do this static analysis we generally get `bad graph` asserts, as only removing data flow and not control flow leads to broken graphs. > > This was already implemented for pre/main/post loops and loop unswitching, but not for peeling. > > Ran large test suite. > Manual inspection shows that the instantiated skeleton predicate indeed collapses, in the provided regression test. > > Rerunning some tests now... This pull request has now been integrated. Changeset: 199832a7 Author: Emanuel Peter URL: https://git.openjdk.java.net/jdk/commit/199832a7101ca9dbfe7744ca0a1c4ff11d8832f2 Stats: 248 lines in 4 files changed: 211 ins; 0 del; 37 mod 8283466: C2: missing skeleton predicates in peeled loop Reviewed-by: roland, chagedorn ------------- PR: https://git.openjdk.java.net/jdk/pull/8783 From jbhateja at openjdk.java.net Thu Jun 2 08:01:29 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 2 Jun 2022 08:01:29 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v14] In-Reply-To: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> References: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> Message-ID: <8EBxFu5AT-0OIyeweD_7IyiTQ5y6J3vpG-g9-sr9Gpw=.886642f3-564e-458e-9a47-f0f7e9649d84@github.com> On Wed, 1 Jun 2022 23:16:35 GMT, Srinivas Vamsi Parasa wrote: >> We develop optimized x86 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `IsInfinite()` for Float and Double classes. 
JMH benchmarks show up to ~70% improvement using `vfpclasss(s/d)` instructions.
>>
>>
>> Benchmark (ns/op)               Baseline  Intrinsic(vfpclasss/d)  Speedup(%)
>> FloatClassCheck.testIsFinite       0.562                   0.406         28%
>> FloatClassCheck.testIsInfinite     0.815                   0.383         53%
>> FloatClassCheck.testIsNaN          0.63                    0.382         39%
>> DoubleClassCheck.testIsFinite      0.565                   0.409         28%
>> DoubleClassCheck.testIsInfinite    0.812                   0.375         54%
>> DoubleClassCheck.testIsNaN         0.631                   0.38          40%
>> FPComparison.isFiniteDouble      332.638                 272.577         18%
>> FPComparison.isFiniteFloat       413.217                 331.825         20%
>> FPComparison.isInfiniteDouble    874.897                 240.632         72%
>> FPComparison.isInfiniteFloat     872.279                 321.269         63%
>> FPComparison.isNanDouble         286.566                 240.36          16%
>> FPComparison.isNanFloat          346.123                 316.923          8%
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
> eliminate redundate macros

Marked as reviewed by jbhateja (Committer).

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From duke at openjdk.java.net Thu Jun 2 08:01:35 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Thu, 2 Jun 2022 08:01:35 GMT
Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v7]
In-Reply-To: <4akCq1xQS8yg3EWmE8DCxAFxvTkn-3Jnrl8hH0yqFkc=.969ede12-85ee-4809-a080-8d09d7b59a38@github.com>
References: <4akCq1xQS8yg3EWmE8DCxAFxvTkn-3Jnrl8hH0yqFkc=.969ede12-85ee-4809-a080-8d09d7b59a38@github.com>
Message-ID:

On Sat, 16 Apr 2022 11:24:57 GMT, Quan Anh Mai wrote:

>> Hi, this patch improves some operations on x86_64:
>>
>> - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible:
>> + Bounded operands
>> + Multiple uops both in fused and unfused domains
>> + May result in flag stall since the operations have unpredictable flag output
>>
>> - Flag to general-purpose registers operation currently uses `cmovcc`, which requires set up and 1 more spare register for
constant, this could be replaced by set, which transforms the sequence:
>>
>> xorl dst, dst
>> sometest
>> movl tmp, 0x01
>> cmovlcc dst, tmp
>>
>> into:
>>
>> xorl dst, dst
>> sometest
>> setbcc dst
>>
>> This sequence does not need a spare register and has no drawbacks.
>> (Note: `movzx` does not work since move elision only occurs with different registers for input and output)
>>
>> - Some small improvements:
>> + Add memory variants to `tzcnt` and `lzcnt`
>> + Add memory variants to `rolx` and `rorx`
>> + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`)
>>
>> The speedup can be observed for variable shift instructions
>>
>> Before:
>> Benchmark             (size)  Mode  Cnt  Score   Error  Units
>> Integers.shiftLeft       500  avgt    5  0.836 ± 0.030  us/op
>> Integers.shiftRight      500  avgt    5  0.843 ± 0.056  us/op
>> Integers.shiftURight     500  avgt    5  0.830 ± 0.057  us/op
>> Longs.shiftLeft          500  avgt    5  0.827 ± 0.026  us/op
>> Longs.shiftRight         500  avgt    5  0.828 ± 0.018  us/op
>> Longs.shiftURight        500  avgt    5  0.829 ± 0.038  us/op
>>
>> After:
>> Benchmark             (size)  Mode  Cnt  Score   Error  Units
>> Integers.shiftLeft       500  avgt    5  0.761 ± 0.016  us/op
>> Integers.shiftRight      500  avgt    5  0.762 ± 0.071  us/op
>> Integers.shiftURight     500  avgt    5  0.765 ± 0.056  us/op
>> Longs.shiftLeft          500  avgt    5  0.755 ± 0.026  us/op
>> Longs.shiftRight         500  avgt    5  0.753 ± 0.017  us/op
>> Longs.shiftURight        500  avgt    5  0.759 ± 0.031  us/op
>>
>> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation.
>>
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 15 commits:
>
> - Resolve conflict
> - ins_cost
> - movzx is not elided with same input and output
> - fix only the needs
> - fix
> - cisc
> - delete benchmark command
> - pipe
> - fix, benchmarks
> - pipe_class
> - ...
and 5 more: https://git.openjdk.java.net/jdk/compare/e5041ae3...337c0bf3 Hi, may I have a second review for this patch, please? Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From eliu at openjdk.java.net Thu Jun 2 08:24:56 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Thu, 2 Jun 2022 08:24:56 GMT Subject: RFR: 8287028: AArch64: [vectorapi] Backend implementation of VectorMask.fromLong with SVE2 [v2] In-Reply-To: <9f4FuUVXKxeO6tC6so96ydn3nss81T7s0KvV03XlnCc=.75152f52-5b9f-4a84-bd36-0547899fa061@github.com> References: <9f4FuUVXKxeO6tC6so96ydn3nss81T7s0KvV03XlnCc=.75152f52-5b9f-4a84-bd36-0547899fa061@github.com> Message-ID: > This patch implements AArch64 codegen for VectorLongToMask using the > SVE2 BitPerm feature. With this patch, the final code (generated on an > SVE vector reg size of 512-bit QEMU emulator) is shown as below: > > mov z17.b, #0 > mov v17.d[0], x13 > sunpklo z17.h, z17.b > sunpklo z17.s, z17.h > sunpklo z17.d, z17.s > mov z16.b, #1 > bdep z17.d, z17.d, z16.d > cmpne p0.b, p7/z, z17.b, #0 Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: - Merge jdk:master Change-Id: I7cea9b028f60c447f7cc24a00d38f59e0f07ecd3 - AArch64: [vectorapi] Backend implementation of VectorMask.fromLong with SVE2 This patch implements AArch64 codegen for VectorLongToMask using the SVE2 BitPerm feature. 
With this patch, the final code (generated on an SVE vector reg size of 512-bit QEMU emulator) is shown as below: mov z17.b, #0 mov v17.d[0], x13 sunpklo z17.h, z17.b sunpklo z17.s, z17.h sunpklo z17.d, z17.s mov z16.b, #1 bdep z17.d, z17.d, z16.d cmpne p0.b, p7/z, z17.b, #0 Change-Id: I9135fce39c8a08c72b757c78b258f5d968baa7ff ------------- Changes: https://git.openjdk.java.net/jdk/pull/8789/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8789&range=01 Stats: 133 lines in 8 files changed: 101 ins; 0 del; 32 mod Patch: https://git.openjdk.java.net/jdk/pull/8789.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8789/head:pull/8789 PR: https://git.openjdk.java.net/jdk/pull/8789 From xlinzheng at openjdk.java.net Thu Jun 2 09:53:25 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Thu, 2 Jun 2022 09:53:25 GMT Subject: RFR: 8287425: Remove unnecessary register push for MacroAssembler::check_klass_subtype_slow_path In-Reply-To: References: Message-ID: On Fri, 27 May 2022 09:10:48 GMT, Xiaolin Zheng wrote: > Hi team, > > ![AE98A8E7-9F6F-4722-B310-299A9A96A957](https://user-images.githubusercontent.com/38156692/170670906-2ce37a13-af21-4cf8-acbd-ca24528bc3a9.png) > > Some perf results show unnecessary pushes in `MacroAssembler::check_klass_subtype_slow_path()` under `UseCompressedOops`. History logs show the original code is like [1], and it gets refactored in [JDK-6813212](https://bugs.openjdk.java.net/browse/JDK-6813212), and the counterparts of the `UseCompressedOops` in the diff are at [2] and [3], and we could see the push of rax is just because `encode_heap_oop_not_null()` would kill it, so here needs a push and restore. After that, [JDK-6964458](https://bugs.openjdk.java.net/browse/JDK-6964458) (removal of perm gen) at [4] removed [3] so that there is no need to do UseCompressedOops work in `MacroAssembler::check_klass_subtype_slow_path()`; but in that patch [2] didn't get removed, so we finally come here. 
As a result, [2] could also be safely removed. > > (Files in [4] are folded because the patch is too large. We could manually unfold `hotspot/src/cpu/x86/vm/assembler_x86.cpp` to see that diff) > > I was wondering if this minor change could be sponsored? > > This enhancement is raised on behalf of Wei Kuai . > > Tested x86_64 hotspot tier1~tier4 twice, aarch64 hotspot tier1~tier4 once with another jdk tier1 once, and riscv64 hotspot tier1~tier4 once. > > Thanks, > Xiaolin > > [1] https://github.com/openjdk/jdk/blob/de67e5294982ce197f2abd051cbb1c8aa6c29499/hotspot/src/cpu/x86/vm/interp_masm_x86_64.cpp#L273-L284 > [2] https://github.com/openjdk/jdk/commit/b8dbe8d8f650124b61a4ce8b70286b5b444a3316#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3R7441-R7444 > [3] https://github.com/openjdk/jdk/commit/b8dbe8d8f650124b61a4ce8b70286b5b444a3316#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3R7466-R7477 > [4] https://github.com/openjdk/jdk/commit/5c58d27aac7b291b879a7a3ff6f39fca25619103#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3L9347-L9361 Thank you for reviewing, Vladimir! Then I'd merge this trivial enhancement if no other comments. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8915 From epeter at openjdk.java.net Thu Jun 2 10:27:16 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Thu, 2 Jun 2022 10:27:16 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v24] In-Reply-To: References: Message-ID: <5Lg5ECafkez-DpxczVGhDbMPiFhsTnxNVjliaqtXlW8=.d8444062-af9a-4c34-b71f-32d59995ae5e@github.com> > **What this gives you for the debugger** > - BFS traversal (inputs / outputs) > - node filtering by category > - shortest path between nodes > - all paths between nodes > - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional) > - and more > > **Some usecases** > - more readable `dump` > - follow only nodes of some categories (only control, only data, etc) > - find which control nodes depend on data node (visit data nodes, include control in boundary) > - how two nodes relate (shortest / all paths, following input/output nodes, or both) > - find loops (control / memory / data: call all paths with node as start and target) > > **Description** > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`). To find all paths between two nodes, include the letter `A` in the options string. > > `void Node::dump_bfs(const int max_distance, Node* target, char const* options)` > > To get familiar with the many options, run this to get help: > `find_node(0)->dump_bfs(0,0,"h")` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. 
> > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > PS: I do plan to refactor the `dump` code in `node.cpp` to use my new infrastructure. I will also remove `Node::related` and `dump_related,` since it has not been properly extended and maintained. But that refactoring would risk messing with tools that depend on `dump`, which I would like to avoid for now, and do that in a second step. > > **Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. > 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`. > 3. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 4. Separate visit / boundary filters by node type: traverse graph visiting only some node types (eg. data). On the boundary, also display but do not traverse nodes allowed by boundary filter (eg. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for visit filter, and `DCMXO` for boundary filter. > 5. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors. > 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. 
Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. Use `@` in options string. > 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string. > 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string! > > Example (BFS inputs): > > (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") > d dump > --------------------------------------------- > 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > > > Example (BFS control inputs): > > (rr) p find_node(163)->dump_bfs(5,0,"c+") > d dump > --------------------------------------------- > 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 4 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 
137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 0 163 OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 > > We see the control flow of a strip mined loop. > > > Experiment (BFS only data, but display all nodes on boundary) > > (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") > d dump > --------------------------------------------- > 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) > 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] > > We see the dependent output nodes of the data-phi 102, we see that a SafePoint and the Return depend on it. Here colors are really helpful, as it makes it easy to separate the data-nodes (blue) from the boundary-nodes (other colors). 
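The visit/boundary split used in the example above can be sketched outside the VM as a plain breadth-first search. This is a hedged illustration, not the HotSpot implementation. The class `FilteredBfs`, its `Category` enum and the `bfs` signature are hypothetical, the graph is a hand-copied fragment of the output edges in the dump above, and the category assigned to each node is a guess for illustration only.

```java
import java.util.ArrayDeque;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FilteredBfs {
    enum Category { CONTROL, DATA, MEMORY, MIXED, OTHER }

    // Nodes whose category is in `visit` are traversed. Nodes whose category
    // is only in `boundary` are recorded with their distance, but their edges
    // are not followed.
    static Map<Integer, Integer> bfs(int start,
                                     Map<Integer, List<Integer>> edges,
                                     Map<Integer, Category> categories,
                                     Set<Category> visit,
                                     Set<Category> boundary,
                                     int maxDistance) {
        Map<Integer, Integer> distance = new LinkedHashMap<>();
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        distance.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            int node = queue.remove();
            int d = distance.get(node);
            if (d == maxDistance) {
                continue; // do not expand past the requested distance
            }
            for (int next : edges.getOrDefault(node, List.of())) {
                if (distance.containsKey(next)) {
                    continue; // already seen at a shorter or equal distance
                }
                Category category = categories.get(next);
                if (visit.contains(category)) {
                    distance.put(next, d + 1);
                    queue.add(next);
                } else if (boundary.contains(category)) {
                    distance.put(next, d + 1); // displayed, not traversed
                }
            }
        }
        return distance;
    }

    // Output edges around Phi 102, copied by hand from the dump above.
    static Map<Integer, Integer> demo() {
        Map<Integer, List<Integer>> outputs = Map.of(
            102, List.of(133, 132),
            133, List.of(136),
            132, List.of(133),
            136, List.of(153, 167, 102),
            153, List.of(154),
            167, List.of(163));
        Map<Integer, Category> categories = Map.of(
            102, Category.DATA, 133, Category.DATA, 132, Category.DATA,
            136, Category.DATA, 153, Category.DATA,
            167, Category.CONTROL, 154, Category.CONTROL, 163, Category.CONTROL);
        return bfs(102, outputs, categories,
                   EnumSet.of(Category.DATA), EnumSet.allOf(Category.class), 10);
    }

    public static void main(String[] args) {
        // Node idx -> distance from 102; node 163 is absent because the
        // search stops at the boundary node 167.
        System.out.println(demo());
    }
}
```

With these toy inputs the distances match the `d` column of the dump: the data nodes are followed, while `167` and `154` appear only as boundary entries.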
> > Example with Mach nodes: > > (rr) p find_node(112)->dump_bfs(2,0,"cdmxo+#@B") > d [head idom d] old dump > --------------------------------------------- > 2 534 505 6 o1871 109 addI_rReg_imm === _ 44 [[ 110 102 113 230 327 ]] #-3/0xfffffffd > 2 536 537 15 o186 139 addI_rReg_imm === _ 137 [[ 140 137 113 144 ]] #4/0x00000004 !jvms: StringLatin1::replace @ bci:13 (line 303) > 2 537 538 14 o179 114 IfTrue === 115 [[ 536 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 536 537 15 o739 113 compI_rReg === _ 139 109 [[ 112 ]] > 1 536 537 15 _ 536 Region === 536 114 [[ 536 112 ]] > 0 536 537 15 o741 112 jmpLoopEnd === 536 113 [[ 134 111 ]] P=0.993611, C=7200.000000 !jvms: StringLatin1::replace @ bci:19 (line 303) > > And the query on the old nodes: > > (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#") > d dump > --------------------------------------------- > 2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]] > 2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]] > 2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000 > 1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]] > 1 o740 Bool === _ o739 [[ o741 ]] [lt] > 1 o179 IfTrue === o178 [[ o741 ]] #1 > 0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000 > > > **Exploring loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "c+")` > This provides us with a shortest control path, given this path has a distance of at most 20. 
> > Example (shortest path over control nodes): > > (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+") > d dump > --------------------------------------------- > 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > > Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lay on a path between the start and target node, with at most the specified path length. 
> > Example (all paths between two nodes): > > (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") > d apd dump > --------------------------------------------- > 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) > 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) > 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) > 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) > 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) > 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all paths distance `apd` is the sum of the shortest-path distance from the current node to the start node and the shortest-path distance to the target node. Thus, we can easily compute the distance to the target node with `apd - d`.
>
> An alternative way to detect loops quickly is to run an all paths query from a node to itself:
>
> Example (loop detection with all paths):
>
> (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A")
> d apd dump
> ---------------------------------------------
> 6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303)
> 5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304)
> 4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304)
> 3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304)
> 2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304)
> 1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304)
> 0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303)
>
> We get the loop control, plus the loop-back `190 IfTrue`.
>
> Example (loop detection with all paths for phi):
>
> (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A")
> d apd dump
> ---------------------------------------------
> 1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],...
!jvms: StringLatin1::replace @ bci:13 (line 303) > 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > > > **Color examples** > Colors are especially useful to see chains between nodes (options character `#`). > The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. > Tip: it can be worth it to configure the colors of your terminal to be more appealing. > > Example (find control dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) > We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. > > Example (find memory dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) > > Example (loop detection): > ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) > We find the control and some data loop paths. 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Christian's review ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/3199ade6..fd113695 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=23 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=22-23 Stats: 99 lines in 1 file changed: 1 ins; 2 del; 96 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From epeter at openjdk.java.net Thu Jun 2 13:26:06 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Thu, 2 Jun 2022 13:26:06 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v25] In-Reply-To: References: Message-ID: <0-_fGUB3L4Jgd0dwD3Hn4kjjktQoomd6APq3Rkgfu8E=.cdf578dd-91b9-4d11-a0ac-ed74fc7c96d2@github.com> > **What this gives you for the debugger** > - BFS traversal (inputs / outputs) > - node filtering by category > - shortest path between nodes > - all paths between nodes > - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional) > - and more > > **Some use cases** > - more readable `dump` > - follow only nodes of some categories (only control, only data, etc.) > - find which control nodes depend on data node (visit data nodes, include control in boundary) > - how two nodes relate (shortest / all paths, following input/output nodes, or both) > - find loops (control / memory / data: call all paths with node as start and target) > > **Description** > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. 
With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`). To find all paths between two nodes, include the letter `A` in the options string. > > `void Node::dump_bfs(const int max_distance, Node* target, char const* options)` > > To get familiar with the many options, run this to get help: > `find_node(0)->dump_bfs(0,0,"h")` > > While the comments above the function definition contain a summary and description, I want to explain some useful use cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > PS: I do plan to refactor the `dump` code in `node.cpp` to use my new infrastructure. I will also remove `Node::related` and `dump_related`, since they have not been properly extended and maintained. But that refactoring would risk breaking tools that depend on `dump`, so I would like to defer it to a second step. > > **Better dump()** > The function is similar to `dump()`: it can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. > 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`. > 3. Node filtering with types taken from `type->category()` helps limit traversal to control, data or memory flow. The categories are the same as in IGV. > 4. Separate visit / boundary filters by node type: traverse graph visiting only some node types (e.g. data). On the boundary, also display but do not traverse nodes allowed by boundary filter (e.g. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for visit filter, and `DCMXO` for boundary filter. > 5. 
Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors. > 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. Use `@` in options string. > 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string. > 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string! 
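To make items 1 and 2 of the list above concrete, here is a minimal sketch of a distance-limited BFS that follows input edges, output edges, or both. It is a toy model under stated assumptions: `ToyNode`, `toy_add_edge` and `toy_bfs` are hypothetical names invented for illustration, and the code is not the `Node::dump_bfs` implementation from the patch.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <vector>

// Hypothetical toy node, NOT HotSpot's Node class.
struct ToyNode {
  std::vector<int> inputs;   // idx of the nodes this node uses
  std::vector<int> outputs;  // idx of the nodes that use this node
};

// Record that node `user` has node `def` as an input.
void toy_add_edge(std::vector<ToyNode>& graph, int def, int user) {
  graph[user].inputs.push_back(def);
  graph[def].outputs.push_back(user);
}

// BFS from `start`, following input edges (like '+'), output edges
// (like '-'), or both. Nodes at `max_distance` are recorded but not expanded.
// Returns a map from node idx to its distance from `start`.
std::map<int, int> toy_bfs(const std::vector<ToyNode>& graph, int start,
                           int max_distance, bool follow_inputs,
                           bool follow_outputs) {
  assert(start >= 0 && start < static_cast<int>(graph.size()));
  std::map<int, int> distance;
  std::queue<int> worklist;
  distance[start] = 0;
  worklist.push(start);
  while (!worklist.empty()) {
    int idx = worklist.front();
    worklist.pop();
    int d = distance[idx];
    if (d == max_distance) continue;  // do not expand past the limit
    auto visit = [&](int next) {
      // The first time BFS reaches a node, it has found a shortest distance.
      if (distance.count(next) == 0) {
        distance[next] = d + 1;
        worklist.push(next);
      }
    };
    if (follow_inputs) {
      for (int in : graph[idx].inputs) visit(in);
    }
    if (follow_outputs) {
      for (int out : graph[idx].outputs) visit(out);
    }
  }
  return distance;
}
```

For a chain of four nodes where each node is the input of the next, `toy_bfs(graph, 2, 1, true, false)` reaches only node 2 itself and its direct input, while enabling both directions with a larger limit reaches the whole chain.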
> > Example (BFS inputs): > > (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") > d dump > --------------------------------------------- > 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > > > Example (BFS control inputs): > > (rr) p find_node(163)->dump_bfs(5,0,"c+") > d dump > --------------------------------------------- > 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 4 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 0 163 OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 > > We see the control flow of a strip mined loop. 
> > > Example (BFS only data, but display all nodes on boundary): > > (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") > d dump > --------------------------------------------- > 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) > 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] > > We see the dependent output nodes of the data-phi 102: a SafePoint and the Return depend on it. Here colors are really helpful, as they make it easy to separate the data nodes (blue) from the boundary nodes (other colors). 
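The visit / boundary split used in the query above can be sketched on a toy graph as follows. This is an illustrative assumption about the behaviour described in the text, not the code from the patch: `CatNode`, `FilterResult` and `toy_filtered_bfs` are hypothetical names, only output edges are followed, and the start node is always treated as visited.

```cpp
#include <cassert>
#include <cctype>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy node with a one-letter category:
// 'c' control, 'd' data, 'm' memory (toy labels only).
struct CatNode {
  char category;
  std::vector<int> outputs;  // idx of the nodes that use this node
};

struct FilterResult {
  std::set<int> visited;   // shown, and traversal continues through them
  std::set<int> boundary;  // shown, but traversal stops at them
};

// A lowercase letter in `options` puts that category in the visit filter,
// an uppercase letter puts it in the boundary filter.
FilterResult toy_filtered_bfs(const std::vector<CatNode>& graph, int start,
                              const std::string& options) {
  assert(start >= 0 && start < static_cast<int>(graph.size()));
  FilterResult result;
  std::queue<int> worklist;
  result.visited.insert(start);
  worklist.push(start);
  while (!worklist.empty()) {
    int idx = worklist.front();
    worklist.pop();
    for (int out : graph[idx].outputs) {
      char cat = graph[out].category;
      char upper = static_cast<char>(std::toupper(static_cast<unsigned char>(cat)));
      if (options.find(cat) != std::string::npos) {
        // Visit filter: display the node and keep traversing from it.
        if (result.visited.insert(out).second) worklist.push(out);
      } else if (options.find(upper) != std::string::npos) {
        // Boundary filter: display the node, but do not traverse further.
        result.boundary.insert(out);
      }
      // Nodes matching neither filter are skipped entirely.
    }
  }
  return result;
}
```

With options `"dC"`, a control node that uses a visited data node shows up on the boundary, but nodes that are only reachable through that control node are not explored.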
> > Example with Mach nodes: > > (rr) p find_node(112)->dump_bfs(2,0,"cdmxo+#@B") > d [head idom d] old dump > --------------------------------------------- > 2 534 505 6 o1871 109 addI_rReg_imm === _ 44 [[ 110 102 113 230 327 ]] #-3/0xfffffffd > 2 536 537 15 o186 139 addI_rReg_imm === _ 137 [[ 140 137 113 144 ]] #4/0x00000004 !jvms: StringLatin1::replace @ bci:13 (line 303) > 2 537 538 14 o179 114 IfTrue === 115 [[ 536 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 536 537 15 o739 113 compI_rReg === _ 139 109 [[ 112 ]] > 1 536 537 15 _ 536 Region === 536 114 [[ 536 112 ]] > 0 536 537 15 o741 112 jmpLoopEnd === 536 113 [[ 134 111 ]] P=0.993611, C=7200.000000 !jvms: StringLatin1::replace @ bci:19 (line 303) > > And the query on the old nodes: > > (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#") > d dump > --------------------------------------------- > 2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]] > 2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]] > 2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000 > 1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]] > 1 o740 Bool === _ o739 [[ o741 ]] [lt] > 1 o179 IfTrue === o178 [[ o741 ]] #1 > 0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000 > > > **Exploring loop body** > When I find myself in a loop, I try to locate the loop head and end, and map out at least one path between them. > `loop_end->dump_bfs(20, loop_head, "c+")` > This provides us with a shortest control path, provided that path has a length of at most 20. 
> > Example (shortest path over control nodes): > > (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+") > d dump > --------------------------------------------- > 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > > Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lie on a path between the start and target node, with at most the specified path length. 
> > Example (all paths between two nodes): > > (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") > d apd dump > --------------------------------------------- > 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) > 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) > 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) > 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) > 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) > 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all-paths distance `apd` is the sum of the shortest path from the current node to the start and the shortest path to the target node. Thus, we can easily compute the distance to the target node with `apd - d`. > > A quick alternative for detecting loops is to run an all-paths query from a node to itself: > > Example (loop detection with all paths): > > (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A") > d apd dump > --------------------------------------------- > 6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303) > 5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > We get the loop control, plus the loop-back `190 IfTrue`. > > Example (loop detection with all paths for phi): > > (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A") > d apd dump > --------------------------------------------- > 1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... 
!jvms: StringLatin1::replace @ bci:13 (line 303) > 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > > > **Color examples** > Colors are especially useful to see chains between nodes (options character `#`). > The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. > Tip: it can be worth it to configure the colors of your terminal to be more appealing. > > Example (find control dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) > We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. > > Example (find memory dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) > > Example (loop detection): > ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) > We find the control and some data loop paths. 
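The `d` and `apd` columns of the all-paths queries above can be reproduced on a toy graph with two plain BFS passes: one from the start node along input edges, and one from the target node along the reversed (output) edges. This is only a sketch under stated assumptions. `Adjacency`, `PathInfo`, `toy_distances` and `toy_all_paths` are hypothetical names, and the patch may compute these values differently.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <vector>

// edges[i] lists the nodes reachable from node i in one step.
using Adjacency = std::vector<std::vector<int>>;

struct PathInfo {
  int d;    // shortest distance from the start node
  int apd;  // d plus the shortest distance on to the target node
};

// Plain BFS: shortest distance from `source` to every reachable node.
std::map<int, int> toy_distances(const Adjacency& edges, int source) {
  assert(source >= 0 && source < static_cast<int>(edges.size()));
  std::map<int, int> distance;
  std::queue<int> worklist;
  distance[source] = 0;
  worklist.push(source);
  while (!worklist.empty()) {
    int idx = worklist.front();
    worklist.pop();
    int d = distance[idx];
    for (int next : edges[idx]) {
      if (distance.count(next) == 0) {
        distance[next] = d + 1;
        worklist.push(next);
      }
    }
  }
  return distance;
}

// inputs[i] lists the input nodes of node i. Keeps every node that lies on
// some start-to-target path (following input edges) whose shortest such
// path has length at most `max_distance`.
std::map<int, PathInfo> toy_all_paths(const Adjacency& inputs, int start,
                                      int target, int max_distance) {
  // Reverse the input edges to obtain the output edges.
  Adjacency outputs(inputs.size());
  for (int user = 0; user < static_cast<int>(inputs.size()); user++) {
    for (int def : inputs[user]) outputs[def].push_back(user);
  }
  std::map<int, int> from_start = toy_distances(inputs, start);
  std::map<int, int> to_target = toy_distances(outputs, target);
  std::map<int, PathInfo> result;
  for (const auto& entry : from_start) {
    auto it = to_target.find(entry.first);
    if (it == to_target.end()) continue;  // cannot reach the target
    int apd = entry.second + it->second;
    if (apd <= max_distance) result[entry.first] = PathInfo{entry.second, apd};
  }
  return result;
}
```

When start and target are the same node, the other nodes on a cycle through it get the cycle length as `apd`, while the start node itself gets `0`. That matches the shape of the loop-detection output above, where `741 CountedLoopEnd` has `apd = 0` and the remaining loop nodes have `apd = 7`.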
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: fixed formatting a big ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/fd113695..fcdb4335 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=24 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=23-24 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From epeter at openjdk.java.net Thu Jun 2 13:53:32 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Thu, 2 Jun 2022 13:53:32 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v26] In-Reply-To: References: Message-ID:
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: default change: if have none of cdmxo we put them all in ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/fcdb4335..160153c4 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=25 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=24-25 Stats: 14 lines in 1 file changed: 14 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From epeter at openjdk.java.net Thu Jun 2 13:58:30 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Thu, 2 Jun 2022 13:58:30 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v27] In-Reply-To: References: Message-ID: > **What this gives you for the debugger** > - BFS traversal (inputs / outputs) > - node filtering by category > - shortest path between nodes > - all paths between nodes > - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional) > - and more > > **Some use cases** > - more readable `dump` > - follow only nodes of some categories (only control, only data, etc.) > - find which control nodes depend on data node (visit data nodes, include control in boundary) > - how two nodes relate (shortest / all paths, following input/output nodes, or both) > - find loops (control / memory / data: call all paths with node as start and target) > > **Description** > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. 
With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`). To find all paths between two nodes, include the letter `A` in the options string. > > `void Node::dump_bfs(const int max_distance, Node* target, char const* options)` > > To get familiar with the many options, run this to get help: > `find_node(0)->dump_bfs(0,0,"h")` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > PS: I do plan to refactor the `dump` code in `node.cpp` to use my new infrastructure. I will also remove `Node::related` and `dump_related,` since it has not been properly extended and maintained. But that refactoring would risk messing with tools that depend on `dump`, which I would like to avoid for now, and do that in a second step. > > **Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. > 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`. > 3. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 4. Separate visit / boundary filters by node type: traverse graph visiting only some node types (eg. data). On the boundary, also display but do not traverse nodes allowed by boundary filter (eg. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for visit filter, and `DCMXO` for boundary filter. > 5. 
Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors. > 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. Use `@` in options string. > 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string. > 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string! 
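The visit / boundary filtering described in item 4 can be sketched outside the VM as a plain BFS. This is only an illustration, not the code in `node.cpp`: the class `BfsFilterSketch`, its `Category` enum and the method names are hypothetical, and the small graph in `example()` is transcribed by hand from the `[[ ... ]]` output lists of the `StringLatin1::hashCode` experiment in this message (data Phi `102`, options `dCDMOX-`).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BfsFilterSketch {
    // Hypothetical stand-in for the node categories (control / data / memory).
    enum Category { CONTROL, DATA, MEMORY }

    /** Returns node idx -> distance for every node that would be displayed. */
    static Map<Integer, Integer> bfs(Map<Integer, List<Integer>> edges,
                                     Map<Integer, Category> category,
                                     int start, int maxDistance,
                                     Set<Category> visit, Set<Category> boundary) {
        Map<Integer, Integer> distance = new LinkedHashMap<>();
        Deque<Integer> worklist = new ArrayDeque<>();
        distance.put(start, 0);
        worklist.add(start);
        while (!worklist.isEmpty()) {
            int node = worklist.poll();
            int d = distance.get(node);
            if (d == maxDistance) {
                continue; // do not expand past the distance limit
            }
            for (int next : edges.getOrDefault(node, List.of())) {
                if (distance.containsKey(next)) {
                    continue; // already seen at a shorter or equal distance
                }
                Category c = category.get(next);
                if (visit.contains(c)) {
                    distance.put(next, d + 1);
                    worklist.add(next);        // displayed and traversed
                } else if (boundary.contains(c)) {
                    distance.put(next, d + 1); // displayed, but not traversed
                }
            }
        }
        return distance;
    }

    /** Output edges of a few nodes, visiting data only, everything else on the boundary. */
    static Map<Integer, Integer> example() {
        Map<Integer, List<Integer>> outputs = Map.of(
            102, List.of(133, 132),
            133, List.of(136),
            132, List.of(133),
            136, List.of(153, 167, 102),
            153, List.of(154),
            167, List.of(163));
        Map<Integer, Category> category = Map.of(
            102, Category.DATA, 133, Category.DATA, 132, Category.DATA,
            136, Category.DATA, 153, Category.DATA,
            167, Category.CONTROL, 154, Category.CONTROL, 163, Category.CONTROL);
        return bfs(outputs, category, 102, 10,
                   Set.of(Category.DATA), Set.of(Category.CONTROL, Category.MEMORY));
    }

    public static void main(String[] args) {
        // Prints "distance idx" pairs in discovery order.
        example().forEach((node, d) -> System.out.println(d + " " + node));
    }
}
```

In this sketch the `SafePoint` `167` and the `Return` `154` are reported at distances 3 and 4, but `163` is never reached, because `167` sits on the boundary and is not expanded.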
> > Example (BFS inputs): > > (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") > d dump > --------------------------------------------- > 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > > > Example (BFS control inputs): > > (rr) p find_node(163)->dump_bfs(5,0,"c+") > d dump > --------------------------------------------- > 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 4 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 0 163 OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 > > We see the control flow of a strip mined loop. 
> > > Experiment (BFS only data, but display all nodes on boundary) > > (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") > d dump > --------------------------------------------- > 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) > 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] > > We see the dependent output nodes of the data-phi 102, we see that a SafePoint and the Return depend on it. Here colors are really helpful, as it makes it easy to separate the data-nodes (blue) from the boundary-nodes (other colors). 
> > Example with Mach nodes: > > (rr) p find_node(112)->dump_bfs(2,0,"cdmxo+#@B") > d [head idom d] old dump > --------------------------------------------- > 2 534 505 6 o1871 109 addI_rReg_imm === _ 44 [[ 110 102 113 230 327 ]] #-3/0xfffffffd > 2 536 537 15 o186 139 addI_rReg_imm === _ 137 [[ 140 137 113 144 ]] #4/0x00000004 !jvms: StringLatin1::replace @ bci:13 (line 303) > 2 537 538 14 o179 114 IfTrue === 115 [[ 536 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 536 537 15 o739 113 compI_rReg === _ 139 109 [[ 112 ]] > 1 536 537 15 _ 536 Region === 536 114 [[ 536 112 ]] > 0 536 537 15 o741 112 jmpLoopEnd === 536 113 [[ 134 111 ]] P=0.993611, C=7200.000000 !jvms: StringLatin1::replace @ bci:19 (line 303) > > And the query on the old nodes: > > (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#") > d dump > --------------------------------------------- > 2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]] > 2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]] > 2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000 > 1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]] > 1 o740 Bool === _ o739 [[ o741 ]] [lt] > 1 o179 IfTrue === o178 [[ o741 ]] #1 > 0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000 > > > **Exploring loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->dump_bfs(20, loop_head, "c+")` > This provides us with a shortest control path, provided that this path has a distance of at most 20.
> > Example (shortest path over control nodes): > > (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+") > d dump > --------------------------------------------- > 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > > Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lay on a path between the start and target node, with at most the specified path length. 
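The all-paths idea can be sketched with two plain BFS passes: one from the start node, and one from the target node over reversed edges. A node lies on some start-to-target path of length at most `maxDistance` when the two shortest distances add up to at most that limit; the sum is what the dumps label `apd`. This is a sketch under assumptions, not the implementation in `node.cpp`: the class and method names are hypothetical, and `exampleInputs()` is a hand-transcribed subset of the input edges of the `StringLatin1::replace` all-paths example in this message (inputs that lead to nodes not shown there are omitted).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AllPathsSketch {
    /** Shortest distance from {@code from} to every reachable node. */
    static Map<Integer, Integer> bfs(Map<Integer, List<Integer>> edges, int from) {
        Map<Integer, Integer> dist = new HashMap<>();
        Deque<Integer> worklist = new ArrayDeque<>();
        dist.put(from, 0);
        worklist.add(from);
        while (!worklist.isEmpty()) {
            int node = worklist.poll();
            for (int next : edges.getOrDefault(node, List.of())) {
                if (dist.putIfAbsent(next, dist.get(node) + 1) == null) {
                    worklist.add(next); // first time we see this node
                }
            }
        }
        return dist;
    }

    /** Flips every edge, so a BFS from the target yields distances *to* the target. */
    static Map<Integer, List<Integer>> reverse(Map<Integer, List<Integer>> edges) {
        Map<Integer, List<Integer>> rev = new HashMap<>();
        edges.forEach((from, tos) -> tos.forEach(
            to -> rev.computeIfAbsent(to, k -> new ArrayList<>()).add(from)));
        return rev;
    }

    /** Returns node idx -> {d, apd} for all nodes with apd <= maxDistance. */
    static Map<Integer, int[]> allPaths(Map<Integer, List<Integer>> inputs,
                                        int start, int target, int maxDistance) {
        Map<Integer, Integer> toStart = bfs(inputs, start);
        Map<Integer, Integer> toTarget = bfs(reverse(inputs), target);
        Map<Integer, int[]> result = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : toStart.entrySet()) {
            Integer rest = toTarget.get(e.getKey());
            if (rest == null) {
                continue; // this node cannot reach the target
            }
            int apd = e.getValue() + rest;
            if (apd <= maxDistance) {
                result.put(e.getKey(), new int[] { e.getValue(), apd });
            }
        }
        return result;
    }

    /** Subset of input edges around CountedLoopEnd 741 and CountedLoop 746. */
    static Map<Integer, List<Integer>> exampleInputs() {
        Map<Integer, List<Integer>> in = new HashMap<>();
        in.put(741, List.of(179, 740));
        in.put(179, List.of(178));
        in.put(178, List.of(149, 177));
        in.put(149, List.of(148));
        in.put(148, List.of(746, 147));
        in.put(740, List.of(739));
        in.put(739, List.of(186));
        in.put(186, List.of(141));
        in.put(141, List.of(746, 186));
        in.put(177, List.of(176));
        in.put(176, List.of(166));
        in.put(166, List.of(149));
        in.put(147, List.of(146));
        in.put(146, List.of(141));
        return in;
    }

    public static void main(String[] args) {
        // Prints "d apd idx" for each node, sorted by node idx.
        allPaths(exampleInputs(), 741, 746, 8).forEach(
            (node, v) -> System.out.println(v[0] + " " + v[1] + " " + node));
    }
}
```

With a limit of 8 the sketch reports 15 nodes, for example `146` with `d = 6` and `apd = 8`; with a limit of 5 only the ten nodes on the length-5 paths remain. The distance from a node to the target is `apd - d`.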
> > Example (all paths between two nodes): > > (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") > d apd dump > --------------------------------------------- > 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) > 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) > 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) > 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) > 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) > 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all paths distance `apd` is the sum of the length of the shortest path from the current node to the start and the length of the shortest path to the target node. Thus, we can easily compute the distance to the target node with `apd - d`. > > An alternative way to detect loops quickly is to run an all paths query from a node to itself: > > Example (loop detection with all paths): > > (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A") > d apd dump > --------------------------------------------- > 6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303) > 5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > We get the loop control, plus the loop-back `190 IfTrue`. > > Example (loop detection with all paths for phi): > > (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A") > d apd dump > --------------------------------------------- > 1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],...
!jvms: StringLatin1::replace @ bci:13 (line 303) > 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > > > **Color examples** > Colors are especially useful to see chains between nodes (options character `#`). > The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. > Tip: it can be worth it to configure the colors of your terminal to be more appealing. > > Example (find control dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) > We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. > > Example (find memory dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) > > Example (loop detection): > ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) > We find the control and some data loop paths. 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: write out chained assignment ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/160153c4..6cfe0e1e Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=26 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=25-26 Stats: 5 lines in 1 file changed: 4 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From fgao at openjdk.java.net Thu Jun 2 14:02:55 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 2 Jun 2022 14:02:55 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v6] In-Reply-To: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: > After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like: > int <-> double > float <-> long > int <-> long > float <-> double > > A typical test case: > > int[] a; > double[] b; > for (int i = start; i < limit; i++) { > b[i] = (double) a[i]; > } > > Our expected OptoAssembly code for one iteration is like below: > > add R12, R2, R11, LShiftL #2 > vector_load V16,[R12, #16] > vectorcast_i2d V16, V16 # convert I to D vector > add R11, R1, R11, LShiftL #3 # ptr > add R13, R11, #16 # ptr > vector_store [R13], V16 > > To enable the vectorization, the patch solves the following problems in the SLP. > > There are three main operations in the case above, LoadI, ConvI2D and StoreD. 
Assuming that the vector length is 128 bits, how many scalar nodes should be packed together into a vector? If we decide it separately for each operation node, as we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain > and then pick the minimum of these element counts, as the function SuperWord::max_vector_size_in_ud_chain() does in superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate a valid vector node sequence: loading 2 elements, converting the 2 elements to another type and storing the 2 elements with the new type. > > After this, LoadI nodes don't make full use of the whole vector and only occupy part of it, so we adapt the code in SuperWord::get_vw_bytes_special() to the situation. > > In SLP, we calculate a kind of alignment as a position trace for each scalar node in the whole vector. In this case, the alignments for the 2 LoadI nodes are 0 and 4, while the alignments for the 2 ConvI2D nodes are 0 and 8. Here 4 for LoadI and 8 for ConvI2D mean the same thing: both mark the node as the second node in the whole vector, and the difference between 4 and 8 comes only from their data sizes. In this situation, we should remove the impact of the different data sizes in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining whether it is possible to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of the different data sizes by transforming the target alignment from the use node.
Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well.
>
> Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use().
>
> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.
>
> Here is the test data (-XX:+UseSuperWord) on NEON:
>
> Before the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  216.431 ±  0.131  ns/op
> convertD2I       523  avgt   15  220.522 ±  0.311  ns/op
> convertF2D       523  avgt   15  217.034 ±  0.292  ns/op
> convertF2L       523  avgt   15  231.634 ±  1.881  ns/op
> convertI2D       523  avgt   15  229.538 ±  0.095  ns/op
> convertI2L       523  avgt   15  214.822 ±  0.131  ns/op
> convertL2F       523  avgt   15  230.188 ±  0.217  ns/op
> convertL2I       523  avgt   15  162.234 ±  0.235  ns/op
>
> After the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  124.352 ±  1.079  ns/op
> convertD2I       523  avgt   15  557.388 ±  8.166  ns/op
> convertF2D       523  avgt   15  118.082 ±  4.026  ns/op
> convertF2L       523  avgt   15  225.810 ± 11.180  ns/op
> convertI2D       523  avgt   15  166.247 ±  0.120  ns/op
> convertI2L       523  avgt   15  119.699 ±  2.925  ns/op
> convertL2F       523  avgt   15  220.847 ±  0.053  ns/op
> convertL2I       523  avgt   15  122.339 ±  2.738  ns/op
>
> perf data on X86:
> Before the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  279.466 ±  0.069  ns/op
> convertD2I       523  avgt   15  551.009 ±  7.459  ns/op
> convertF2D       523  avgt   15  276.066 ±  0.117  ns/op
> convertF2L       523  avgt   15  545.108 ±  5.697  ns/op
> convertI2D       523  avgt   15  745.303 ±  0.185  ns/op
> convertI2L       523  avgt   15  260.878 ±  0.044  ns/op
> convertL2F       523  avgt   15  502.016 ±  0.172  ns/op
> convertL2I       523  avgt   15  261.654 ±  3.326  ns/op
>
> After the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  106.975 ±  0.045  ns/op
> convertD2I       523  avgt   15  546.866 ±  9.287  ns/op
> convertF2D       523  avgt   15   82.414 ±  0.340  ns/op
> convertF2L       523  avgt   15  542.235 ±  2.785  ns/op
> convertI2D       523  avgt   15   92.966 ±  1.400  ns/op
> convertI2L       523  avgt   15   79.960 ±  0.528  ns/op
> convertL2F       523  avgt   15  504.712 ±  4.794  ns/op
> convertL2I       523  avgt   15  129.753 ±  0.094  ns/op
>
> perf data on AVX512:
> Before the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  282.984 ±  4.022  ns/op
> convertD2I       523  avgt   15  543.080 ±  3.873  ns/op
> convertF2D       523  avgt   15  273.950 ±  0.131  ns/op
> convertF2L       523  avgt   15  539.568 ±  2.747  ns/op
> convertI2D       523  avgt   15  745.238 ±  0.069  ns/op
> convertI2L       523  avgt   15  260.935 ±  0.169  ns/op
> convertL2F       523  avgt   15  501.870 ±  0.359  ns/op
> convertL2I       523  avgt   15  257.508 ±  0.174  ns/op
>
> After the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15   76.687 ±  0.530  ns/op
> convertD2I       523  avgt   15  545.408 ±  4.657  ns/op
> convertF2D       523  avgt   15  273.935 ±  0.099  ns/op
> convertF2L       523  avgt   15  540.534 ±  3.032  ns/op
> convertI2D       523  avgt   15  745.234 ±  0.053  ns/op
> convertI2L       523  avgt   15  260.865 ±  0.104  ns/op
> convertL2F       523  avgt   15   63.834 ±  4.777  ns/op
> convertL2I       523  avgt   15   48.183 ±  0.990  ns/op

Fei Gao has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains seven commits: - Implement an interface for auto-vectorization to consult supported match rules Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701 - Merge branch 'master' into fg8283091 Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd - Merge branch 'master' into fg8283091 Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f - Merge branch 'master' into fg8283091 Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 - Add micro-benchmark cases Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8 - Merge branch 'master' into fg8283091 Change-Id: I674581135fd0844accc65520574fcef161eededa - 8283091: Support type conversion between different data sizes in SLP After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like: int <-> double float <-> long int <-> long float <-> double A typical test case: int[] a; double[] b; for (int i = start; i < limit; i++) { b[i] = (double) a[i]; } Our expected OptoAssembly code for one iteration is like below: add R12, R2, R11, LShiftL #2 vector_load V16,[R12, #16] vectorcast_i2d V16, V16 # convert I to D vector add R11, R1, R11, LShiftL #3 # ptr add R13, R11, #16 # ptr vector_store [R13], V16 To enable the vectorization, the patch solves the following problems in the SLP. There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. 
As a result, we should look through the whole def-use chain and then pick the minimum of these element counts, as the function SuperWord::max_vector_size_in_ud_chain() does in superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate a valid vector node sequence: loading 2 elements, converting the 2 elements to another type and storing the 2 elements with the new type. After this, LoadI nodes don't make full use of the whole vector and only occupy part of it, so we adapt the code in SuperWord::get_vw_bytes_special() to the situation. In SLP, we calculate a kind of alignment as a position trace for each scalar node in the whole vector. In this case, the alignments for the 2 LoadI nodes are 0 and 4, while the alignments for the 2 ConvI2D nodes are 0 and 8. Here 4 for LoadI and 8 for ConvI2D mean the same thing: both mark the node as the second node in the whole vector, and the difference between 4 and 8 comes only from their data sizes. In this situation, we should remove the impact of the different data sizes in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining whether it is possible to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of the different data sizes by transforming the target alignment from the use node. We do this because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well. Similarly, when determining whether the vectorization is profitable, a type conversion between different data sizes takes a type of one size and produces a type of another size, hence special checks on alignment and size should be applied, as we do in SuperWord::is_vector_use(). After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.
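The two pieces of arithmetic in the description above can be written down as a small sketch. It is not the code in superword.cpp: `SlpPackSketch`, `lanesForChain` and `defAlignment` are hypothetical names, and the alignment formula is an assumption that simply rescales the use node's byte offset by the ratio of the two element sizes, which reproduces the numbers quoted in the text.

```java
public class SlpPackSketch {
    /**
     * Number of scalar nodes to pack per vector when every node in a
     * def-use chain has to use the same lane count: the widest element
     * in the chain limits how many lanes fit.
     */
    static int lanesForChain(int vectorBytes, int... elementBytes) {
        if (elementBytes.length == 0) {
            throw new IllegalArgumentException("need at least one element size");
        }
        int widest = 0;
        for (int size : elementBytes) {
            widest = Math.max(widest, size);
        }
        return vectorBytes / widest;
    }

    /**
     * Maps the alignment (byte offset within the vector) of a use node to
     * the alignment expected of its def node, so that both describe the
     * same lane even though their element sizes differ.
     */
    static int defAlignment(int useAlignment, int useElementBytes, int defElementBytes) {
        int lane = useAlignment / useElementBytes; // position of the node in the vector
        return lane * defElementBytes;
    }

    public static void main(String[] args) {
        // 128-bit vector, chain LoadI (4 bytes) -> ConvI2D (8 bytes) -> StoreD (8 bytes)
        System.out.println(lanesForChain(16, 4, 8, 8)); // 2
        // ConvI2D uses at offsets 16 and 24 correspond to LoadI defs at 8 and 12
        System.out.println(defAlignment(16, 8, 4));     // 8
        System.out.println(defAlignment(24, 8, 4));     // 12
    }
}
```

Deciding per node instead would give `16 / 4 = 4` lanes for LoadI but only 2 for ConvI2D and StoreD, which is the mismatch the description warns about.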
Here is the test data on NEON:
Before the patch:
Benchmark              (length)  Mode  Cnt    Score    Error  Units
VectorLoop.convertD2F       523  avgt   15  216.431 ±  0.131  ns/op
VectorLoop.convertD2I       523  avgt   15  220.522 ±  0.311  ns/op
VectorLoop.convertF2D       523  avgt   15  217.034 ±  0.292  ns/op
VectorLoop.convertF2L       523  avgt   15  231.634 ±  1.881  ns/op
VectorLoop.convertI2D       523  avgt   15  229.538 ±  0.095  ns/op
VectorLoop.convertI2L       523  avgt   15  214.822 ±  0.131  ns/op
VectorLoop.convertL2F       523  avgt   15  230.188 ±  0.217  ns/op
VectorLoop.convertL2I       523  avgt   15  162.234 ±  0.235  ns/op
After the patch:
Benchmark              (length)  Mode  Cnt    Score    Error  Units
VectorLoop.convertD2F       523  avgt   15  124.352 ±  1.079  ns/op
VectorLoop.convertD2I       523  avgt   15  557.388 ±  8.166  ns/op
VectorLoop.convertF2D       523  avgt   15  118.082 ±  4.026  ns/op
VectorLoop.convertF2L       523  avgt   15  225.810 ± 11.180  ns/op
VectorLoop.convertI2D       523  avgt   15  166.247 ±  0.120  ns/op
VectorLoop.convertI2L       523  avgt   15  119.699 ±  2.925  ns/op
VectorLoop.convertL2F       523  avgt   15  220.847 ±  0.053  ns/op
VectorLoop.convertL2I       523  avgt   15  122.339 ±  2.738  ns/op
perf data on X86:
Before the patch:
Benchmark              (length)  Mode  Cnt    Score    Error  Units
VectorLoop.convertD2F       523  avgt   15  279.466 ±  0.069  ns/op
VectorLoop.convertD2I       523  avgt   15  551.009 ±  7.459  ns/op
VectorLoop.convertF2D       523  avgt   15  276.066 ±  0.117  ns/op
VectorLoop.convertF2L       523  avgt   15  545.108 ±  5.697  ns/op
VectorLoop.convertI2D       523  avgt   15  745.303 ±  0.185  ns/op
VectorLoop.convertI2L       523  avgt   15  260.878 ±  0.044  ns/op
VectorLoop.convertL2F       523  avgt   15  502.016 ±  0.172  ns/op
VectorLoop.convertL2I       523  avgt   15  261.654 ±  3.326  ns/op
After the patch:
Benchmark              (length)  Mode  Cnt    Score    Error  Units
VectorLoop.convertD2F       523  avgt   15  106.975 ±  0.045  ns/op
VectorLoop.convertD2I       523  avgt   15  546.866 ±  9.287  ns/op
VectorLoop.convertF2D       523  avgt   15   82.414 ±  0.340  ns/op
VectorLoop.convertF2L       523  avgt   15  542.235 ±  2.785  ns/op
VectorLoop.convertI2D       523  avgt   15   92.966 ±  1.400  ns/op
VectorLoop.convertI2L       523  avgt   15   79.960 ±  0.528  ns/op
VectorLoop.convertL2F       523  avgt   15  504.712 ±  4.794  ns/op
VectorLoop.convertL2I       523  avgt   15  129.753 ±  0.094  ns/op
perf data on AVX512:
Before the patch:
Benchmark              (length)  Mode  Cnt    Score    Error  Units
VectorLoop.convertD2F       523  avgt   15  282.984 ±  4.022  ns/op
VectorLoop.convertD2I       523  avgt   15  543.080 ±  3.873  ns/op
VectorLoop.convertF2D       523  avgt   15  273.950 ±  0.131  ns/op
VectorLoop.convertF2L       523  avgt   15  539.568 ±  2.747  ns/op
VectorLoop.convertI2D       523  avgt   15  745.238 ±  0.069  ns/op
VectorLoop.convertI2L       523  avgt   15  260.935 ±  0.169  ns/op
VectorLoop.convertL2F       523  avgt   15  501.870 ±  0.359  ns/op
VectorLoop.convertL2I       523  avgt   15  257.508 ±  0.174  ns/op
After the patch:
Benchmark              (length)  Mode  Cnt    Score    Error  Units
VectorLoop.convertD2F       523  avgt   15   76.687 ±  0.530  ns/op
VectorLoop.convertD2I       523  avgt   15  545.408 ±  4.657  ns/op
VectorLoop.convertF2D       523  avgt   15  273.935 ±  0.099  ns/op
VectorLoop.convertF2L       523  avgt   15  540.534 ±  3.032  ns/op
VectorLoop.convertI2D       523  avgt   15  745.234 ±  0.053  ns/op
VectorLoop.convertI2L       523  avgt   15  260.865 ±  0.104  ns/op
VectorLoop.convertL2F       523  avgt   15   63.834 ±  4.777  ns/op
VectorLoop.convertL2I       523  avgt   15   48.183 ±  0.990  ns/op
Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef ------------- Changes: https://git.openjdk.java.net/jdk/pull/7806/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7806&range=05 Stats: 1269 lines in 22 files changed: 1212 ins; 13 del; 44 mod Patch: https://git.openjdk.java.net/jdk/pull/7806.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7806/head:pull/7806 PR: https://git.openjdk.java.net/jdk/pull/7806 From fgao at openjdk.java.net Thu Jun 2 14:09:21 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 2 Jun 2022 14:09:21 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v3] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <3_-2N1Kf4WIryx7eFIrXomabZJTeVNvSJ10joWdzN4s=.a16c8b8e-0834-48f8-9eac-6aaf07822ad5@github.com> Message-ID: On Wed, 25 May 2022 01:13:36 GMT, Fei Gao wrote: >> @fg1417 Thank you for suggesting this optimization. I see that it was not updated for some time. Do you still intend to work on it? >> >> Please update to latest JDK (Loom was integrated) and run performance again. Also include % of changes. >> >> I have the same concern as @DamonFool about regression when vectorizing some conversions. May be we should have additional `Matcher` property we could consult when trying to **auto-vectorize**. I understand that we need `vcvt2Dto2I` when VectorAPI specifically asking to generate it but we should not enforce auto-generation. > >> Please update to latest JDK (Loom was integrated) and run performance again. Also include % of changes. >> >> I have the same concern as @DamonFool about regression when vectorizing some conversions. May be we should have additional `Matcher` property we could consult when trying to **auto-vectorize**. I understand that we need `vcvt2Dto2I` when VectorAPI specifically asking to generate it but we should not enforce auto-generation.
> > @vnkozlov thanks for your review and kind suggestion! I'll update the patch to resolve the potential performance regression.
> @fg1417 I don't see new update in this PR. Please also show performance numbers with new changes
Here is the perf uplift data (ns/op) on different machines for the latest patch.

NEON         perf change (ns/op)
convertB2D   not supported
convertB2F   -45.55%
convertB2L   not supported
convertD2B   not supported
convertD2F   -42.32%
convertD2I   not supported (VectorAPI supported)
convertD2S   not supported
convertF2B   -42.95%
convertF2D   -45.28%
convertF2L   -5.78%
convertF2S   -51.30%
convertI2D   -27.82%
convertI2L   -44.54%
convertL2B   not supported
convertL2F   not supported (VectorAPI supported)
convertL2I   -28.58%
convertL2S   not supported
convertS2D   not supported
convertS2F   -53.37%
convertS2L   not supported

SVE          perf change (ns/op)
convertB2D   -36.15%
convertB2F   -63.48%
convertB2L   -32.48%
convertD2B   0.02%
convertD2F   -47.85%
convertD2I   -46.42%
convertD2S   -32.08%
convertF2B   -59.54%
convertF2D   -60.81%
convertF2L   -61.81%
convertF2S   -67.67%
convertI2D   -60.63%
convertI2L   -57.23%
convertL2B   0.04%
convertL2F   -47.21%
convertL2I   -34.49%
convertL2S   -19.57%
convertS2D   -47.20%
convertS2F   -74.86%
convertS2L   -49.00%

X86          perf change (ns/op)
convertB2D   -64.13%
convertB2F   -79.37%
convertB2L   -70.97%
convertD2B   not supported
convertD2F   -62.69%
convertD2I   not supported
convertD2S   not supported
convertF2B   not supported
convertF2D   -68.90%
convertF2L   not supported
convertF2S   not supported
convertI2D   -87.48%
convertI2L   -69.64%
convertL2B   -3.96%
convertL2F   -0.11%
convertL2I   -49.59%
convertL2S   -24.75%
convertS2D   -84.35%
convertS2F   -86.09%
convertS2L   -70.42%

AVX512       perf change (ns/op)
convertB2D   -78.08%
convertB2F   -86.39%
convertB2L   -79.07%
convertD2B   not supported
convertD2F   -71.86%
convertD2I   not supported
convertD2S   not supported
convertF2B   not supported
convertF2D   -78.17%
convertF2L   not supported
convertF2S   not supported
convertI2D   -90.26%
convertI2L   -79.92%
convertL2B   -70.75%
convertL2F   -86.67%
convertL2I   -80.94%
convertL2S   -71.54%
convertS2D   -90.84%
convertS2F   -83.94%
convertS2L   -80.51%

------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From fgao at openjdk.java.net Thu Jun 2 14:09:21 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 2 Jun 2022 14:09:21 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v3] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <3_-2N1Kf4WIryx7eFIrXomabZJTeVNvSJ10joWdzN4s=.a16c8b8e-0834-48f8-9eac-6aaf07822ad5@github.com> Message-ID: On Thu, 2 Jun 2022 14:02:33 GMT, Fei Gao wrote: > @fg1417 I don't see new update in this PR. Please also show performance numbers with new changes @vnkozlov I updated the patch to resolve the potential performance regression and also updated the performance numbers in the comment. But the patch depends on the fix, [8287517: C2: assert(vlen_in_bytes == 64) failed: 2 by sviswa7 · Pull Request #8961 · openjdk/jdk (github.com)](https://github.com/openjdk/jdk/pull/8961), because when we enable the type conversion and loop induction at the same time, we can vectorize more scenarios like https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/test/hotspot/jtreg/compiler/c2/cr6340864/TestDoubleVect.java#L723. When we set `MaxVectorSize=16`, the case here would fail. All jtreg tests passed except that one. Please help review. Thanks.
-------------

PR: https://git.openjdk.java.net/jdk/pull/7806

From fgao at openjdk.java.net Thu Jun 2 14:16:35 2022
From: fgao at openjdk.java.net (Fei Gao)
Date: Thu, 2 Jun 2022 14:16:35 GMT
Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v2]
In-Reply-To: <2ZEEJJQuJDrG1UuL6IOMr5nvCm1DCs2PLPp4y0Dpqag=.0d3588d9-6f49-43c6-bfbf-62cdb239450f@github.com>
References: <54Cx68cjFE-RfvwVJB92DhENPyRIwzhi3jfyG5ZGPSg=.563519e8-7880-4754-933c-78d66affabef@github.com> <2ZEEJJQuJDrG1UuL6IOMr5nvCm1DCs2PLPp4y0Dpqag=.0d3588d9-6f49-43c6-bfbf-62cdb239450f@github.com>
Message-ID:

On Thu, 2 Jun 2022 04:37:58 GMT, Sandhya Viswanathan wrote:

>> I think we missed the test with setting `MaxVectorSize` to 32 (vs 64) on Cascade Lake CPU. We should do that.
>>
>> That may be preferable "simple fix" vs suggested changes for "short term solution".
>>
>> The objection was that user may still want to use wide 64 bytes vectors for Vector API. But I agree with Jatin argument about that.
>> Limiting `MaxVectorSize` **will** affect our intrinsics/stubs code and may affect performance. That is why we need to test it. I will ask Eric.
>>
>> BTW, `SuperWordMaxVectorSize` should be diagnostic or experimental since it is temporary solution.
>
> @vnkozlov I have made SuperWordMaxVectorSize as a develop option as you suggested. As far as I know, the only intrinsics/stubs that uses MaxVectorSize are for clear/copy. This is done in conjunction with AVX3Threshold so we are ok there for Cascade Lake.

Hi @sviswa7, https://github.com/openjdk/jdk/pull/7806 implemented an interface for auto-vectorization to disable some unprofitable cases on aarch64. Can it also be applied to your case?
------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From chagedorn at openjdk.java.net Thu Jun 2 14:18:40 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 2 Jun 2022 14:18:40 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v27] In-Reply-To: References: Message-ID: On Thu, 2 Jun 2022 13:58:30 GMT, Emanuel Peter wrote: >> **What this gives you for the debugger** >> - BFS traversal (inputs / outputs) >> - node filtering by category >> - shortest path between nodes >> - all paths between nodes >> - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional) >> - and more >> >> **Some usecases** >> - more readable `dump` >> - follow only nodes of some categories (only control, only data, etc) >> - find which control nodes depend on data node (visit data nodes, include control in boundary) >> - how two nodes relate (shortest / all paths, following input/output nodes, or both) >> - find loops (control / memory / data: call all paths with node as start and target) >> >> **Description** >> I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`). To find all paths between two nodes, include the letter `A` in the options string. >> >> `void Node::dump_bfs(const int max_distance, Node* target, char const* options)` >> >> To get familiar with the many options, run this to get help: >> `find_node(0)->dump_bfs(0,0,"h")` >> >> While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. 
>> >> Please let me know if you would find this helpful, or if you have any feedback to improve it. >> Thanks, Emanuel >> >> PS: I do plan to refactor the `dump` code in `node.cpp` to use my new infrastructure. I will also remove `Node::related` and `dump_related,` since it has not been properly extended and maintained. But that refactoring would risk messing with tools that depend on `dump`, which I would like to avoid for now, and do that in a second step. >> >> **Better dump()** >> The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: >> >> 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. >> 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`. >> 3. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. >> 4. Separate visit / boundary filters by node type: traverse graph visiting only some node types (eg. data). On the boundary, also display but do not traverse nodes allowed by boundary filter (eg. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for visit filter, and `DCMXO` for boundary filter. >> 5. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors. >> 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. 
Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. Use `@` in options string. >> 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string. >> 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string! >> >> Example (BFS inputs): >> >> (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") >> d dump >> --------------------------------------------- >> 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] >> 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> >> >> Example (BFS control inputs): >> >> (rr) p find_node(163)->dump_bfs(5,0,"c+") >> d dump >> --------------------------------------------- >> 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] >> 4 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 1 167 SafePoint === 162 
1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) >> 0 163 OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 >> >> We see the control flow of a strip mined loop. >> >> >> Experiment (BFS only data, but display all nodes on boundary) >> >> (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") >> d dump >> --------------------------------------------- >> 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) >> 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) >> 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) >> 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) >> 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] >> >> We see the dependent output nodes of the data-phi 102, we see that a SafePoint and the Return depend on it. Here colors are really helpful, as it makes it easy to separate the data-nodes (blue) from the boundary-nodes (other colors). 
>>
>> Example with Mach nodes:
>>
>>     (rr) p find_node(112)->dump_bfs(2,0,"cdmxo+#@B")
>>     d [head idom d] old dump
>>     ---------------------------------------------
>>     2 534 505 6 o1871 109 addI_rReg_imm === _ 44 [[ 110 102 113 230 327 ]] #-3/0xfffffffd
>>     2 536 537 15 o186 139 addI_rReg_imm === _ 137 [[ 140 137 113 144 ]] #4/0x00000004 !jvms: StringLatin1::replace @ bci:13 (line 303)
>>     2 537 538 14 o179 114 IfTrue === 115 [[ 536 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304)
>>     1 536 537 15 o739 113 compI_rReg === _ 139 109 [[ 112 ]]
>>     1 536 537 15 _ 536 Region === 536 114 [[ 536 112 ]]
>>     0 536 537 15 o741 112 jmpLoopEnd === 536 113 [[ 134 111 ]] P=0.993611, C=7200.000000 !jvms: StringLatin1::replace @ bci:19 (line 303)
>>
>> And the query on the old nodes:
>>
>>     (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#")
>>     d dump
>>     ---------------------------------------------
>>     2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]]
>>     2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]]
>>     2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000
>>     1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]]
>>     1 o740 Bool === _ o739 [[ o741 ]] [lt]
>>     1 o179 IfTrue === o178 [[ o741 ]] #1
>>     0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000
>>
>>
>> **Exploring loop body**
>> When I find myself in a loop, I try to locate the loop head and end, and map out at least one path between them.
>> `loop_end->dump_bfs(20, loop_head, "c+")`
>> This provides us with a shortest control path, provided this path has a distance of at most 20.
>>
>> Example (shortest path over control nodes):
>>
>>     (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+")
>>     d dump
>>     ---------------------------------------------
>>     5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304)
>>     4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304)
>>     3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304)
>>     2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304)
>>     1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304)
>>     0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303)
>>
>>
>> Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lie on a path between the start and target node, with at most the specified path length.
>> >> Example (all paths between two nodes): >> >> (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") >> d apd dump >> --------------------------------------------- >> 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) >> 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) >> 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) >> 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) >> 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) >> 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all paths distance `apd` is the sum of the shortest path from the current node to the start and the shortest path to the target node. Thus, we can easily compute the distance to the target node with `apd - d`.
>>
>> An alternative way to detect loops quickly is to run an all paths query from a node to itself:
>>
>> Example (loop detection with all paths):
>>
>>     (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A")
>>     d apd dump
>>     ---------------------------------------------
>>     6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303)
>>     5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304)
>>     4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304)
>>     3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304)
>>     2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304)
>>     1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304)
>>     0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303)
>>
>> We get the loop control, plus the loop-back `190 IfTrue`.
>>
>> Example (loop detection with all paths for phi):
>>
>>     (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A")
>>     d apd dump
>>     ---------------------------------------------
>>     1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],...
!jvms: StringLatin1::replace @ bci:13 (line 303) >> 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) >> >> >> **Color examples** >> Colors are especially useful to see chains between nodes (options character `#`). >> The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. >> Tip: it can be worth it to configure the colors of your terminal to be more appealing. >> >> Example (find control dependency of data node): >> ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) >> We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. >> >> Example (find memory dependency of data node): >> ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) >> >> Example (loop detection): >> ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) >> We find the control and some data loop paths. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > write out chained assignment These are cool new dumping features! I especially like the coloring and the filtering which follow the coloring and filtering of IGV. The implementation looks good apart from some minor code styles things. Thanks for addressing all the offline update suggestions! @vnkozlov should also review it again as his review applies to an old state which is now quite different. 
src/hotspot/share/opto/callnode.hpp line 659:

> 657:
> 658: #ifndef PRODUCT
> 659:   virtual void dump_req(outputStream *st = tty, DumpConfig* dc = nullptr) const;

`*` should be left at the type:
Suggestion:

  virtual void dump_req(outputStream* st = tty, DumpConfig* dc = nullptr) const;

src/hotspot/share/opto/node.cpp line 1770:

> 1768:   Node* old_node(Node* n); // mach node -> prior IR node
> 1769:   void print_node_idx(Node* n); // to tty
> 1770:   void print_node_block(Node* n); // to tty: head idx, _idom, _dom_depth

Can all be made `static`.

src/hotspot/share/opto/node.cpp line 2327:

> 2325:
> 2326: // -----------------------------dump_idx---------------------------------------
> 2327: void Node::dump_idx(bool align, outputStream *st, DumpConfig* dc) const {

Suggestion:

void Node::dump_idx(bool align, outputStream* st, DumpConfig* dc) const {

src/hotspot/share/opto/node.cpp line 2333:

> 2331:   Compile* C = Compile::current();
> 2332:   bool is_new = C->node_arena()->contains(this);
> 2333:   if(align) { // print prefix empty spaces$

Suggestion:

  if (align) { // print prefix empty spaces$

src/hotspot/share/opto/node.cpp line 2338:

> 2336:     // +1 for leading digit, maybe +1 for "o"
> 2337:     uint width = log10(_idx) + 1 + (is_new ? 0 : 1);
> 2338:     while(max_width > width) {

Suggestion:

    while (max_width > width) {

src/hotspot/share/opto/node.cpp line 2343:

> 2341:     }
> 2342:   }
> 2343:   if(!is_new) {

Suggestion:

  if (!is_new) {

src/hotspot/share/opto/node.cpp line 2353:

> 2351:
> 2352: // -----------------------------dump_name--------------------------------------
> 2353: void Node::dump_name(outputStream *st, DumpConfig* dc) const {

Suggestion:

void Node::dump_name(outputStream* st, DumpConfig* dc) const {

-------------

Marked as reviewed by chagedorn (Reviewer).
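The snippet quoted at node.cpp lines 2336 to 2338 pads each node idx to a common width: one column per decimal digit, plus one column for the `o` prefix of an old node. A rough Java sketch of that idea follows. The class and method names are invented for illustration, and the sketch counts digits by integer division instead of calling `log10`, which is a substitution made here and not what the patch does.

```java
// Hypothetical sketch of idx alignment; names are illustrative, not HotSpot's.
class IdxAlign {
    // Printed width of a non-negative idx: its decimal digits,
    // plus one column for the "o" prefix when the node is an old node.
    static int width(int idx, boolean isNew) {
        int digits = 1;
        for (int v = idx; v >= 10; v /= 10) {
            digits++;
        }
        return digits + (isNew ? 0 : 1);
    }

    // Left-pad with spaces up to maxWidth columns, then print "o" (if old) and the idx.
    static String format(int idx, boolean isNew, int maxWidth) {
        StringBuilder sb = new StringBuilder();
        for (int w = width(idx, isNew); w < maxWidth; w++) {
            sb.append(' ');
        }
        if (!isNew) {
            sb.append('o');
        }
        return sb.append(idx).toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + format(741, true, 6) + "]");
        System.out.println("[" + format(1871, false, 6) + "]");
    }
}
```

Counting digits with integers keeps the sketch well defined for idx 0, where a logarithm has no finite value.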
PR: https://git.openjdk.java.net/jdk/pull/8468

From rkennke at openjdk.java.net Thu Jun 2 14:42:32 2022
From: rkennke at openjdk.java.net (Roman Kennke)
Date: Thu, 2 Jun 2022 14:42:32 GMT
Subject: RFR: 8287227: Shenandoah: A couple of virtual thread tests failed with iu mode even without Loom enabled.
In-Reply-To:
References:
Message-ID:

On Tue, 31 May 2022 14:46:58 GMT, Roland Westrelin wrote:

> With JDK-8277654, the load barrier slow path call doesn't produce raw
> memory anymore but the IU barrier call still does. I propose removing
> raw memory for that call too which also causes the assert that fails
> to be removed.

Is it correct, though? I seem to remember that without the memory edges, we may get reordering of the 'SATB' buffer and index accesses between IU-barriers, which would cause trouble?

-------------

PR: https://git.openjdk.java.net/jdk/pull/8958

From epeter at openjdk.java.net Thu Jun 2 14:48:36 2022
From: epeter at openjdk.java.net (Emanuel Peter)
Date: Thu, 2 Jun 2022 14:48:36 GMT
Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v28]
In-Reply-To:
References:
Message-ID: <6fvXf0Lpbpbs0fQXNQLimROBnrAUIfJrUoHv3Fd7AkE=.0375bdec-8c24-4f30-9ccc-17095ed973ec@github.com>

> **What this gives you for the debugger**
> - BFS traversal (inputs / outputs)
> - node filtering by category
> - shortest path between nodes
> - all paths between nodes
> - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional)
> - and more
>
> **Some use cases**
> - more readable `dump`
> - follow only nodes of some categories (only control, only data, etc)
> - find which control nodes depend on data node (visit data nodes, include control in boundary)
> - how two nodes relate (shortest / all paths, following input/output nodes, or both)
> - find loops (control / memory / data: call all paths with node as start and target)
>
> **Description**
> I implemented VM support for BFS
traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`). To find all paths between two nodes, include the letter `A` in the options string. > > `void Node::dump_bfs(const int max_distance, Node* target, char const* options)` > > To get familiar with the many options, run this to get help: > `find_node(0)->dump_bfs(0,0,"h")` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > PS: I do plan to refactor the `dump` code in `node.cpp` to use my new infrastructure. I will also remove `Node::related` and `dump_related,` since it has not been properly extended and maintained. But that refactoring would risk messing with tools that depend on `dump`, which I would like to avoid for now, and do that in a second step. > > **Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. > 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`. > 3. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 4. Separate visit / boundary filters by node type: traverse graph visiting only some node types (eg. data). 
On the boundary, also display but do not traverse nodes allowed by boundary filter (eg. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for visit filter, and `DCMXO` for boundary filter. > 5. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors. > 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. Use `@` in options string. > 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string. > 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string! 
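The visit/boundary split from item 4 can be pictured as two sets of category letters read from the options string: lowercase letters mark categories that are traversed, uppercase letters mark categories that are displayed but not expanded. Below is a minimal Java sketch of that reading. It assumes a node's category has already been reduced to one of the letters `c`, `d`, `m`, `x`, `o`; the class and method names are hypothetical, and this is not the parser used in `node.cpp`. In particular, treating a category that is both visited and on the boundary as simply visited is an assumption of the sketch.

```java
// Hypothetical options-string filter; not the parser from node.cpp.
class BfsFilter {
    private static final String CATEGORIES = "cdmxo";
    private final String options;

    BfsFilter(String options) {
        this.options = options;
    }

    // A category is traversed if its lowercase letter appears in the options.
    boolean visits(char category) {
        return CATEGORIES.indexOf(category) >= 0 && options.indexOf(category) >= 0;
    }

    // A category is shown but not expanded if only its uppercase letter appears.
    boolean onBoundary(char category) {
        return CATEGORIES.indexOf(category) >= 0
            && !visits(category)
            && options.indexOf(Character.toUpperCase(category)) >= 0;
    }

    // '+' follows input edges, '-' follows output edges.
    boolean followsInputs() {
        return options.indexOf('+') >= 0;
    }

    boolean followsOutputs() {
        return options.indexOf('-') >= 0;
    }

    public static void main(String[] args) {
        BfsFilter f = new BfsFilter("dCDMOX-");
        System.out.println("visit data: " + f.visits('d'));
        System.out.println("control on boundary: " + f.onBoundary('c'));
        System.out.println("outputs: " + f.followsOutputs());
    }
}
```

With the options string `dCDMOX-` used in one of the examples in this message, the sketch traverses data nodes along output edges and stops at nodes of every other category.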
> > Example (BFS inputs): > > (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") > d dump > --------------------------------------------- > 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > > > Example (BFS control inputs): > > (rr) p find_node(163)->dump_bfs(5,0,"c+") > d dump > --------------------------------------------- > 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 4 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 0 163 OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 > > We see the control flow of a strip mined loop. 
> > > Experiment (BFS only data, but display all nodes on boundary) > > (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") > d dump > --------------------------------------------- > 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) > 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] > > We see the dependent output nodes of the data-phi 102, we see that a SafePoint and the Return depend on it. Here colors are really helpful, as it makes it easy to separate the data-nodes (blue) from the boundary-nodes (other colors). 
>
> Example with Mach nodes:
>
>     (rr) p find_node(112)->dump_bfs(2,0,"cdmxo+#@B")
>     d [head idom d] old dump
>     ---------------------------------------------
>     2 534 505 6 o1871 109 addI_rReg_imm === _ 44 [[ 110 102 113 230 327 ]] #-3/0xfffffffd
>     2 536 537 15 o186 139 addI_rReg_imm === _ 137 [[ 140 137 113 144 ]] #4/0x00000004 !jvms: StringLatin1::replace @ bci:13 (line 303)
>     2 537 538 14 o179 114 IfTrue === 115 [[ 536 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304)
>     1 536 537 15 o739 113 compI_rReg === _ 139 109 [[ 112 ]]
>     1 536 537 15 _ 536 Region === 536 114 [[ 536 112 ]]
>     0 536 537 15 o741 112 jmpLoopEnd === 536 113 [[ 134 111 ]] P=0.993611, C=7200.000000 !jvms: StringLatin1::replace @ bci:19 (line 303)
>
> And the query on the old nodes:
>
>     (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#")
>     d dump
>     ---------------------------------------------
>     2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]]
>     2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]]
>     2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000
>     1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]]
>     1 o740 Bool === _ o739 [[ o741 ]] [lt]
>     1 o179 IfTrue === o178 [[ o741 ]] #1
>     0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000
>
>
> **Exploring loop body**
> When I find myself in a loop, I try to locate the loop head and end, and map out at least one path between them.
> `loop_end->dump_bfs(20, loop_head, "c+")`
> This provides us with a shortest control path, provided this path has a distance of at most 20.
>
> Example (shortest path over control nodes):
>
>     (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+")
>     d dump
>     ---------------------------------------------
>     5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304)
>     4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304)
>     3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304)
>     2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304)
>     1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304)
>     0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303)
>
>
> Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lie on a path between the start and target node, with at most the specified path length.
> > Example (all paths between two nodes): > > (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") > d apd dump > --------------------------------------------- > 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) > 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) > 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) > 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) > 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) > 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all paths distance `apd` is the sum of the shortest path from the current node to the start and the shortest path to the target node. Thus, we can easily compute the distance to the target node with `apd - d`.
>
> An alternative way to detect loops quickly is to run an all paths query from a node to itself:
>
> Example (loop detection with all paths):
>
>     (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A")
>     d apd dump
>     ---------------------------------------------
>     6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303)
>     5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304)
>     4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304)
>     3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304)
>     2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304)
>     1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304)
>     0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303)
>
> We get the loop control, plus the loop-back `190 IfTrue`.
>
> Example (loop detection with all paths for phi):
>
>     (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A")
>     d apd dump
>     ---------------------------------------------
>     1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],...
!jvms: StringLatin1::replace @ bci:13 (line 303) > 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > > > **Color examples** > Colors are especially useful to see chains between nodes (options character `#`). > The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. > Tip: it can be worth it to configure the colors of your terminal to be more appealing. > > Example (find control dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) > We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. > > Example (find memory dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) > > Example (loop detection): > ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) > We find the control and some data loop paths. 
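Editorial sketch, not part of the patch: the relation between `d` and `apd` described in the quoted text (`apd` = shortest distance from the start plus shortest distance to the target, so the distance to the target is `apd - d`) can be illustrated with two plain breadth-first searches. The class `AllPathsSketch`, its methods, and the hand-copied toy graph are hypothetical names made up for this illustration; the code does not use or mirror the actual `dump_bfs` implementation in `node.cpp`. It only assumes that edges point from a node to its inputs, and the toy graph contains just the control inputs visible in the loop-detection dump above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the HotSpot implementation.
public class AllPathsSketch {

    // Plain BFS: shortest edge count from `from` to every reachable node.
    static Map<Integer, Integer> bfs(Map<Integer, List<Integer>> edges, int from) {
        Map<Integer, Integer> dist = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        dist.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            int n = queue.poll();
            for (int m : edges.getOrDefault(n, List.of())) {
                if (!dist.containsKey(m)) {
                    dist.put(m, dist.get(n) + 1);
                    queue.add(m);
                }
            }
        }
        return dist;
    }

    // Flip every edge, so a BFS from the target yields "distance to target".
    static Map<Integer, List<Integer>> reverse(Map<Integer, List<Integer>> edges) {
        Map<Integer, List<Integer>> rev = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> e : edges.entrySet()) {
            for (int in : e.getValue()) {
                rev.computeIfAbsent(in, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return rev;
    }

    // For each node within `max`: {d, apd}, where apd = d + distance to target.
    static Map<Integer, int[]> allPaths(Map<Integer, List<Integer>> inputs,
                                        int start, int target, int max) {
        Map<Integer, Integer> d = bfs(inputs, start);
        Map<Integer, Integer> toTarget = bfs(reverse(inputs), target);
        Map<Integer, int[]> result = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : d.entrySet()) {
            Integer t = toTarget.get(e.getKey());
            if (t != null && e.getValue() + t <= max) {
                result.put(e.getKey(), new int[] {e.getValue(), e.getValue() + t});
            }
        }
        return result;
    }

    // Toy graph: node idx -> control inputs, hand-copied from the dump above.
    static Map<Integer, List<Integer>> toyControlGraph() {
        Map<Integer, List<Integer>> g = new HashMap<>();
        g.put(741, List.of(179));
        g.put(179, List.of(178));
        g.put(178, List.of(149));
        g.put(149, List.of(148));
        g.put(148, List.of(746));
        g.put(746, List.of(746, 745, 190));
        g.put(190, List.of(741));
        return g;
    }

    public static void main(String[] args) {
        // Same query shape as dump_bfs(7, find_node(741), "c+A"): node to itself.
        allPaths(toyControlGraph(), 741, 741, 7).forEach((idx, v) ->
            System.out.println(v[0] + " " + v[1] + " " + idx));
    }
}
```

On this toy graph the sketch gives `d = 6`, `apd = 7` for `190 IfTrue`, so its distance to the target is `apd - d = 1`, in line with the quoted dump. Node 745 is reached from the start but has no path back to the target, so it is filtered out.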
Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - missing style thing from last commit - another one of Christian's reviews ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/6cfe0e1e..63e25056 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=27 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=26-27 Stats: 19 lines in 3 files changed: 0 ins; 0 del; 19 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From epeter at openjdk.java.net Thu Jun 2 14:48:44 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Thu, 2 Jun 2022 14:48:44 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v27] In-Reply-To: References: Message-ID: On Thu, 2 Jun 2022 14:02:53 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> write out chained assignment > > src/hotspot/share/opto/callnode.hpp line 659: > >> 657: >> 658: #ifndef PRODUCT >> 659: virtual void dump_req(outputStream *st = tty, DumpConfig* dc = nullptr) const; > > `*` should be left at the type: > Suggestion: > > virtual void dump_req(outputStream* st = tty, DumpConfig* dc = nullptr) const; done > src/hotspot/share/opto/node.cpp line 1770: > >> 1768: Node* old_node(Node* n); // mach node -> prior IR node >> 1769: void print_node_idx(Node* n); // to tty >> 1770: void print_node_block(Node* n); // to tty: head idx, _idom, _dom_depth > > Can all be made `static`. 
done > src/hotspot/share/opto/node.cpp line 2327: > >> 2325: >> 2326: // -----------------------------dump_idx--------------------------------------- >> 2327: void Node::dump_idx(bool align, outputStream *st, DumpConfig* dc) const { > > Suggestion: > > void Node::dump_idx(bool align, outputStream* st, DumpConfig* dc) const { done > src/hotspot/share/opto/node.cpp line 2333: > >> 2331: Compile* C = Compile::current(); >> 2332: bool is_new = C->node_arena()->contains(this); >> 2333: if(align) { // print prefix empty spaces$ > > Suggestion: > > if (align) { // print prefix empty spaces$ done > src/hotspot/share/opto/node.cpp line 2338: > >> 2336: // +1 for leading digit, maybe +1 for "o" >> 2337: uint width = log10(_idx) + 1 + (is_new ? 0 : 1); >> 2338: while(max_width > width) { > > Suggestion: > > while (max_width > width) { done > src/hotspot/share/opto/node.cpp line 2343: > >> 2341: } >> 2342: } >> 2343: if(!is_new) { > > Suggestion: > > if (!is_new) { done > src/hotspot/share/opto/node.cpp line 2353: > >> 2351: >> 2352: // -----------------------------dump_name-------------------------------------- >> 2353: void Node::dump_name(outputStream *st, DumpConfig* dc) const { > > Suggestion: > > void Node::dump_name(outputStream* st, DumpConfig* dc) const { done ------------- PR: https://git.openjdk.java.net/jdk/pull/8468 From kvn at openjdk.java.net Thu Jun 2 15:42:13 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 15:42:13 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v3] In-Reply-To: References: Message-ID: <1KZNZmrZNxn_YDSTrosuyO7KjL2GaJEi8uiOO1ZZZCo=.7fc2d5a3-31a7-4e30-b278-5e705862c001@github.com> On Thu, 2 Jun 2022 04:34:27 GMT, Sandhya Viswanathan wrote: >> We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2. >> The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%. 
>> The performance regression is due to auto-vectorization of small loops. >> We don't have AVX3Threshold consideration in auto-vectorization. >> The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors. >> >> This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line. >> >> Please review. >> >> Best Regard, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Change SuperWordMaxVectorSize to develop option src/hotspot/cpu/x86/vm_version_x86.cpp line 902: > 900: if (_stepping < 5) { > 901: FLAG_SET_DEFAULT(UseAVX, 2); > 902: } What is this change for? src/hotspot/cpu/x86/vm_version_x86.cpp line 1303: > 1301: if (FLAG_IS_DEFAULT(SuperWordMaxVectorSize)) { > 1302: if (FLAG_IS_DEFAULT(UseAVX) && UseAVX > 2 && > 1303: is_intel_skylake() && _stepping > 5) { Should you check `_stepping >= 5`? Otherwise `_stepping == 5` is missing in all adjustments. ------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From roland at openjdk.java.net Thu Jun 2 15:50:55 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Thu, 2 Jun 2022 15:50:55 GMT Subject: RFR: 8286625: C2 fails with assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect Message-ID: <-ZfbcgBcRabQqlggf35uK2HTi-1MSnCCZBV1qwRrT8E=.ae2012fd-6e54-4a16-9070-85f08b74beb6@github.com> It's another case where because of overunrolling, the main loop is never executed but not optimized out and the type of some CastII/ConvI2L for a range check conflicts with the type of its input resulting in a broken graph for the main loop. This is supposed to have been solved by skeleton predicates. There's indeed a predicate that should catch that the loop is unreachable but it doesn't constant fold.
The shape of the predicate is: (CmpUL (SubL 15 (ConvI2L (AddI (CastII int:>=1) 15) minint..maxint)) 16) I propose adding a CastII, that is in this case: (CmpUL (SubL 15 (ConvI2L (CastII (AddI (CastII int:>=1) 15) 0..max-1) minint..maxint)) 16) The justification for the CastII is that the skeleton predicate is a predicate for a specific iteration of the loop. That iteration of the loop must be in the range of the iv Phi. With the extra CastII, the AddI can be pushed through the CastII and ConvI2L and the check constant folds. Actually, with the extra CastII, the predicate is not implemented with a CmpUL but a CmpU because the code can tell there's no risk of overflow (I did force the use of CmpUL as an experiment and the CmpUL does constant fold) ------------- Commit messages: - test & fix Changes: https://git.openjdk.java.net/jdk/pull/8996/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8996&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286625 Stats: 63 lines in 3 files changed: 62 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8996.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8996/head:pull/8996 PR: https://git.openjdk.java.net/jdk/pull/8996 From mdoerr at openjdk.java.net Thu Jun 2 16:15:40 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Thu, 2 Jun 2022 16:15:40 GMT Subject: RFR: 8287738: [PPC64] jdk/incubator/vector/*VectorTests failing Message-ID: `PopCountVI` needs to handle several types after [JDK-8284960](https://bugs.openjdk.java.net/browse/JDK-8284960). (See https://github.com/openjdk/jdk/commit/6f6486e97743eadfb20b4175e1b4b2b05b59a17a.) 
------------- Commit messages: - 8287738: [PPC64] jdk/incubator/vector/*VectorTests failing Changes: https://git.openjdk.java.net/jdk/pull/8998/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8998&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287738 Stats: 29 lines in 3 files changed: 25 ins; 1 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8998.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8998/head:pull/8998 PR: https://git.openjdk.java.net/jdk/pull/8998 From kvn at openjdk.java.net Thu Jun 2 16:59:34 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 16:59:34 GMT Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 In-Reply-To: References: Message-ID: <-OVRPLejI69eYbfH0DFnp4P0mMhTU9d1QoRCl_arkDk=.437fb8c5-bbde-4e56-81f9-5f008a621037@github.com> On Tue, 31 May 2022 23:02:18 GMT, Sandhya Viswanathan wrote: > Fixed the assertion in load_iota_indices when the length passed is less than 4. > Also fixed the missing break in x86.ad match_rule_supported_vector() for PopulateIndex case. Looks good. Waiting new test. ------------- PR: https://git.openjdk.java.net/jdk/pull/8961 From kvn at openjdk.java.net Thu Jun 2 17:01:19 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 17:01:19 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v3] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <3_-2N1Kf4WIryx7eFIrXomabZJTeVNvSJ10joWdzN4s=.a16c8b8e-0834-48f8-9eac-6aaf07822ad5@github.com> Message-ID: On Thu, 2 Jun 2022 14:04:51 GMT, Fei Gao wrote: > . When we set `MaxVectorSize=16`, the case here would fail. All jtreg tests passed except that one. Do you mean it fail without #8961 or it fail always? 
------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From kvn at openjdk.java.net Thu Jun 2 17:04:24 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 17:04:24 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v2] In-Reply-To: <2ZEEJJQuJDrG1UuL6IOMr5nvCm1DCs2PLPp4y0Dpqag=.0d3588d9-6f49-43c6-bfbf-62cdb239450f@github.com> References: <54Cx68cjFE-RfvwVJB92DhENPyRIwzhi3jfyG5ZGPSg=.563519e8-7880-4754-933c-78d66affabef@github.com> <2ZEEJJQuJDrG1UuL6IOMr5nvCm1DCs2PLPp4y0Dpqag=.0d3588d9-6f49-43c6-bfbf-62cdb239450f@github.com> Message-ID: On Thu, 2 Jun 2022 04:37:58 GMT, Sandhya Viswanathan wrote: >> I think we missed the test with setting `MaxVectorSize` to 32 (vs 64) on Cascade Lake CPU. We should do that. >> >> That may be preferable "simple fix" vs suggested changes for "short term solution". >> >> The objection was that user may still want to use wide 64 bytes vectors for Vector API. But I agree with Jatin argument about that. >> Limiting `MaxVectorSize` **will** affect our intrinsics/stubs code and may affect performance. That is why we need to test it. I will ask Eric. >> >> BTW, `SuperWordMaxVectorSize` should be diagnostic or experimental since it is temporary solution. > > @vnkozlov I have made SuperWordMaxVectorSize as a develop option as you suggested. As far as I know, the only intrinsics/stubs that uses MaxVectorSize are for clear/copy. This is done in conjunction with AVX3Threshold so we are ok there for Cascade Lake. > Hi @sviswa7 , #7806 implemented an interface for auto-vectorization to disable some unprofitable cases on aarch64. Can it also be applied to your case? Maybe. But it would require more careful changes. And that changeset is not integrated yet. Current changes are clean and serve their purpose good. And, as Jatin and Sandhya said, we may do proper fix after JDK 19 fork. Then we can look on your proposal. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From duke at openjdk.java.net Thu Jun 2 17:10:23 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Thu, 2 Jun 2022 17:10:23 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v14] In-Reply-To: References: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> Message-ID: On Thu, 2 Jun 2022 01:33:25 GMT, Srinivas Vamsi Parasa wrote: > compiler/c2/irTests/TestScheduleSmallMethod.java failed in Tier1 (case #1 is run with "-XX:-OptoScheduling"): > > ``` > compiler.lib.ir_framework.shared.TestRunException: The following scenarios have failed: #1. Please check stderr for more information. > at compiler.lib.ir_framework.TestFramework.reportScenarioFailures(TestFramework.java:617) > at compiler.lib.ir_framework.TestFramework.startWithScenarios(TestFramework.java:578) > at compiler.lib.ir_framework.TestFramework.start(TestFramework.java:335) > at compiler.c2.irTests.TestScheduleSmallMethod.main(TestScheduleSmallMethod.java:44) > ``` > > The test use Doube arithmetic. Hi Vladimir, wanted to check with you if this test is still a problem and needs to be fixed? ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From kvn at openjdk.java.net Thu Jun 2 17:19:35 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 17:19:35 GMT Subject: RFR: 8287425: Remove unnecessary register push for MacroAssembler::check_klass_subtype_slow_path In-Reply-To: References: Message-ID: On Fri, 27 May 2022 09:10:48 GMT, Xiaolin Zheng wrote: > Hi team, > > ![AE98A8E7-9F6F-4722-B310-299A9A96A957](https://user-images.githubusercontent.com/38156692/170670906-2ce37a13-af21-4cf8-acbd-ca24528bc3a9.png) > > Some perf results show unnecessary pushes in `MacroAssembler::check_klass_subtype_slow_path()` under `UseCompressedOops`. 
History logs show the original code is like [1], and it gets refactored in [JDK-6813212](https://bugs.openjdk.java.net/browse/JDK-6813212), and the counterparts of the `UseCompressedOops` in the diff are at [2] and [3], and we could see the push of rax is just because `encode_heap_oop_not_null()` would kill it, so here needs a push and restore. After that, [JDK-6964458](https://bugs.openjdk.java.net/browse/JDK-6964458) (removal of perm gen) at [4] removed [3] so that there is no need to do UseCompressedOops work in `MacroAssembler::check_klass_subtype_slow_path()`; but in that patch [2] didn't get removed, so we finally come here. As a result, [2] could also be safely removed. > > (Files in [4] are folded because the patch is too large. We could manually unfold `hotspot/src/cpu/x86/vm/assembler_x86.cpp` to see that diff) > > I was wondering if this minor change could be sponsored? > > This enhancement is raised on behalf of Wei Kuai . > > Tested x86_64 hotspot tier1~tier4 twice, aarch64 hotspot tier1~tier4 once with another jdk tier1 once, and riscv64 hotspot tier1~tier4 once. > > Thanks, > Xiaolin > > [1] https://github.com/openjdk/jdk/blob/de67e5294982ce197f2abd051cbb1c8aa6c29499/hotspot/src/cpu/x86/vm/interp_masm_x86_64.cpp#L273-L284 > [2] https://github.com/openjdk/jdk/commit/b8dbe8d8f650124b61a4ce8b70286b5b444a3316#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3R7441-R7444 > [3] https://github.com/openjdk/jdk/commit/b8dbe8d8f650124b61a4ce8b70286b5b444a3316#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3R7466-R7477 > [4] https://github.com/openjdk/jdk/commit/5c58d27aac7b291b879a7a3ff6f39fca25619103#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3L9347-L9361 Let me test it first. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8915 From duke at openjdk.java.net Thu Jun 2 17:23:56 2022 From: duke at openjdk.java.net (Swati Sharma) Date: Thu, 2 Jun 2022 17:23:56 GMT Subject: RFR: 8287525: Extend IR annotation with new options to test specific target feature. Message-ID: Hi All, Currently test invocations are guarded by @requires vm.cpu.feature tags which are specified as the part of test tag specifications. This results into generating multiple test cases if some test points in a test file needs to be guarded by a specific features while others should still be executed in absence of missing target feature. This is specially important for IR checks based validation since C2 IR nodes creation may heavily rely on existence of specific target feature. Also, test harness executes test points only if all the constraints specified in tag specifications are met, thus imposing an OR semantics b/w @requires tag based CPU features becomes tricky. Patch extends existing @IR annotation with following two new options:- - applyIfTargetFeatureAnd: Accepts a list of feature pairs where each pair is composed of target feature string followed by a true/false value where a true value necessities existence of target feature and vice-versa. IR verifications checks are enforced only if all the specified feature constraints are met. - applyIfTargetFeatureOr: Accepts similar arguments as above option but IR verifications checks are enforced only when at least one of the specified feature constraints are met. Example usage: @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureOr = {"avx512bw", "true", "avx512f", "true"}) @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureAnd = {"avx512bw", "true", "avx512f", "true"}) Please review and share your feedback. Thanks, Swati ------------- Commit messages: - 8287525: Extend IR annotation with new options to test specific target feature. 
Changes: https://git.openjdk.java.net/jdk/pull/8999/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8999&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287525 Stats: 184 lines in 4 files changed: 182 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8999.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8999/head:pull/8999 PR: https://git.openjdk.java.net/jdk/pull/8999 From kvn at openjdk.java.net Thu Jun 2 17:24:34 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 17:24:34 GMT Subject: RFR: 8285868: x86 intrinsics for floating point method isInfinite [v14] In-Reply-To: References: <7a7UIHrziQ4Gt-1X-peOYHw7Wx08A5eGTEOovI7Q1t0=.ff367d60-9a49-4fa1-ae24-33e24bae76b6@github.com> Message-ID: <106Ty9KpPU_6Yj3iuTXcY0hVpfigHdsINz9kuwxoD8U=.3ac83729-c88f-48a1-9b51-45b2d7cb9e39@github.com> On Thu, 2 Jun 2022 17:06:58 GMT, Srinivas Vamsi Parasa wrote: > Hi Vladimir, wanted to check with you if this test is still a problem and needs to be fixed? No. You can push since you have 2 approvals. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From kvn at openjdk.java.net Thu Jun 2 17:27:35 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 17:27:35 GMT Subject: RFR: 8287738: [PPC64] jdk/incubator/vector/*VectorTests failing In-Reply-To: References: Message-ID: On Thu, 2 Jun 2022 16:08:11 GMT, Martin Doerr wrote: > `PopCountVI` needs to handle several types after [JDK-8284960](https://bugs.openjdk.java.net/browse/JDK-8284960). > (See https://github.com/openjdk/jdk/commit/6f6486e97743eadfb20b4175e1b4b2b05b59a17a.) I am not expert of PPC64 but changes look good to me. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8998 From sviswanathan at openjdk.java.net Thu Jun 2 17:32:28 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 2 Jun 2022 17:32:28 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v3] In-Reply-To: <1KZNZmrZNxn_YDSTrosuyO7KjL2GaJEi8uiOO1ZZZCo=.7fc2d5a3-31a7-4e30-b278-5e705862c001@github.com> References: <1KZNZmrZNxn_YDSTrosuyO7KjL2GaJEi8uiOO1ZZZCo=.7fc2d5a3-31a7-4e30-b278-5e705862c001@github.com> Message-ID: On Thu, 2 Jun 2022 15:37:16 GMT, Vladimir Kozlov wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> Change SuperWordMaxVectorSize to develop option > > src/hotspot/cpu/x86/vm_version_x86.cpp line 902: > >> 900: if (_stepping < 5) { >> 901: FLAG_SET_DEFAULT(UseAVX, 2); >> 902: } > > What is this change for? I had some changes in this area before. This is an artifact of that. I will set it back to exactly as it was. ------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From duke at openjdk.java.net Thu Jun 2 17:46:41 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Thu, 2 Jun 2022 17:46:41 GMT Subject: Integrated: 8285868: x86 intrinsics for floating point method isInfinite In-Reply-To: References: Message-ID: On Thu, 28 Apr 2022 23:02:47 GMT, Srinivas Vamsi Parasa wrote: > We develop optimized x86 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `IsInfinite()` for Float and Double classes. JMH benchmarks show upto `~70% `improvement using` vfpclasss(s/d)` instructions. 
> > > Benchmark (ns/op) Baseline Intrinsic(vfpclasss/d) Speedup(%) > FloatClassCheck.testIsFinite 0.562 0.406 28% > FloatClassCheck.testIsInfinite 0.815 0.383 53% > FloatClassCheck.testIsNaN 0.63 0.382 39% > DoubleClassCheck.testIsFinite 0.565 0.409 28% > DoubleClassCheck.testIsInfinite 0.812 0.375 54% > DoubleClassCheck.testIsNaN 0.631 0.38 40% > FPComparison.isFiniteDouble 332.638 272.577 18% > FPComparison.isFiniteFloat 413.217 331.825 20% > FPComparison.isInfiniteDouble 874.897 240.632 72% > FPComparison.isInfiniteFloat 872.279 321.269 63% > FPComparison.isNanDouble 286.566 240.36 16% > FPComparison.isNanFloat 346.123 316.923 8% This pull request has now been integrated. Changeset: 7f44f572 Author: vamsi-parasa Committer: Jatin Bhateja URL: https://git.openjdk.java.net/jdk/commit/7f44f572ea451a1f38b446a6ef64ffb27e3eb3fe Stats: 513 lines in 18 files changed: 513 ins; 0 del; 0 mod 8285868: x86 intrinsics for floating point method isInfinite Reviewed-by: kvn, jbhateja ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From sviswanathan at openjdk.java.net Thu Jun 2 17:49:04 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 2 Jun 2022 17:49:04 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v4] In-Reply-To: References: Message-ID: > We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2. > The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%. > The performance regression is due to auto-vectorization of small loops. > We don't have AVX3Threshold consideration in auto-vectorization. > The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors. > > This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line. > > Please review.
> > Best Regard, > Sandhya Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: Review comment resolution ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8877/files - new: https://git.openjdk.java.net/jdk/pull/8877/files/e8ea837a..42085160 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8877&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8877&range=02-03 Stats: 8 lines in 2 files changed: 0 ins; 3 del; 5 mod Patch: https://git.openjdk.java.net/jdk/pull/8877.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8877/head:pull/8877 PR: https://git.openjdk.java.net/jdk/pull/8877 From sviswanathan at openjdk.java.net Thu Jun 2 17:49:04 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 2 Jun 2022 17:49:04 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v2] In-Reply-To: References: <54Cx68cjFE-RfvwVJB92DhENPyRIwzhi3jfyG5ZGPSg=.563519e8-7880-4754-933c-78d66affabef@github.com> <2ZEEJJQuJDrG1UuL6IOMr5nvCm1DCs2PLPp4y0Dpqag=.0d3588d9-6f49-43c6-bfbf-62cdb239450f@github.com> Message-ID: On Thu, 2 Jun 2022 17:01:26 GMT, Vladimir Kozlov wrote: >> @vnkozlov I have made SuperWordMaxVectorSize as a develop option as you suggested. As far as I know, the only intrinsics/stubs that uses MaxVectorSize are for clear/copy. This is done in conjunction with AVX3Threshold so we are ok there for Cascade Lake. > >> Hi @sviswa7 , #7806 implemented an interface for auto-vectorization to disable some unprofitable cases on aarch64. Can it also be applied to your case? > > Maybe. But it would require more careful changes. And that changeset is not integrated yet. > Current changes are clean and serve their purpose good. > > And, as Jatin and Sandhya said, we may do proper fix after JDK 19 fork. Then we can look on your proposal. @vnkozlov @jatin-bhateja Your review comments are implemented. Please take a look. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From kvn at openjdk.java.net Thu Jun 2 18:03:38 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 18:03:38 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v4] In-Reply-To: References: Message-ID: On Thu, 2 Jun 2022 17:49:04 GMT, Sandhya Viswanathan wrote: >> We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2. >> The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%. >> The performance regression is due to auto-vectorization of small loops. >> We don't have AVX3Threshold consideration in auto-vectorization. >> The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors. >> >> This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line. >> >> Please review. >> >> Best Regard, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Review comment resolution Looks good. Please wait until regression and performance testing are finished. I will let you know results. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8877 From xliu at openjdk.java.net Thu Jun 2 18:17:25 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 2 Jun 2022 18:17:25 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v9] In-Reply-To: References: Message-ID: > I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps.
In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op > ``` > > Testing > I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. Xin Liu has updated the pull request incrementally with one additional commit since the last revision: move preprocess() after remove Useless.
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8545/files - new: https://git.openjdk.java.net/jdk/pull/8545/files/d7e5f062..f6771d69 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=08 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=07-08 Stats: 24 lines in 4 files changed: 10 ins; 6 del; 8 mod Patch: https://git.openjdk.java.net/jdk/pull/8545.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8545/head:pull/8545 PR: https://git.openjdk.java.net/jdk/pull/8545 From kvn at openjdk.java.net Thu Jun 2 18:20:25 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 18:20:25 GMT Subject: RFR: 8287525: Extend IR annotation with new options to test specific target feature. In-Reply-To: References: Message-ID: <1Bnxf2zf9ay2IdCI8_HbfrYuN2q3zBmDjrpgutE4xyM=.7fbd526f-e729-4305-8db2-ca7aef2a9169@github.com> On Thu, 2 Jun 2022 17:17:21 GMT, Swati Sharma wrote: > Hi All, > > Currently test invocations are guarded by @requires vm.cpu.feature tags which are specified as the part of test tag specifications. This results into generating multiple test cases if some test points in a test file needs to be guarded by a specific features while others should still be executed in absence of missing target feature. > > This is specially important for IR checks based validation since C2 IR nodes creation may heavily rely on existence of specific target feature. Also, test harness executes test points only if all the constraints specified in tag specifications are met, thus imposing an OR semantics b/w @requires tag based CPU features becomes tricky. > > Patch extends existing @IR annotation with following two new options:- > > - applyIfTargetFeatureAnd: > Accepts a list of feature pairs where each pair is composed of target feature string followed by a true/false value where a true value necessities existence of target feature and vice-versa. 
IR verifications checks are enforced only if all the specified feature constraints are met. > - applyIfTargetFeatureOr: Accepts similar arguments as above option but IR verifications checks are enforced only when at least one of the specified feature constraints are met. > > Example usage: > @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureOr = {"avx512bw", "true", "avx512f", "true"}) > @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureAnd = {"avx512bw", "true", "avx512f", "true"}) > > Please review and share your feedback. > > Thanks, > Swati test/hotspot/jtreg/compiler/lib/ir_framework/IR.java line 106: > 104: * IR verifications checks are enforced only if all the specified feature constraints are met. > 105: */ > 106: String[] applyIfTargetFeatureAnd() default {}; Why you used `Target` instead of original `CPU` (you check output of `getCPUFeatures()` only)? Do you plan to extend this to check flags too in a future? test/hotspot/jtreg/compiler/lib/ir_framework/IR.java line 110: > 108: /** > 109: * Accepts a list of feature pairs where each pair is composed of target feature string followed by a true/false > 110: * value where a true value necessities existence of target feature and vice-versa. I don't think you need to repeat the same 2 lines. ------------- PR: https://git.openjdk.java.net/jdk/pull/8999 From xliu at openjdk.java.net Thu Jun 2 18:22:34 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 2 Jun 2022 18:22:34 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v7] In-Reply-To: References: Message-ID: On Wed, 1 Jun 2022 19:39:14 GMT, Vladimir Kozlov wrote: >> Xin Liu has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove useless flag. if jdwp is on, liveness_at_bci() marks all local >> variables live. 
> > src/hotspot/share/opto/compile.cpp line 1864: > >> 1862: } >> 1863: >> 1864: void Compile::invalidate_unstable_if(CallStaticJavaNode* unc) { > > Add description comment for this method. What it is used for. updated it. also rename it to remove_unstable_if(). ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From xliu at openjdk.java.net Thu Jun 2 18:22:36 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 2 Jun 2022 18:22:36 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v7] In-Reply-To: References: Message-ID: On Wed, 1 Jun 2022 21:00:15 GMT, Vladimir Kozlov wrote: >> Yes, this is for 8287385. I will remove preprocessing logic from this PR. >> >> Because we can determine a unstable_if trap is trivial after parsing. My idea is to do the leftover parsing job in this preprocess function. That's why I think preprocess should before `PhaseRemoveUseless`. > >> Yes, this is for 8287385. I will remove preprocessing logic from this PR. > > Okay. > >> Because we can determine a unstable_if trap is trivial after parsing. My idea is to do the leftover parsing job in this preprocess function. That's why I think preprocess should before `PhaseRemoveUseless`. > > Then you should consider my proposal about cleaning the list and update counters in `Compile::remove_useless_nodes()`. Let discuss it in next changes. I move preprocess after PhaseRemoveUseless phase. I think it's reasonable to count only useful nodes. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From kvn at openjdk.java.net Thu Jun 2 18:31:35 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 18:31:35 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v9] In-Reply-To: References: Message-ID: <6_yCWyxR5Arhkq8CK_34O4LsJheUq-dPt0EI7PKsJ1M=.314a24ca-3334-40b8-a5c1-be4b7068f9cb@github.com> On Thu, 2 Jun 2022 18:17:25 GMT, Xin Liu wrote: >> I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. >> >> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. >> >> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. >> >> Before: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op >> >> After: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op >> ``` >> >> Testing >> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet.
> > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > move preprocess() after remove Useless. Looks good. Could you consider also rename all methods to `*_unstable_if_traps()`? ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From xliu at openjdk.java.net Thu Jun 2 18:39:41 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 2 Jun 2022 18:39:41 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v9] In-Reply-To: <6_yCWyxR5Arhkq8CK_34O4LsJheUq-dPt0EI7PKsJ1M=.314a24ca-3334-40b8-a5c1-be4b7068f9cb@github.com> References: <6_yCWyxR5Arhkq8CK_34O4LsJheUq-dPt0EI7PKsJ1M=.314a24ca-3334-40b8-a5c1-be4b7068f9cb@github.com> Message-ID: On Thu, 2 Jun 2022 18:28:18 GMT, Vladimir Kozlov wrote: > Looks good. Could you consider also rename all methods to *_unstable_if_traps()? sure. I will rename them. thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From xliu at openjdk.java.net Thu Jun 2 18:39:39 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 2 Jun 2022 18:39:39 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v2] In-Reply-To: References: Message-ID: On Thu, 19 May 2022 07:16:53 GMT, Tobias Hartmann wrote: >> Xin Liu has updated the pull request incrementally with 11 additional commits since the last revision: >> >> - revert code change from 1st revision. >> - Merge branch 'JDK-8276998' into JDK-8286104 >> - rule out if a If nodes has 2 branches of unstable_if trap. >> - change the flag to diagnostic. >> - add sanity check for operands if bc is if_acmp_eq/ne and ifnull/nonnull >> - fix release build >> - update unstable_if after igvn. >> - adjust unstable_if after fold_compares >> - disable comparison_folding temporarily. >> >> This feature not only folds two CMPI but also merge two uncommon_traps. >> it uses the dominating uncommon_trap and revaluate the two if in >> interpreter. currently, aggressiveliveness can't work for that. 
>> - retain bci for unstable_if >> - ... and 1 more: https://git.openjdk.java.net/jdk/compare/2c38b87b...2f047457 > > I ran this through some quick testing and `test/hotspot/jtreg/compiler/rangechecks/TestExplicitRangeChecks.java` fails: > > java.lang.reflect.InvocationTargetException > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:116) > at java.base/java.lang.reflect.Method.invoke(Method.java:578) > at compiler.rangechecks.TestExplicitRangeChecks.doTest(TestExplicitRangeChecks.java:441) > at compiler.rangechecks.TestExplicitRangeChecks.main(TestExplicitRangeChecks.java:518) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:578) > at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127) > at java.base/java.lang.Thread.run(Thread.java:1585) > Caused by: java.lang.NullPointerException: Cannot read the array length because "" is null > at compiler.rangechecks.TestExplicitRangeChecks.test3_2(TestExplicitRangeChecks.java:113) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > ... 7 more @TobiHartmann I studied that failure. It's still from the conflict with 'fold-compares'. Some compare nodes look like range check, so c2 takes a record and postpone 'fold-compares' transformation after loop optimizations. I killed locals of the uncommon_trap prematurely. My new solution: disqualify the uncommon_trap when `has_only_uncommon_traps()` is about to return true. It returns true when 2 if nodes both have uncommon traps and c2 will fuse them. I disqualify the dominating one and the other one will be dead. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From jbhateja at openjdk.java.net Thu Jun 2 19:02:33 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 2 Jun 2022 19:02:33 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v4] In-Reply-To: References: Message-ID: On Thu, 2 Jun 2022 17:49:04 GMT, Sandhya Viswanathan wrote: >> We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2. >> The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%. >> The performance regression is due to auto-vectorization of small loops. >> We don't have AVX3Threshold consideration in auto-vectorization. >> The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors. >> >> This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line. >> >> Please review. >> >> Best Regard, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Review comment resolution Marked as reviewed by jbhateja (Committer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From jbhateja at openjdk.java.net Thu Jun 2 19:02:35 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 2 Jun 2022 19:02:35 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v2] In-Reply-To: References: <54Cx68cjFE-RfvwVJB92DhENPyRIwzhi3jfyG5ZGPSg=.563519e8-7880-4754-933c-78d66affabef@github.com> <2ZEEJJQuJDrG1UuL6IOMr5nvCm1DCs2PLPp4y0Dpqag=.0d3588d9-6f49-43c6-bfbf-62cdb239450f@github.com> Message-ID: On Thu, 2 Jun 2022 17:44:54 GMT, Sandhya Viswanathan wrote: >>> Hi @sviswa7 , #7806 implemented an interface for auto-vectorization to disable some unprofitable cases on aarch64.
Can it also be applied to your case? >> >> Maybe. But it would require more careful changes. And that changeset is not integrated yet. >> Current changes are clean and serve their purpose good. >> >> And, as Jatin and Sandhya said, we may do proper fix after JDK 19 fork. Then we can look on your proposal. > > @vnkozlov @jatin-bhateja Your review comments are implemented. Please take a look. Thanks @sviswa7 , changes looks good to me. ------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From xliu at openjdk.java.net Thu Jun 2 19:16:45 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 2 Jun 2022 19:16:45 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v10] In-Reply-To: References: Message-ID: > I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ±
0.075 us/op > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op > ``` > > Testing > I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. Xin Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 25 additional commits since the last revision: - Merge branch 'master' into JDK-8286104 - Remame all methods to _unstable_if_trap(s) and group them. - move preprocess() after remove Useless. - Refactor per reviewer's feedback. - Remove useless flag. if jdwp is on, liveness_at_bci() marks all local variables live. - support option AggressiveLivessForUnstableIf - Merge branch 'master' into JDK-8286104 - update comments. - Merge branch 'master' into JDK-8286104 - reimplement process_unstable_ifs - ... and 15 more: https://git.openjdk.java.net/jdk/compare/6611eef0...4130cd10 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8545/files - new: https://git.openjdk.java.net/jdk/pull/8545/files/f6771d69..4130cd10 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=09 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=08-09 Stats: 52361 lines in 648 files changed: 26567 ins; 19587 del; 6207 mod Patch: https://git.openjdk.java.net/jdk/pull/8545.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8545/head:pull/8545 PR: https://git.openjdk.java.net/jdk/pull/8545 From kvn at openjdk.java.net Thu Jun 2 20:32:33 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 20:32:33 GMT Subject: RFR: 8287425: Remove unnecessary register push for MacroAssembler::check_klass_subtype_slow_path In-Reply-To: References: Message-ID: <3C7q08ifvL-ZLV0P6_PEMXCraisXrD1y5aKKayk4e7E=.ac8eed77-d2bf-4780-b6ea-920599c49715@github.com> On Fri, 27 May 2022 09:10:48 GMT, Xiaolin Zheng wrote: > Hi team, >
> ![AE98A8E7-9F6F-4722-B310-299A9A96A957](https://user-images.githubusercontent.com/38156692/170670906-2ce37a13-af21-4cf8-acbd-ca24528bc3a9.png) > > Some perf results show unnecessary pushes in `MacroAssembler::check_klass_subtype_slow_path()` under `UseCompressedOops`. History logs show the original code is like [1], and it gets refactored in [JDK-6813212](https://bugs.openjdk.java.net/browse/JDK-6813212), and the counterparts of the `UseCompressedOops` in the diff are at [2] and [3], and we could see the push of rax is just because `encode_heap_oop_not_null()` would kill it, so here needs a push and restore. After that, [JDK-6964458](https://bugs.openjdk.java.net/browse/JDK-6964458) (removal of perm gen) at [4] removed [3] so that there is no need to do UseCompressedOops work in `MacroAssembler::check_klass_subtype_slow_path()`; but in that patch [2] didn't get removed, so we finally come here. As a result, [2] could also be safely removed. > > (Files in [4] are folded because the patch is too large. We could manually unfold `hotspot/src/cpu/x86/vm/assembler_x86.cpp` to see that diff) > > I was wondering if this minor change could be sponsored? > > This enhancement is raised on behalf of Wei Kuai . > > Tested x86_64 hotspot tier1~tier4 twice, aarch64 hotspot tier1~tier4 once with another jdk tier1 once, and riscv64 hotspot tier1~tier4 once. 
> > Thanks, > Xiaolin > > [1] https://github.com/openjdk/jdk/blob/de67e5294982ce197f2abd051cbb1c8aa6c29499/hotspot/src/cpu/x86/vm/interp_masm_x86_64.cpp#L273-L284 > [2] https://github.com/openjdk/jdk/commit/b8dbe8d8f650124b61a4ce8b70286b5b444a3316#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3R7441-R7444 > [3] https://github.com/openjdk/jdk/commit/b8dbe8d8f650124b61a4ce8b70286b5b444a3316#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3R7466-R7477 > [4] https://github.com/openjdk/jdk/commit/5c58d27aac7b291b879a7a3ff6f39fca25619103#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3L9347-L9361 Testing passed. ------------- PR: https://git.openjdk.java.net/jdk/pull/8915 From xlinzheng at openjdk.java.net Thu Jun 2 20:35:31 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Thu, 2 Jun 2022 20:35:31 GMT Subject: Integrated: 8287425: Remove unnecessary register push for MacroAssembler::check_klass_subtype_slow_path In-Reply-To: References: Message-ID: On Fri, 27 May 2022 09:10:48 GMT, Xiaolin Zheng wrote: > Hi team, > > ![AE98A8E7-9F6F-4722-B310-299A9A96A957](https://user-images.githubusercontent.com/38156692/170670906-2ce37a13-af21-4cf8-acbd-ca24528bc3a9.png) > > Some perf results show unnecessary pushes in `MacroAssembler::check_klass_subtype_slow_path()` under `UseCompressedOops`. History logs show the original code is like [1], and it gets refactored in [JDK-6813212](https://bugs.openjdk.java.net/browse/JDK-6813212), and the counterparts of the `UseCompressedOops` in the diff are at [2] and [3], and we could see the push of rax is just because `encode_heap_oop_not_null()` would kill it, so here needs a push and restore. 
After that, [JDK-6964458](https://bugs.openjdk.java.net/browse/JDK-6964458) (removal of perm gen) at [4] removed [3] so that there is no need to do UseCompressedOops work in `MacroAssembler::check_klass_subtype_slow_path()`; but in that patch [2] didn't get removed, so we finally come here. As a result, [2] could also be safely removed. > > (Files in [4] are folded because the patch is too large. We could manually unfold `hotspot/src/cpu/x86/vm/assembler_x86.cpp` to see that diff) > > I was wondering if this minor change could be sponsored? > > This enhancement is raised on behalf of Wei Kuai . > > Tested x86_64 hotspot tier1~tier4 twice, aarch64 hotspot tier1~tier4 once with another jdk tier1 once, and riscv64 hotspot tier1~tier4 once. > > Thanks, > Xiaolin > > [1] https://github.com/openjdk/jdk/blob/de67e5294982ce197f2abd051cbb1c8aa6c29499/hotspot/src/cpu/x86/vm/interp_masm_x86_64.cpp#L273-L284 > [2] https://github.com/openjdk/jdk/commit/b8dbe8d8f650124b61a4ce8b70286b5b444a3316#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3R7441-R7444 > [3] https://github.com/openjdk/jdk/commit/b8dbe8d8f650124b61a4ce8b70286b5b444a3316#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3R7466-R7477 > [4] https://github.com/openjdk/jdk/commit/5c58d27aac7b291b879a7a3ff6f39fca25619103#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3L9347-L9361 This pull request has now been integrated. 
Changeset: b5a646ee Author: Xiaolin Zheng Committer: Vladimir Kozlov URL: https://git.openjdk.java.net/jdk/commit/b5a646ee6cfd432cef6b7e69a177959227a38ace Stats: 3 lines in 3 files changed: 0 ins; 0 del; 3 mod 8287425: Remove unnecessary register push for MacroAssembler::check_klass_subtype_slow_path Co-authored-by: Wei Kuai Reviewed-by: kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8915 From dlong at openjdk.java.net Thu Jun 2 21:50:26 2022 From: dlong at openjdk.java.net (Dean Long) Date: Thu, 2 Jun 2022 21:50:26 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v7] In-Reply-To: <4akCq1xQS8yg3EWmE8DCxAFxvTkn-3Jnrl8hH0yqFkc=.969ede12-85ee-4809-a080-8d09d7b59a38@github.com> References: <4akCq1xQS8yg3EWmE8DCxAFxvTkn-3Jnrl8hH0yqFkc=.969ede12-85ee-4809-a080-8d09d7b59a38@github.com> Message-ID: On Sat, 16 Apr 2022 11:24:57 GMT, Quan Anh Mai wrote: >> Hi, this patch improves some operations on x86_64: >> >> - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible: >> + Bounded operands >> + Multiple uops both in fused and unfused domains >> + May result in flag stall since the operations have unpredictable flag output >> >> - Flag to general-purpose registers operation currently uses `cmovcc`, which requires set up and 1 more spare register for constant, this could be replaced by set, which transforms the sequence: >> >> xorl dst, dst >> sometest >> movl tmp, 0x01 >> cmovlcc dst, tmp >> >> into: >> >> xorl dst, dst >> sometest >> setbcc dst >> >> This sequence does not need a spare register and without any drawbacks. 
>> (Note: `movzx` does not work since move elision only occurs with different registers for input and output) >> >> - Some small improvements: >> + Add memory variances to `tzcnt` and `lzcnt` >> + Add memory variances to `rolx` and `rorx` >> + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`) >> >> The speedup can be observed for variable shift instructions >> >> Before: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.836 ± 0.030 us/op >> Integers.shiftRight 500 avgt 5 0.843 ± 0.056 us/op >> Integers.shiftURight 500 avgt 5 0.830 ± 0.057 us/op >> Longs.shiftLeft 500 avgt 5 0.827 ± 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.828 ± 0.018 us/op >> Longs.shiftURight 500 avgt 5 0.829 ± 0.038 us/op >> >> After: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.761 ± 0.016 us/op >> Integers.shiftRight 500 avgt 5 0.762 ± 0.071 us/op >> Integers.shiftURight 500 avgt 5 0.765 ± 0.056 us/op >> Longs.shiftLeft 500 avgt 5 0.755 ± 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.753 ± 0.017 us/op >> Longs.shiftURight 500 avgt 5 0.759 ± 0.031 us/op >> >> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 15 commits: > > - Resolve conflict > - ins_cost > - movzx is not elided with same input and output > - fix only the needs > - fix > - cisc > - delete benchmark command > - pipe > - fix, benchmarks > - pipe_class > - ... and 5 more: https://git.openjdk.java.net/jdk/compare/e5041ae3...337c0bf3 src/hotspot/cpu/x86/x86_64.ad line 10766: > 10764: format %{ "xorl $dst, $dst\t# ci2b\n\t" > 10765: "testl $src, $src\n\t" > 10766: "setnz $dst" %} What's the advantage of this change?
The disadvantage is a spare TEMP register is needed -- we can't reuse src as dst. ------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From sviswanathan at openjdk.java.net Thu Jun 2 22:16:29 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 2 Jun 2022 22:16:29 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: Message-ID: <1XRNVIUQjE2jEYRR766gwn2TFc3SXGH6H_XiORuCywk=.b518ba0e-632c-42e0-a0a9-4779221b50da@github.com> On Wed, 1 Jun 2022 02:13:47 GMT, Pengfei Li wrote: >> test/hotspot/jtreg/compiler/c2/irTests/TestVectorizeURShiftSubword.java line 36: >> >>> 34: * @key randomness >>> 35: * @summary Auto-vectorization enhancement for unsigned shift right on signed subword types >>> 36: * @requires os.arch=="amd64" | os.arch=="x86_64" | os.arch=="aarch64" >> >> This IR test for vectorizable check looks good on AArch64. But AFAIK, some operations cannot be vectorized on old x86 CPUs with AVX=1. Could you add something like `(os.simpleArch == "x64" & vm.cpu.features ~= ".*avx2.*")` to check the CPU feature? > > @merykitty @sviswa7 Could you help confirm if byte/short shift operations are vectorizable with all AVX versions of x86? @pfustc They are available either directly or through a series of instructions for majority of the architectures. All implemented in x86.ad. The match_rule_supported and match_rule_supported_vector will appropriately return true/false. 
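For anyone following the 8283307 thread without the PR description at hand: the change relies on the observation that, for a constant shift amount no larger than the number of sign-extended bits (16 for short, 24 for byte), an unsigned right shift on a sign-extended subword value gives the same narrowed result as a signed right shift. A minimal sketch that checks this exhaustively in plain Java is below. The class and method names are invented for this note and are not part of the patch or of HotSpot.

```java
// Illustrative only: brute-force check of the scalar identity behind the
// vectorization. Names here are hypothetical, not taken from the patch.
class SubwordShiftCheck {
    // True if (short)(s >>> shift) == (short)(s >> shift) for every short s.
    static boolean shortShiftsAgree(int shift) {
        for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++) {
            short s = (short) v;
            short viaUnsigned = (short) (s >>> shift); // s is promoted to int first
            short viaSigned = (short) (s >> shift);
            if (viaUnsigned != viaSigned) {
                return false;
            }
        }
        return true;
    }

    // True if (byte)(b >>> shift) == (byte)(b >> shift) for every byte b.
    static boolean byteShiftsAgree(int shift) {
        for (int v = Byte.MIN_VALUE; v <= Byte.MAX_VALUE; v++) {
            byte b = (byte) v;
            byte viaUnsigned = (byte) (b >>> shift);
            byte viaSigned = (byte) (b >> shift);
            if (viaUnsigned != viaSigned) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Shift counts on int operands are taken modulo 32, so 0..31 covers all cases.
        for (int shift = 0; shift < 32; shift++) {
            System.out.println("shift=" + shift
                    + " short agrees: " + shortShiftsAgree(shift)
                    + ", byte agrees: " + byteShiftsAgree(shift));
        }
    }
}
```

The two forms stop agreeing once the shift amount exceeds the sign-extended width, because a zero shifted in from the top of the promoted int then reaches the bits that survive the narrowing cast.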
------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From sviswanathan at openjdk.java.net Thu Jun 2 22:32:41 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 2 Jun 2022 22:32:41 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5] In-Reply-To: References: Message-ID: On Thu, 2 Jun 2022 03:27:59 GMT, Xiaohong Gong wrote: >> Currently the vector load with mask when the given index happens out of the array boundary is implemented with pure java scalar code to avoid the IOOBE (IndexOutOfBoundsException). This is necessary for architectures that do not support the predicate feature. Because the masked load is implemented with a full vector load and a vector blend applied on it. And a full vector load will definitely cause the IOOBE which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise exception. >> >> This patch adds the vectorization support for the masked load with IOOBE part. Please see the original java implementation (FIXME: optimize): >> >> >> @ForceInline >> public static >> ByteVector fromArray(VectorSpecies<Byte> species, >> byte[] a, int offset, >> VectorMask<Byte> m) { >> ByteSpecies vsp = (ByteSpecies) species; >> if (offset >= 0 && offset <= (a.length - species.length())) { >> return vsp.dummyVector().fromArray0(a, offset, m); >> } >> >> // FIXME: optimize >> checkMaskFromIndexSize(offset, vsp, m, 1, a.length); >> return vsp.vOp(m, i -> a[offset + i]); >> } >> >> Since it can only be vectorized with the predicate load, the hotspot must check whether the current backend supports it and falls back to the java scalar version if not.
This is different from the normal masked vector load that the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler make the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part while "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that normal java path will be executed. >> >> Also adds the same vectorization support for masked: >> - fromByteArray/fromByteBuffer >> - fromBooleanArray >> - fromCharArray >> >> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` on the x86 AVX-512 system: >> >> Benchmark before After Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms >> >> Similar performance gain can also be observed on 512-bit SVE system. > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: > > - Merge branch 'jdk:master' into JDK-8283667 > - Use integer constant for offsetInRange all the way through > - Rename "use_predicate" to "needs_predicate" > - Rename the "usePred" to "offsetInRange" > - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature Marked as reviewed by sviswanathan (Reviewer). 
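As a rough scalar model of the semantics discussed in the 8283667 thread (this is not the Vector API implementation, and every name below is hypothetical): a masked load only requires the lanes whose mask bit is set to lie inside the array, whereas a full-width load followed by a blend touches every lane and therefore fails near the end of the array. The mask is modelled here as a plain boolean[] rather than a VectorMask.

```java
// Illustrative scalar model only; not code from the JDK or from the patch.
class MaskedLoadModel {
    // Mirrors the fast-path bounds test in the quoted fromArray: true when all
    // `lanes` elements starting at `offset` are inside the array.
    static boolean fullLoadInBounds(byte[] a, int offset, int lanes) {
        return offset >= 0 && offset <= a.length - lanes;
    }

    // Loads a[offset + i] for each lane i whose mask bit is set; unset lanes
    // are left as zero and are never read from the array.
    static byte[] maskedLoad(byte[] a, int offset, boolean[] mask) {
        byte[] result = new byte[mask.length];
        for (int i = 0; i < mask.length; i++) {
            if (!mask[i]) {
                continue; // unset lane: no memory access, no bounds check
            }
            int index = offset + i;
            if (index < 0 || index >= a.length) {
                throw new IndexOutOfBoundsException(
                        "masked lane " + i + " maps to index " + index);
            }
            result[i] = a[index];
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4, 5};
        // Four lanes starting at offset 3 run past the end of the array,
        // but only the first two lanes are requested by the mask.
        boolean[] mask = {true, true, false, false};
        System.out.println(fullLoadInBounds(a, 3, mask.length));
        System.out.println(java.util.Arrays.toString(maskedLoad(a, 3, mask)));
    }
}
```

In this model the case the thread cares about is the one where `fullLoadInBounds` is false but `maskedLoad` still succeeds; that is the situation a predicated load instruction can handle directly and a load-plus-blend sequence cannot.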
------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From sviswanathan at openjdk.java.net Thu Jun 2 22:32:42 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 2 Jun 2022 22:32:42 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: <7BACvqeUZFJbVq36mElnVBWg2vXyN6kVUXYNKvJ7cuA=.a04e6924-006b-43f3-adec-97132d5a719d@github.com> References: <_c_QPZQIL-ZxBs9TaKmrh7_1WcbEDH1pUwhTpOc6PD8=.75e4a61b-ebb6-491c-9c5b-9a035f0b9eaf@github.com> <7BACvqeUZFJbVq36mElnVBWg2vXyN6kVUXYNKvJ7cuA=.a04e6924-006b-43f3-adec-97132d5a719d@github.com> Message-ID: On Thu, 2 Jun 2022 03:24:07 GMT, Xiaohong Gong wrote: >>> @XiaohongGong Could you please rebase the branch and resolve conflicts? >> >> Sure, I'm working on this now. The patch will be updated soon. Thanks. > >> > @XiaohongGong Could you please rebase the branch and resolve conflicts? >> >> Sure, I'm working on this now. The patch will be updated soon. Thanks. > > Resolved the conflicts. Thanks! @XiaohongGong You need one more review approval. ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From fgao at openjdk.java.net Thu Jun 2 23:32:32 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 2 Jun 2022 23:32:32 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v3] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <3_-2N1Kf4WIryx7eFIrXomabZJTeVNvSJ10joWdzN4s=.a16c8b8e-0834-48f8-9eac-6aaf07822ad5@github.com> Message-ID: On Thu, 2 Jun 2022 16:57:52 GMT, Vladimir Kozlov wrote: > Do you mean it fail without #8961 or it fail always? 
@vnkozlov ,after [JDK-8286972](https://bugs.openjdk.java.net/browse/JDK-8286972), [the case](https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/test/hotspot/jtreg/compiler/c2/cr6340864/TestDoubleVect.java#L723) would fail without https://github.com/openjdk/jdk/pull/8961. With it, the case passes. ------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From kvn at openjdk.java.net Thu Jun 2 23:32:38 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Jun 2022 23:32:38 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v6] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: On Thu, 2 Jun 2022 14:02:55 GMT, Fei Gao wrote: >> After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like: >> int <-> double >> float <-> long >> int <-> long >> float <-> double >> >> A typical test case: >> >> int[] a; >> double[] b; >> for (int i = start; i < limit; i++) { >> b[i] = (double) a[i]; >> } >> >> Our expected OptoAssembly code for one iteration is like below: >> >> add R12, R2, R11, LShiftL #2 >> vector_load V16,[R12, #16] >> vectorcast_i2d V16, V16 # convert I to D vector >> add R11, R1, R11, LShiftL #3 # ptr >> add R13, R11, #16 # ptr >> vector_store [R13], V16 >> >> To enable the vectorization, the patch solves the following problems in the SLP. >> >> There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. 
However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain >> and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with new type. >> >> After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. >> >> In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 are just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's potential to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well.
>> >> Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use(). >> >> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes. >> >> Here is the test data (-XX:+UseSuperWord) on NEON: >> >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 216.431 ± 0.131 ns/op >> convertD2I 523 avgt 15 220.522 ± 0.311 ns/op >> convertF2D 523 avgt 15 217.034 ± 0.292 ns/op >> convertF2L 523 avgt 15 231.634 ± 1.881 ns/op >> convertI2D 523 avgt 15 229.538 ± 0.095 ns/op >> convertI2L 523 avgt 15 214.822 ± 0.131 ns/op >> convertL2F 523 avgt 15 230.188 ± 0.217 ns/op >> convertL2I 523 avgt 15 162.234 ± 0.235 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 124.352 ± 1.079 ns/op >> convertD2I 523 avgt 15 557.388 ± 8.166 ns/op >> convertF2D 523 avgt 15 118.082 ± 4.026 ns/op >> convertF2L 523 avgt 15 225.810 ± 11.180 ns/op >> convertI2D 523 avgt 15 166.247 ± 0.120 ns/op >> convertI2L 523 avgt 15 119.699 ± 2.925 ns/op >> convertL2F 523 avgt 15 220.847 ± 0.053 ns/op >> convertL2I 523 avgt 15 122.339 ± 2.738 ns/op >> >> perf data on X86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 279.466 ± 0.069 ns/op >> convertD2I 523 avgt 15 551.009 ± 7.459 ns/op >> convertF2D 523 avgt 15 276.066 ± 0.117 ns/op >> convertF2L 523 avgt 15 545.108 ± 5.697 ns/op >> convertI2D 523 avgt 15 745.303 ± 0.185 ns/op >> convertI2L 523 avgt 15 260.878 ± 0.044 ns/op >> convertL2F 523 avgt 15 502.016 ± 0.172 ns/op >> convertL2I 523 avgt 15 261.654 ± 3.326 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 106.975 ±
0.045 ns/op >> convertD2I 523 avgt 15 546.866 ? 9.287 ns/op >> convertF2D 523 avgt 15 82.414 ? 0.340 ns/op >> convertF2L 523 avgt 15 542.235 ? 2.785 ns/op >> convertI2D 523 avgt 15 92.966 ? 1.400 ns/op >> convertI2L 523 avgt 15 79.960 ? 0.528 ns/op >> convertL2F 523 avgt 15 504.712 ? 4.794 ns/op >> convertL2I 523 avgt 15 129.753 ? 0.094 ns/op >> >> perf data on AVX512: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 282.984 ? 4.022 ns/op >> convertD2I 523 avgt 15 543.080 ? 3.873 ns/op >> convertF2D 523 avgt 15 273.950 ? 0.131 ns/op >> convertF2L 523 avgt 15 539.568 ? 2.747 ns/op >> convertI2D 523 avgt 15 745.238 ? 0.069 ns/op >> convertI2L 523 avgt 15 260.935 ? 0.169 ns/op >> convertL2F 523 avgt 15 501.870 ? 0.359 ns/op >> convertL2I 523 avgt 15 257.508 ? 0.174 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 76.687 ? 0.530 ns/op >> convertD2I 523 avgt 15 545.408 ? 4.657 ns/op >> convertF2D 523 avgt 15 273.935 ? 0.099 ns/op >> convertF2L 523 avgt 15 540.534 ? 3.032 ns/op >> convertI2D 523 avgt 15 745.234 ? 0.053 ns/op >> convertI2L 523 avgt 15 260.865 ? 0.104 ns/op >> convertL2F 523 avgt 15 63.834 ? 4.777 ns/op >> convertL2I 523 avgt 15 48.183 ? 0.990 ns/op > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains seven commits: > > - Implement an interface for auto-vectorization to consult supported match rules > > Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701 > - Merge branch 'master' into fg8283091 > > Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd > - Merge branch 'master' into fg8283091 > > Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f > - Merge branch 'master' into fg8283091 > > Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 > - Add micro-benchmark cases > > Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8 > - Merge branch 'master' into fg8283091 > > Change-Id: I674581135fd0844accc65520574fcef161eededa > - 8283091: Support type conversion between different data sizes in SLP > > After JDK-8275317, C2's SLP vectorizer has supported type conversion > between the same data size. We can also support conversions between > different data sizes like: > int <-> double > float <-> long > int <-> long > float <-> double > > A typical test case: > > int[] a; > double[] b; > for (int i = start; i < limit; i++) { > b[i] = (double) a[i]; > } > > Our expected OptoAssembly code for one iteration is like below: > > add R12, R2, R11, LShiftL #2 > vector_load V16,[R12, #16] > vectorcast_i2d V16, V16 # convert I to D vector > add R11, R1, R11, LShiftL #3 # ptr > add R13, R11, #16 # ptr > vector_store [R13], V16 > > To enable the vectorization, the patch solves the following problems > in the SLP. > > There are three main operations in the case above, LoadI, ConvI2D and > StoreD. Assuming that the vector length is 128 bits, how many scalar > nodes should be packed together to a vector? If we decide it > separately for each operation node, like what we did before the patch > in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI > or 2 ConvI2D or 2 StoreD nodes. 
However, if we put these packed nodes > in a vector node sequence, like loading 4 elements to a vector, then > typecasting 2 elements and lastly storing these 2 elements, they become > invalid. As a result, we should look through the whole def-use chain > and then pick up the minimum of these element sizes, like function > SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. > In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then > generate valid vector node sequence, like loading 2 elements, > converting the 2 elements to another type and storing the 2 elements > with new type. > > After this, LoadI nodes don't make full use of the whole vector and > only occupy part of it. So we adapt the code in > SuperWord::get_vw_bytes_special() to the situation. > > In SLP, we calculate a kind of alignment as position trace for each > scalar node in the whole vector. In this case, the alignments for 2 > LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. > Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which > mark that this node is the second node in the whole vector, while the > difference between 4 and 8 are just because of their own data sizes. In > this situation, we should try to remove the impact caused by different > data size in SLP. For example, in the stage of > SuperWord::extend_packlist(), while determining if it's potential to > pack a pair of def nodes in the function SuperWord::follow_use_defs(), > we remove the side effect of different data size by transforming the > target alignment from the use node. Because we believe that, assuming > that the vector length is 512 bits, if the ConvI2D use nodes have > alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, > these two LoadI nodes should be packed as a pair as well. 
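Editor's sketch (not part of the patch): the pack-size rule in the commit message above can be illustrated by letting the widest element type in the def-use chain decide how many scalar nodes go into one pack. The names below are hypothetical and the code is Java only for illustration; the actual logic lives in SuperWord::max_vector_size_in_ud_chain(), whose body is not shown in this thread.

```java
public class PackSizeInChain {
    // Number of elements every operation in the chain can process together.
    // The widest element in the chain is the limiting factor.
    // Hypothetical helper, assuming sizes are given in bytes.
    static int lanes(int vectorBits, int... elemBytesInChain) {
        if (elemBytesInChain.length == 0) {
            throw new IllegalArgumentException("empty chain");
        }
        int longest = 0;
        for (int bytes : elemBytesInChain) {
            longest = Math.max(longest, bytes);
        }
        return vectorBits / 8 / longest;
    }

    public static void main(String[] args) {
        // Decided per operation, a 128-bit vector holds 4 ints ...
        System.out.println(lanes(128, 4));       // prints 4
        // ... but the chain LoadI (4) -> ConvI2D (8) -> StoreD (8)
        // can only be packed 2 wide.
        System.out.println(lanes(128, 4, 8, 8)); // prints 2
    }
}
```

This reproduces the numbers in the description: 4 LoadI nodes when decided separately, but 2 LoadI, 2 ConvI2D and 2 StoreD nodes once the whole chain is considered.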
> > Similarly, when determining if the vectorization is profitable, type > conversion between different data size takes a type of one size and > produces a type of another size, hence the special checks on alignment > and size should be applied, like what we do in SuperWord::is_vector_use. > > After solving these problems, we successfully implemented the > vectorization of type conversion between different data sizes. > > Here is the test data on NEON: > > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 216.431 ? 0.131 ns/op > VectorLoop.convertD2I 523 avgt 15 220.522 ? 0.311 ns/op > VectorLoop.convertF2D 523 avgt 15 217.034 ? 0.292 ns/op > VectorLoop.convertF2L 523 avgt 15 231.634 ? 1.881 ns/op > VectorLoop.convertI2D 523 avgt 15 229.538 ? 0.095 ns/op > VectorLoop.convertI2L 523 avgt 15 214.822 ? 0.131 ns/op > VectorLoop.convertL2F 523 avgt 15 230.188 ? 0.217 ns/op > VectorLoop.convertL2I 523 avgt 15 162.234 ? 0.235 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 124.352 ? 1.079 ns/op > VectorLoop.convertD2I 523 avgt 15 557.388 ? 8.166 ns/op > VectorLoop.convertF2D 523 avgt 15 118.082 ? 4.026 ns/op > VectorLoop.convertF2L 523 avgt 15 225.810 ? 11.180 ns/op > VectorLoop.convertI2D 523 avgt 15 166.247 ? 0.120 ns/op > VectorLoop.convertI2L 523 avgt 15 119.699 ? 2.925 ns/op > VectorLoop.convertL2F 523 avgt 15 220.847 ? 0.053 ns/op > VectorLoop.convertL2I 523 avgt 15 122.339 ? 2.738 ns/op > > perf data on X86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 279.466 ? 0.069 ns/op > VectorLoop.convertD2I 523 avgt 15 551.009 ? 7.459 ns/op > VectorLoop.convertF2D 523 avgt 15 276.066 ? 0.117 ns/op > VectorLoop.convertF2L 523 avgt 15 545.108 ? 5.697 ns/op > VectorLoop.convertI2D 523 avgt 15 745.303 ? 0.185 ns/op > VectorLoop.convertI2L 523 avgt 15 260.878 ? 
0.044 ns/op > VectorLoop.convertL2F 523 avgt 15 502.016 ? 0.172 ns/op > VectorLoop.convertL2I 523 avgt 15 261.654 ? 3.326 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 106.975 ? 0.045 ns/op > VectorLoop.convertD2I 523 avgt 15 546.866 ? 9.287 ns/op > VectorLoop.convertF2D 523 avgt 15 82.414 ? 0.340 ns/op > VectorLoop.convertF2L 523 avgt 15 542.235 ? 2.785 ns/op > VectorLoop.convertI2D 523 avgt 15 92.966 ? 1.400 ns/op > VectorLoop.convertI2L 523 avgt 15 79.960 ? 0.528 ns/op > VectorLoop.convertL2F 523 avgt 15 504.712 ? 4.794 ns/op > VectorLoop.convertL2I 523 avgt 15 129.753 ? 0.094 ns/op > > perf data on AVX512: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 282.984 ? 4.022 ns/op > VectorLoop.convertD2I 523 avgt 15 543.080 ? 3.873 ns/op > VectorLoop.convertF2D 523 avgt 15 273.950 ? 0.131 ns/op > VectorLoop.convertF2L 523 avgt 15 539.568 ? 2.747 ns/op > VectorLoop.convertI2D 523 avgt 15 745.238 ? 0.069 ns/op > VectorLoop.convertI2L 523 avgt 15 260.935 ? 0.169 ns/op > VectorLoop.convertL2F 523 avgt 15 501.870 ? 0.359 ns/op > VectorLoop.convertL2I 523 avgt 15 257.508 ? 0.174 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 76.687 ? 0.530 ns/op > VectorLoop.convertD2I 523 avgt 15 545.408 ? 4.657 ns/op > VectorLoop.convertF2D 523 avgt 15 273.935 ? 0.099 ns/op > VectorLoop.convertF2L 523 avgt 15 540.534 ? 3.032 ns/op > VectorLoop.convertI2D 523 avgt 15 745.234 ? 0.053 ns/op > VectorLoop.convertI2L 523 avgt 15 260.865 ? 0.104 ns/op > VectorLoop.convertL2F 523 avgt 15 63.834 ? 4.777 ns/op > VectorLoop.convertL2I 523 avgt 15 48.183 ? 0.990 ns/op > > Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef > > Do you mean it fail without #8961 or it fail always? 
> > @vnkozlov ,after [JDK-8286972](https://bugs.openjdk.java.net/browse/JDK-8286972), [the case](https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/test/hotspot/jtreg/compiler/c2/cr6340864/TestDoubleVect.java#L723) would fail without #8961. With it, the case passes. Got it. ------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From kvn at openjdk.java.net Fri Jun 3 00:03:32 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 3 Jun 2022 00:03:32 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v6] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: On Thu, 2 Jun 2022 14:02:55 GMT, Fei Gao wrote: >> After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like: >> int <-> double >> float <-> long >> int <-> long >> float <-> double >> >> A typical test case: >> >> int[] a; >> double[] b; >> for (int i = start; i < limit; i++) { >> b[i] = (double) a[i]; >> } >> >> Our expected OptoAssembly code for one iteration is like below: >> >> add R12, R2, R11, LShiftL #2 >> vector_load V16,[R12, #16] >> vectorcast_i2d V16, V16 # convert I to D vector >> add R11, R1, R11, LShiftL #3 # ptr >> add R13, R11, #16 # ptr >> vector_store [R13], V16 >> >> To enable the vectorization, the patch solves the following problems in the SLP. >> >> There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. 
However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain >> and then pick up the minimum of these element sizes, like the function SuperWord::max_vector_size_in_ud_chain() does in superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate a valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with the new type. >> >> After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. >> >> In SLP, we calculate a kind of alignment as a position trace for each scalar node in the whole vector. In this case, the alignments for the 2 LoadI nodes are 0, 4 while the alignments for the 2 ConvI2D nodes are 0, 8. Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 is just because of their own data sizes. In this situation, we should try to remove the impact caused by different data sizes in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining whether it is possible to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data sizes by transforming the target alignment from the use node. We believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well.
>> >> Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use(). >> >> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes. >> >> Here is the test data (-XX:+UseSuperWord) on NEON: >> >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 216.431 ? 0.131 ns/op >> convertD2I 523 avgt 15 220.522 ? 0.311 ns/op >> convertF2D 523 avgt 15 217.034 ? 0.292 ns/op >> convertF2L 523 avgt 15 231.634 ? 1.881 ns/op >> convertI2D 523 avgt 15 229.538 ? 0.095 ns/op >> convertI2L 523 avgt 15 214.822 ? 0.131 ns/op >> convertL2F 523 avgt 15 230.188 ? 0.217 ns/op >> convertL2I 523 avgt 15 162.234 ? 0.235 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 124.352 ? 1.079 ns/op >> convertD2I 523 avgt 15 557.388 ? 8.166 ns/op >> convertF2D 523 avgt 15 118.082 ? 4.026 ns/op >> convertF2L 523 avgt 15 225.810 ? 11.180 ns/op >> convertI2D 523 avgt 15 166.247 ? 0.120 ns/op >> convertI2L 523 avgt 15 119.699 ? 2.925 ns/op >> convertL2F 523 avgt 15 220.847 ? 0.053 ns/op >> convertL2I 523 avgt 15 122.339 ? 2.738 ns/op >> >> perf data on X86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 279.466 ? 0.069 ns/op >> convertD2I 523 avgt 15 551.009 ? 7.459 ns/op >> convertF2D 523 avgt 15 276.066 ? 0.117 ns/op >> convertF2L 523 avgt 15 545.108 ? 5.697 ns/op >> convertI2D 523 avgt 15 745.303 ? 0.185 ns/op >> convertI2L 523 avgt 15 260.878 ? 0.044 ns/op >> convertL2F 523 avgt 15 502.016 ? 0.172 ns/op >> convertL2I 523 avgt 15 261.654 ? 3.326 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 106.975 ? 
0.045 ns/op >> convertD2I 523 avgt 15 546.866 ? 9.287 ns/op >> convertF2D 523 avgt 15 82.414 ? 0.340 ns/op >> convertF2L 523 avgt 15 542.235 ? 2.785 ns/op >> convertI2D 523 avgt 15 92.966 ? 1.400 ns/op >> convertI2L 523 avgt 15 79.960 ? 0.528 ns/op >> convertL2F 523 avgt 15 504.712 ? 4.794 ns/op >> convertL2I 523 avgt 15 129.753 ? 0.094 ns/op >> >> perf data on AVX512: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 282.984 ? 4.022 ns/op >> convertD2I 523 avgt 15 543.080 ? 3.873 ns/op >> convertF2D 523 avgt 15 273.950 ? 0.131 ns/op >> convertF2L 523 avgt 15 539.568 ? 2.747 ns/op >> convertI2D 523 avgt 15 745.238 ? 0.069 ns/op >> convertI2L 523 avgt 15 260.935 ? 0.169 ns/op >> convertL2F 523 avgt 15 501.870 ? 0.359 ns/op >> convertL2I 523 avgt 15 257.508 ? 0.174 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 76.687 ? 0.530 ns/op >> convertD2I 523 avgt 15 545.408 ? 4.657 ns/op >> convertF2D 523 avgt 15 273.935 ? 0.099 ns/op >> convertF2L 523 avgt 15 540.534 ? 3.032 ns/op >> convertI2D 523 avgt 15 745.234 ? 0.053 ns/op >> convertI2L 523 avgt 15 260.865 ? 0.104 ns/op >> convertL2F 523 avgt 15 63.834 ? 4.777 ns/op >> convertL2I 523 avgt 15 48.183 ? 0.990 ns/op > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains seven commits: > > - Implement an interface for auto-vectorization to consult supported match rules > > Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701 > - Merge branch 'master' into fg8283091 > > Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd > - Merge branch 'master' into fg8283091 > > Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f > - Merge branch 'master' into fg8283091 > > Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 > - Add micro-benchmark cases > > Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8 > - Merge branch 'master' into fg8283091 > > Change-Id: I674581135fd0844accc65520574fcef161eededa > - 8283091: Support type conversion between different data sizes in SLP > > After JDK-8275317, C2's SLP vectorizer has supported type conversion > between the same data size. We can also support conversions between > different data sizes like: > int <-> double > float <-> long > int <-> long > float <-> double > > A typical test case: > > int[] a; > double[] b; > for (int i = start; i < limit; i++) { > b[i] = (double) a[i]; > } > > Our expected OptoAssembly code for one iteration is like below: > > add R12, R2, R11, LShiftL #2 > vector_load V16,[R12, #16] > vectorcast_i2d V16, V16 # convert I to D vector > add R11, R1, R11, LShiftL #3 # ptr > add R13, R11, #16 # ptr > vector_store [R13], V16 > > To enable the vectorization, the patch solves the following problems > in the SLP. > > There are three main operations in the case above, LoadI, ConvI2D and > StoreD. Assuming that the vector length is 128 bits, how many scalar > nodes should be packed together to a vector? If we decide it > separately for each operation node, like what we did before the patch > in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI > or 2 ConvI2D or 2 StoreD nodes. 
However, if we put these packed nodes > in a vector node sequence, like loading 4 elements to a vector, then > typecasting 2 elements and lastly storing these 2 elements, they become > invalid. As a result, we should look through the whole def-use chain > and then pick up the minimum of these element sizes, like function > SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. > In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then > generate valid vector node sequence, like loading 2 elements, > converting the 2 elements to another type and storing the 2 elements > with new type. > > After this, LoadI nodes don't make full use of the whole vector and > only occupy part of it. So we adapt the code in > SuperWord::get_vw_bytes_special() to the situation. > > In SLP, we calculate a kind of alignment as position trace for each > scalar node in the whole vector. In this case, the alignments for 2 > LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. > Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which > mark that this node is the second node in the whole vector, while the > difference between 4 and 8 are just because of their own data sizes. In > this situation, we should try to remove the impact caused by different > data size in SLP. For example, in the stage of > SuperWord::extend_packlist(), while determining if it's potential to > pack a pair of def nodes in the function SuperWord::follow_use_defs(), > we remove the side effect of different data size by transforming the > target alignment from the use node. Because we believe that, assuming > that the vector length is 512 bits, if the ConvI2D use nodes have > alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, > these two LoadI nodes should be packed as a pair as well. 
> > Similarly, when determining if the vectorization is profitable, type > conversion between different data size takes a type of one size and > produces a type of another size, hence the special checks on alignment > and size should be applied, like what we do in SuperWord::is_vector_use. > > After solving these problems, we successfully implemented the > vectorization of type conversion between different data sizes. > > Here is the test data on NEON: > > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 216.431 ? 0.131 ns/op > VectorLoop.convertD2I 523 avgt 15 220.522 ? 0.311 ns/op > VectorLoop.convertF2D 523 avgt 15 217.034 ? 0.292 ns/op > VectorLoop.convertF2L 523 avgt 15 231.634 ? 1.881 ns/op > VectorLoop.convertI2D 523 avgt 15 229.538 ? 0.095 ns/op > VectorLoop.convertI2L 523 avgt 15 214.822 ? 0.131 ns/op > VectorLoop.convertL2F 523 avgt 15 230.188 ? 0.217 ns/op > VectorLoop.convertL2I 523 avgt 15 162.234 ? 0.235 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 124.352 ? 1.079 ns/op > VectorLoop.convertD2I 523 avgt 15 557.388 ? 8.166 ns/op > VectorLoop.convertF2D 523 avgt 15 118.082 ? 4.026 ns/op > VectorLoop.convertF2L 523 avgt 15 225.810 ? 11.180 ns/op > VectorLoop.convertI2D 523 avgt 15 166.247 ? 0.120 ns/op > VectorLoop.convertI2L 523 avgt 15 119.699 ? 2.925 ns/op > VectorLoop.convertL2F 523 avgt 15 220.847 ? 0.053 ns/op > VectorLoop.convertL2I 523 avgt 15 122.339 ? 2.738 ns/op > > perf data on X86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 279.466 ? 0.069 ns/op > VectorLoop.convertD2I 523 avgt 15 551.009 ? 7.459 ns/op > VectorLoop.convertF2D 523 avgt 15 276.066 ? 0.117 ns/op > VectorLoop.convertF2L 523 avgt 15 545.108 ? 5.697 ns/op > VectorLoop.convertI2D 523 avgt 15 745.303 ? 0.185 ns/op > VectorLoop.convertI2L 523 avgt 15 260.878 ? 
0.044 ns/op > VectorLoop.convertL2F 523 avgt 15 502.016 ? 0.172 ns/op > VectorLoop.convertL2I 523 avgt 15 261.654 ? 3.326 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 106.975 ? 0.045 ns/op > VectorLoop.convertD2I 523 avgt 15 546.866 ? 9.287 ns/op > VectorLoop.convertF2D 523 avgt 15 82.414 ? 0.340 ns/op > VectorLoop.convertF2L 523 avgt 15 542.235 ? 2.785 ns/op > VectorLoop.convertI2D 523 avgt 15 92.966 ? 1.400 ns/op > VectorLoop.convertI2L 523 avgt 15 79.960 ? 0.528 ns/op > VectorLoop.convertL2F 523 avgt 15 504.712 ? 4.794 ns/op > VectorLoop.convertL2I 523 avgt 15 129.753 ? 0.094 ns/op > > perf data on AVX512: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 282.984 ? 4.022 ns/op > VectorLoop.convertD2I 523 avgt 15 543.080 ? 3.873 ns/op > VectorLoop.convertF2D 523 avgt 15 273.950 ? 0.131 ns/op > VectorLoop.convertF2L 523 avgt 15 539.568 ? 2.747 ns/op > VectorLoop.convertI2D 523 avgt 15 745.238 ? 0.069 ns/op > VectorLoop.convertI2L 523 avgt 15 260.935 ? 0.169 ns/op > VectorLoop.convertL2F 523 avgt 15 501.870 ? 0.359 ns/op > VectorLoop.convertL2I 523 avgt 15 257.508 ? 0.174 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 76.687 ? 0.530 ns/op > VectorLoop.convertD2I 523 avgt 15 545.408 ? 4.657 ns/op > VectorLoop.convertF2D 523 avgt 15 273.935 ? 0.099 ns/op > VectorLoop.convertF2L 523 avgt 15 540.534 ? 3.032 ns/op > VectorLoop.convertI2D 523 avgt 15 745.234 ? 0.053 ns/op > VectorLoop.convertI2L 523 avgt 15 260.865 ? 0.104 ns/op > VectorLoop.convertL2F 523 avgt 15 63.834 ? 4.777 ns/op > VectorLoop.convertL2I 523 avgt 15 48.183 ? 0.990 ns/op > > Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef src/hotspot/share/opto/matcher.hpp line 328: > 326: static const bool match_rule_supported(int opcode); > 327: > 328: // Identify extra cases that we might want to vectorize automatically. 
// And exclude cases which are not profitable to auto-vectorize. src/hotspot/share/opto/superword.cpp line 1003: > 1001: vw = MIN2(vectsize * type2aelembytes(btype), vw); > 1002: } > 1003: This has to be adjusted after #8877 is pushed. src/hotspot/share/opto/superword.cpp line 1461: > 1459: longer_type_for_conversion(t1) != T_ILLEGAL) { > 1460: align = align / data_size(s1) * data_size(t1); > 1461: } Put it into a separate function because this code pattern is used 2 times. src/hotspot/share/opto/superword.hpp line 575: > 573: BasicType longer_type_for_conversion(Node* n); > 574: // Find the longest type in def-use chain for packed nodes, and then compute the max vector size. > 575: int max_vector_size_in_ud_chain(Node* n); I prefer full words: `*_in_def_use_chain()` src/hotspot/share/opto/vectornode.cpp line 258: > 256: return Op_VectorCastF2X; > 257: case Op_ConvD2L: > 258: return Op_VectorCastD2X; Why did you remove these lines? ------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From sviswanathan at openjdk.java.net Fri Jun 3 00:34:10 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 3 Jun 2022 00:34:10 GMT Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 [v2] In-Reply-To: References: Message-ID: > Fixed the assertion in load_iota_indices when the length passed is less than 4. > Also fixed the missing break in x86.ad match_rule_supported_vector() for PopulateIndex case.
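Editor's sketch (not part of the patch): the "missing break" mentioned above is the usual switch fall-through problem, where a case without a `break` also runs the checks of the case that follows it. The opcode constants and feature flags below are hypothetical and the code is Java only for illustration; the real conditions in x86.ad are not shown in this thread.

```java
public class MissingBreakDemo {
    // Hypothetical opcode ids, for illustration only.
    static final int OP_POPULATE_INDEX = 1;
    static final int OP_OTHER = 2;

    // Buggy shape: the first case falls through into the second,
    // so the second case's requirement is applied to it as well.
    static boolean supportedBuggy(int opcode, boolean featureA, boolean featureB) {
        switch (opcode) {
            case OP_POPULATE_INDEX:
                if (!featureA) return false;
                // no break here: control continues into the next case
            case OP_OTHER:
                if (!featureB) return false;
                break;
        }
        return true;
    }

    // Fixed shape: each case ends with its own break.
    static boolean supportedFixed(int opcode, boolean featureA, boolean featureB) {
        switch (opcode) {
            case OP_POPULATE_INDEX:
                if (!featureA) return false;
                break;
            case OP_OTHER:
                if (!featureB) return false;
                break;
        }
        return true;
    }

    public static void main(String[] args) {
        // featureA is present, featureB is not.
        System.out.println(supportedBuggy(OP_POPULATE_INDEX, true, false)); // prints false
        System.out.println(supportedFixed(OP_POPULATE_INDEX, true, false)); // prints true
    }
}
```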
Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: Add regression test ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8961/files - new: https://git.openjdk.java.net/jdk/pull/8961/files/84973dba..1939a73a Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8961&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8961&range=00-01 Stats: 65 lines in 1 file changed: 65 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8961.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8961/head:pull/8961 PR: https://git.openjdk.java.net/jdk/pull/8961 From sviswanathan at openjdk.java.net Fri Jun 3 00:34:10 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 3 Jun 2022 00:34:10 GMT Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 [v2] In-Reply-To: <-OVRPLejI69eYbfH0DFnp4P0mMhTU9d1QoRCl_arkDk=.437fb8c5-bbde-4e56-81f9-5f008a621037@github.com> References: <-OVRPLejI69eYbfH0DFnp4P0mMhTU9d1QoRCl_arkDk=.437fb8c5-bbde-4e56-81f9-5f008a621037@github.com> Message-ID: <1oAfoJXZw2b4JKi3wYGvBwc2eHbEDy6YExLCw_mq7WE=.719081ec-9201-4d71-8ec6-3ca125a3f093@github.com> On Thu, 2 Jun 2022 16:56:08 GMT, Vladimir Kozlov wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> Add regression test > > Looks good. Waiting new test. @vnkozlov @chhagedorn @DamonFool I have added the regression test. Please take a look. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8961 From jiefu at openjdk.java.net Fri Jun 3 00:41:30 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 3 Jun 2022 00:41:30 GMT Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 [v2] In-Reply-To: References: Message-ID: <247geSQ4hid8KzwDEk7Bj7fpPinBzDL48xrT0SOVfY8=.d42620f8-6ab3-4306-b800-131eb7a908bc@github.com> On Fri, 3 Jun 2022 00:34:10 GMT, Sandhya Viswanathan wrote: >> Fixed the assertion in load_iota_indices when the length passed is less than 4. >> Also fixed the missing break in x86.ad match_rule_supported_vector() for PopulateIndex case. > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Add regression test test/hotspot/jtreg/compiler/vectorization/cr8287517.java line 29: > 27: * @summary Test bug fix for JDK-8287517 related to fuzzer test failure in x86_64 > 28: * @requires vm.compiler2.enabled > 29: * @requires (os.simpleArch == "x64" & vm.cpu.features ~= ".*avx2.*") | Maybe, we can remove this `require` so that other platforms can also get tested. ------------- PR: https://git.openjdk.java.net/jdk/pull/8961 From sviswanathan at openjdk.java.net Fri Jun 3 00:44:32 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 3 Jun 2022 00:44:32 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v6] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: On Thu, 2 Jun 2022 23:59:21 GMT, Vladimir Kozlov wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains seven commits: >> >> - Implement an interface for auto-vectorization to consult supported match rules >> >> Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701 >> - Merge branch 'master' into fg8283091 >> >> Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 >> - Add micro-benchmark cases >> >> Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8 >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I674581135fd0844accc65520574fcef161eededa >> - 8283091: Support type conversion between different data sizes in SLP >> >> After JDK-8275317, C2's SLP vectorizer has supported type conversion >> between the same data size. We can also support conversions between >> different data sizes like: >> int <-> double >> float <-> long >> int <-> long >> float <-> double >> >> A typical test case: >> >> int[] a; >> double[] b; >> for (int i = start; i < limit; i++) { >> b[i] = (double) a[i]; >> } >> >> Our expected OptoAssembly code for one iteration is like below: >> >> add R12, R2, R11, LShiftL #2 >> vector_load V16,[R12, #16] >> vectorcast_i2d V16, V16 # convert I to D vector >> add R11, R1, R11, LShiftL #3 # ptr >> add R13, R11, #16 # ptr >> vector_store [R13], V16 >> >> To enable the vectorization, the patch solves the following problems >> in the SLP. >> >> There are three main operations in the case above, LoadI, ConvI2D and >> StoreD. Assuming that the vector length is 128 bits, how many scalar >> nodes should be packed together to a vector? If we decide it >> separately for each operation node, like what we did before the patch >> in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI >> or 2 ConvI2D or 2 StoreD nodes. 
However, if we put these packed nodes >> in a vector node sequence, like loading 4 elements to a vector, then >> typecasting 2 elements and lastly storing these 2 elements, they become >> invalid. As a result, we should look through the whole def-use chain >> and then pick up the minimum of these element sizes, like function >> SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. >> In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then >> generate valid vector node sequence, like loading 2 elements, >> converting the 2 elements to another type and storing the 2 elements >> with new type. >> >> After this, LoadI nodes don't make full use of the whole vector and >> only occupy part of it. So we adapt the code in >> SuperWord::get_vw_bytes_special() to the situation. >> >> In SLP, we calculate a kind of alignment as position trace for each >> scalar node in the whole vector. In this case, the alignments for 2 >> LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. >> Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which >> mark that this node is the second node in the whole vector, while the >> difference between 4 and 8 are just because of their own data sizes. In >> this situation, we should try to remove the impact caused by different >> data size in SLP. For example, in the stage of >> SuperWord::extend_packlist(), while determining if it's potential to >> pack a pair of def nodes in the function SuperWord::follow_use_defs(), >> we remove the side effect of different data size by transforming the >> target alignment from the use node. Because we believe that, assuming >> that the vector length is 512 bits, if the ConvI2D use nodes have >> alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, >> these two LoadI nodes should be packed as a pair as well. 
>>
>> Similarly, when determining if the vectorization is profitable, type
>> conversion between different data sizes takes a type of one size and
>> produces a type of another size, hence the special checks on alignment
>> and size should be applied, like what we do in SuperWord::is_vector_use.
>>
>> After solving these problems, we successfully implemented the
>> vectorization of type conversion between different data sizes.
>>
>> Here is the test data on NEON:
>>
>> Before the patch:
>> Benchmark              (length)  Mode  Cnt    Score    Error  Units
>> VectorLoop.convertD2F       523  avgt   15  216.431 ±  0.131  ns/op
>> VectorLoop.convertD2I       523  avgt   15  220.522 ±  0.311  ns/op
>> VectorLoop.convertF2D       523  avgt   15  217.034 ±  0.292  ns/op
>> VectorLoop.convertF2L       523  avgt   15  231.634 ±  1.881  ns/op
>> VectorLoop.convertI2D       523  avgt   15  229.538 ±  0.095  ns/op
>> VectorLoop.convertI2L       523  avgt   15  214.822 ±  0.131  ns/op
>> VectorLoop.convertL2F       523  avgt   15  230.188 ±  0.217  ns/op
>> VectorLoop.convertL2I       523  avgt   15  162.234 ±  0.235  ns/op
>>
>> After the patch:
>> Benchmark              (length)  Mode  Cnt    Score    Error  Units
>> VectorLoop.convertD2F       523  avgt   15  124.352 ±  1.079  ns/op
>> VectorLoop.convertD2I       523  avgt   15  557.388 ±  8.166  ns/op
>> VectorLoop.convertF2D       523  avgt   15  118.082 ±  4.026  ns/op
>> VectorLoop.convertF2L       523  avgt   15  225.810 ± 11.180  ns/op
>> VectorLoop.convertI2D       523  avgt   15  166.247 ±  0.120  ns/op
>> VectorLoop.convertI2L       523  avgt   15  119.699 ±  2.925  ns/op
>> VectorLoop.convertL2F       523  avgt   15  220.847 ±  0.053  ns/op
>> VectorLoop.convertL2I       523  avgt   15  122.339 ±  2.738  ns/op
>>
>> perf data on X86:
>> Before the patch:
>> Benchmark              (length)  Mode  Cnt    Score    Error  Units
>> VectorLoop.convertD2F       523  avgt   15  279.466 ±  0.069  ns/op
>> VectorLoop.convertD2I       523  avgt   15  551.009 ±  7.459  ns/op
>> VectorLoop.convertF2D       523  avgt   15  276.066 ±  0.117  ns/op
>> VectorLoop.convertF2L       523  avgt   15  545.108 ±  5.697  ns/op
>> VectorLoop.convertI2D       523  avgt   15  745.303 ±  0.185  ns/op
>> VectorLoop.convertI2L       523  avgt   15  260.878 ±  0.044  ns/op
>> VectorLoop.convertL2F       523  avgt   15  502.016 ±  0.172  ns/op
>> VectorLoop.convertL2I       523  avgt   15  261.654 ±  3.326  ns/op
>>
>> After the patch:
>> Benchmark              (length)  Mode  Cnt    Score    Error  Units
>> VectorLoop.convertD2F       523  avgt   15  106.975 ±  0.045  ns/op
>> VectorLoop.convertD2I       523  avgt   15  546.866 ±  9.287  ns/op
>> VectorLoop.convertF2D       523  avgt   15   82.414 ±  0.340  ns/op
>> VectorLoop.convertF2L       523  avgt   15  542.235 ±  2.785  ns/op
>> VectorLoop.convertI2D       523  avgt   15   92.966 ±  1.400  ns/op
>> VectorLoop.convertI2L       523  avgt   15   79.960 ±  0.528  ns/op
>> VectorLoop.convertL2F       523  avgt   15  504.712 ±  4.794  ns/op
>> VectorLoop.convertL2I       523  avgt   15  129.753 ±  0.094  ns/op
>>
>> perf data on AVX512:
>> Before the patch:
>> Benchmark              (length)  Mode  Cnt    Score    Error  Units
>> VectorLoop.convertD2F       523  avgt   15  282.984 ±  4.022  ns/op
>> VectorLoop.convertD2I       523  avgt   15  543.080 ±  3.873  ns/op
>> VectorLoop.convertF2D       523  avgt   15  273.950 ±  0.131  ns/op
>> VectorLoop.convertF2L       523  avgt   15  539.568 ±  2.747  ns/op
>> VectorLoop.convertI2D       523  avgt   15  745.238 ±  0.069  ns/op
>> VectorLoop.convertI2L       523  avgt   15  260.935 ±  0.169  ns/op
>> VectorLoop.convertL2F       523  avgt   15  501.870 ±  0.359  ns/op
>> VectorLoop.convertL2I       523  avgt   15  257.508 ±  0.174  ns/op
>>
>> After the patch:
>> Benchmark              (length)  Mode  Cnt    Score    Error  Units
>> VectorLoop.convertD2F       523  avgt   15   76.687 ±  0.530  ns/op
>> VectorLoop.convertD2I       523  avgt   15  545.408 ±  4.657  ns/op
>> VectorLoop.convertF2D       523  avgt   15  273.935 ±  0.099  ns/op
>> VectorLoop.convertF2L       523  avgt   15  540.534 ±  3.032  ns/op
>> VectorLoop.convertI2D       523  avgt   15  745.234 ±  0.053  ns/op
>> VectorLoop.convertI2L       523  avgt   15  260.865 ±  0.104  ns/op
>> VectorLoop.convertL2F       523  avgt   15   63.834 ±  4.777  ns/op
>> VectorLoop.convertL2I       523  avgt   15   48.183 ±  0.990  ns/op
>>
>> Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef
>
> src/hotspot/share/opto/vectornode.cpp line 258:
>
>> 256:       return Op_VectorCastF2X;
>> 257:     case Op_ConvD2L:
>> 258:       return Op_VectorCastD2X;
>
> Why did you remove these lines?

Yes, removing these seems like a wrong step. For x86, we do code generation for these VectorCastI2X, VectorCastL2X, VectorCastF2X and VectorCastD2X nodes.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7806

From sviswanathan at openjdk.java.net Fri Jun 3 00:54:35 2022
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Fri, 3 Jun 2022 00:54:35 GMT
Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 [v3]
In-Reply-To:
References:
Message-ID:

> Fixed the assertion in load_iota_indices when the length passed is less than 4.
> Also fixed the missing break in x86.ad match_rule_supported_vector() for PopulateIndex case.

Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:

  Remove requires from test

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8961/files
  - new: https://git.openjdk.java.net/jdk/pull/8961/files/1939a73a..37639f5f

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8961&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8961&range=01-02

  Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8961.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8961/head:pull/8961

PR: https://git.openjdk.java.net/jdk/pull/8961

From sviswanathan at openjdk.java.net Fri Jun 3 00:56:43 2022
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Fri, 3 Jun 2022 00:56:43 GMT
Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 [v2]
In-Reply-To: <247geSQ4hid8KzwDEk7Bj7fpPinBzDL48xrT0SOVfY8=.d42620f8-6ab3-4306-b800-131eb7a908bc@github.com>
References:
 <247geSQ4hid8KzwDEk7Bj7fpPinBzDL48xrT0SOVfY8=.d42620f8-6ab3-4306-b800-131eb7a908bc@github.com>
Message-ID:

On Fri, 3 Jun 2022 00:38:06 GMT, Jie Fu wrote:

>> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Add regression test
>
> test/hotspot/jtreg/compiler/vectorization/cr8287517.java line 29:
>
>> 27:  * @summary Test bug fix for JDK-8287517 related to fuzzer test failure in x86_64
>> 28:  * @requires vm.compiler2.enabled
>> 29:  * @requires (os.simpleArch == "x64" & vm.cpu.features ~= ".*avx2.*") |
>
> Maybe, we can remove this `require` so that other platforms can also get tested.

@DamonFool I removed requires from the test as you suggested. Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8961

From fgao at openjdk.java.net Fri Jun 3 01:07:30 2022
From: fgao at openjdk.java.net (Fei Gao)
Date: Fri, 3 Jun 2022 01:07:30 GMT
Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v6]
In-Reply-To:
References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com>
Message-ID:

On Thu, 2 Jun 2022 23:59:21 GMT, Vladimir Kozlov wrote:

>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase.
>> The pull request now contains seven commits:
>>
>> - Implement an interface for auto-vectorization to consult supported match rules
>>
>> Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701
>> - Merge branch 'master' into fg8283091
>>
>> Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd
>> - Merge branch 'master' into fg8283091
>>
>> Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f
>> - Merge branch 'master' into fg8283091
>>
>> Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83
>> - Add micro-benchmark cases
>>
>> Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8
>> - Merge branch 'master' into fg8283091
>>
>> Change-Id: I674581135fd0844accc65520574fcef161eededa
>> - 8283091: Support type conversion between different data sizes in SLP
>>
>> After JDK-8275317, C2's SLP vectorizer has supported type conversion
>> between the same data size. We can also support conversions between
>> different data sizes like:
>>     int <-> double
>>     float <-> long
>>     int <-> long
>>     float <-> double
>>
>> A typical test case:
>>
>>     int[] a;
>>     double[] b;
>>     for (int i = start; i < limit; i++) {
>>         b[i] = (double) a[i];
>>     }
>>
>> Our expected OptoAssembly code for one iteration is like below:
>>
>>     add R12, R2, R11, LShiftL #2
>>     vector_load V16,[R12, #16]
>>     vectorcast_i2d V16, V16 # convert I to D vector
>>     add R11, R1, R11, LShiftL #3 # ptr
>>     add R13, R11, #16 # ptr
>>     vector_store [R13], V16
>>
>> To enable the vectorization, the patch solves the following problems
>> in the SLP.
>>
>> There are three main operations in the case above, LoadI, ConvI2D and
>> StoreD. Assuming that the vector length is 128 bits, how many scalar
>> nodes should be packed together to a vector? If we decide it
>> separately for each operation node, like what we did before the patch
>> in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI
>> or 2 ConvI2D or 2 StoreD nodes.
>> However, if we put these packed nodes
>> in a vector node sequence, like loading 4 elements to a vector, then
>> typecasting 2 elements and lastly storing these 2 elements, they become
>> invalid. As a result, we should look through the whole def-use chain
>> and then pick up the minimum of these element sizes, like function
>> SuperWord::max_vector_size_in_ud_chain() does in superword.cpp.
>> In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then
>> generate a valid vector node sequence, like loading 2 elements,
>> converting the 2 elements to another type and storing the 2 elements
>> with the new type.
>>
>> After this, LoadI nodes don't make full use of the whole vector and
>> only occupy part of it. So we adapt the code in
>> SuperWord::get_vw_bytes_special() to the situation.
>>
>> In SLP, we calculate a kind of alignment as position trace for each
>> scalar node in the whole vector. In this case, the alignments for 2
>> LoadI nodes are 0, 4 while the alignments for 2 ConvI2D nodes are 0, 8.
>> Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which
>> mark that this node is the second node in the whole vector, while the
>> difference between 4 and 8 is just because of their own data sizes. In
>> this situation, we should try to remove the impact caused by different
>> data sizes in SLP. For example, in the stage of
>> SuperWord::extend_packlist(), while determining if it's possible to
>> pack a pair of def nodes in the function SuperWord::follow_use_defs(),
>> we remove the side effect of different data sizes by transforming the
>> target alignment from the use node. Because we believe that, assuming
>> that the vector length is 512 bits, if the ConvI2D use nodes have
>> alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12,
>> these two LoadI nodes should be packed as a pair as well.
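
[Editorial illustration, not part of the quoted commit message.] The two rules described in the quoted paragraphs can be sketched in a few lines of Java. This is only a minimal model of the arithmetic, assuming that the lane count of a chain is the smallest per-node lane count and that a use node's alignment is rescaled to the def node's element size; the class and method names (`SlpAlignmentSketch`, `commonLaneCount`, `defAlignment`) are hypothetical and do not correspond to HotSpot functions.

```java
import java.util.List;

public class SlpAlignmentSketch {
    /**
     * Number of lanes that every node in a def-use chain can share:
     * the minimum of (vector bytes / element bytes) over the chain.
     */
    public static int commonLaneCount(int vectorBytes, List<Integer> elementBytesInChain) {
        int lanes = Integer.MAX_VALUE;
        for (int elemBytes : elementBytesInChain) {
            lanes = Math.min(lanes, vectorBytes / elemBytes);
        }
        return lanes;
    }

    /**
     * Rescale a use node's byte alignment to the def node's element size,
     * so both describe the same lane position in the vector.
     */
    public static int defAlignment(int useAlignment, int useElemBytes, int defElemBytes) {
        return useAlignment / useElemBytes * defElemBytes;
    }

    public static void main(String[] args) {
        // LoadI (4 bytes) -> ConvI2D (8 bytes) -> StoreD (8 bytes) in a 128-bit vector.
        System.out.println(commonLaneCount(16, List.of(4, 8, 8)));
        // ConvI2D alignments 16 and 24 map to LoadI alignments 8 and 12.
        System.out.println(defAlignment(16, 8, 4));
        System.out.println(defAlignment(24, 8, 4));
    }
}
```

With the 128-bit example from the commit message the chain packs 2 lanes, and the 512-bit example's ConvI2D alignments 16 and 24 map to the LoadI alignments 8 and 12 mentioned in the text.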
>>
>> Similarly, when determining if the vectorization is profitable, type
>> conversion between different data sizes takes a type of one size and
>> produces a type of another size, hence the special checks on alignment
>> and size should be applied, like what we do in SuperWord::is_vector_use.
>>
>> After solving these problems, we successfully implemented the
>> vectorization of type conversion between different data sizes.
>>
>> Here is the test data on NEON:
>>
>> Before the patch:
>> Benchmark              (length)  Mode  Cnt    Score    Error  Units
>> VectorLoop.convertD2F       523  avgt   15  216.431 ±  0.131  ns/op
>> VectorLoop.convertD2I       523  avgt   15  220.522 ±  0.311  ns/op
>> VectorLoop.convertF2D       523  avgt   15  217.034 ±  0.292  ns/op
>> VectorLoop.convertF2L       523  avgt   15  231.634 ±  1.881  ns/op
>> VectorLoop.convertI2D       523  avgt   15  229.538 ±  0.095  ns/op
>> VectorLoop.convertI2L       523  avgt   15  214.822 ±  0.131  ns/op
>> VectorLoop.convertL2F       523  avgt   15  230.188 ±  0.217  ns/op
>> VectorLoop.convertL2I       523  avgt   15  162.234 ±  0.235  ns/op
>>
>> After the patch:
>> Benchmark              (length)  Mode  Cnt    Score    Error  Units
>> VectorLoop.convertD2F       523  avgt   15  124.352 ±  1.079  ns/op
>> VectorLoop.convertD2I       523  avgt   15  557.388 ±  8.166  ns/op
>> VectorLoop.convertF2D       523  avgt   15  118.082 ±  4.026  ns/op
>> VectorLoop.convertF2L       523  avgt   15  225.810 ± 11.180  ns/op
>> VectorLoop.convertI2D       523  avgt   15  166.247 ±  0.120  ns/op
>> VectorLoop.convertI2L       523  avgt   15  119.699 ±  2.925  ns/op
>> VectorLoop.convertL2F       523  avgt   15  220.847 ±  0.053  ns/op
>> VectorLoop.convertL2I       523  avgt   15  122.339 ±  2.738  ns/op
>>
>> perf data on X86:
>> Before the patch:
>> Benchmark              (length)  Mode  Cnt    Score    Error  Units
>> VectorLoop.convertD2F       523  avgt   15  279.466 ±  0.069  ns/op
>> VectorLoop.convertD2I       523  avgt   15  551.009 ±  7.459  ns/op
>> VectorLoop.convertF2D       523  avgt   15  276.066 ±  0.117  ns/op
>> VectorLoop.convertF2L       523  avgt   15  545.108 ±  5.697  ns/op
>> VectorLoop.convertI2D       523  avgt   15  745.303 ±  0.185  ns/op
>> VectorLoop.convertI2L       523  avgt   15  260.878 ±  0.044  ns/op
>> VectorLoop.convertL2F       523  avgt   15  502.016 ±  0.172  ns/op
>> VectorLoop.convertL2I       523  avgt   15  261.654 ±  3.326  ns/op
>>
>> After the patch:
>> Benchmark              (length)  Mode  Cnt    Score    Error  Units
>> VectorLoop.convertD2F       523  avgt   15  106.975 ±  0.045  ns/op
>> VectorLoop.convertD2I       523  avgt   15  546.866 ±  9.287  ns/op
>> VectorLoop.convertF2D       523  avgt   15   82.414 ±  0.340  ns/op
>> VectorLoop.convertF2L       523  avgt   15  542.235 ±  2.785  ns/op
>> VectorLoop.convertI2D       523  avgt   15   92.966 ±  1.400  ns/op
>> VectorLoop.convertI2L       523  avgt   15   79.960 ±  0.528  ns/op
>> VectorLoop.convertL2F       523  avgt   15  504.712 ±  4.794  ns/op
>> VectorLoop.convertL2I       523  avgt   15  129.753 ±  0.094  ns/op
>>
>> perf data on AVX512:
>> Before the patch:
>> Benchmark              (length)  Mode  Cnt    Score    Error  Units
>> VectorLoop.convertD2F       523  avgt   15  282.984 ±  4.022  ns/op
>> VectorLoop.convertD2I       523  avgt   15  543.080 ±  3.873  ns/op
>> VectorLoop.convertF2D       523  avgt   15  273.950 ±  0.131  ns/op
>> VectorLoop.convertF2L       523  avgt   15  539.568 ±  2.747  ns/op
>> VectorLoop.convertI2D       523  avgt   15  745.238 ±  0.069  ns/op
>> VectorLoop.convertI2L       523  avgt   15  260.935 ±  0.169  ns/op
>> VectorLoop.convertL2F       523  avgt   15  501.870 ±  0.359  ns/op
>> VectorLoop.convertL2I       523  avgt   15  257.508 ±  0.174  ns/op
>>
>> After the patch:
>> Benchmark              (length)  Mode  Cnt    Score    Error  Units
>> VectorLoop.convertD2F       523  avgt   15   76.687 ±  0.530  ns/op
>> VectorLoop.convertD2I       523  avgt   15  545.408 ±  4.657  ns/op
>> VectorLoop.convertF2D       523  avgt   15  273.935 ±  0.099  ns/op
>> VectorLoop.convertF2L       523  avgt   15  540.534 ±  3.032  ns/op
>> VectorLoop.convertI2D       523  avgt   15  745.234 ±  0.053  ns/op
>> VectorLoop.convertI2L       523  avgt   15  260.865 ±  0.104  ns/op
>> VectorLoop.convertL2F       523  avgt   15   63.834 ±  4.777  ns/op
>> VectorLoop.convertL2I       523  avgt   15   48.183 ±  0.990  ns/op
>>
>> Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef
>
> src/hotspot/share/opto/vectornode.cpp line 258:
>
>> 256:       return Op_VectorCastF2X;
>> 257:     case Op_ConvD2L:
>> 258:       return Op_VectorCastD2X;
>
> Why did you remove these lines?

@vnkozlov @sviswa7, I removed these lines because I now call another API, [VectorCastNode::opcode()](https://github.com/openjdk/jdk/blob/ba9ee8cb286268f1d6a2820508334aaaf3131e15/src/hotspot/share/opto/vectornode.cpp#L1203), to generate vector opcodes and vector nodes for conversion nodes. It is more precise than this one and unified with the Vector API. Because converting short to double and converting int to double use the same scalar opcode `ConvI2D`, we can't differentiate between them based only on the opcode of the current node. I adjusted the callers `VectorCastNode::implemented()` and `SuperWord::output()` to account for the removal.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7806

From jiefu at openjdk.java.net Fri Jun 3 01:11:33 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Fri, 3 Jun 2022 01:11:33 GMT
Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 [v3]
In-Reply-To:
References:
Message-ID:

On Fri, 3 Jun 2022 00:54:35 GMT, Sandhya Viswanathan wrote:

>> Fixed the assertion in load_iota_indices when the length passed is less than 4.
>> Also fixed the missing break in x86.ad match_rule_supported_vector() for PopulateIndex case.
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   Remove requires from test

LGTM
Thanks for the update.

-------------

Marked as reviewed by jiefu (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8961

From kvn at openjdk.java.net Fri Jun 3 01:44:19 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 3 Jun 2022 01:44:19 GMT
Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 [v3]
In-Reply-To:
References:
Message-ID:

On Fri, 3 Jun 2022 00:54:35 GMT, Sandhya Viswanathan wrote:

>> Fixed the assertion in load_iota_indices when the length passed is less than 4.
>> Also fixed the missing break in x86.ad match_rule_supported_vector() for PopulateIndex case.
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   Remove requires from test

Looks good. I submitted testing.

-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8961

From sviswanathan at openjdk.java.net Fri Jun 3 02:05:35 2022
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Fri, 3 Jun 2022 02:05:35 GMT
Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 [v3]
In-Reply-To:
References:
Message-ID: <22K9kHJxgl01Wswwzq6I3cf_k23AsPoEH_wkE6ToZfs=.9d5981e5-6b5d-46f7-aef0-95b0a2c19ece@github.com>

On Fri, 3 Jun 2022 01:42:00 GMT, Vladimir Kozlov wrote:

>> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Remove requires from test
>
> Looks good. I submitted testing.

Thanks a lot @vnkozlov @DamonFool.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8961

From jiefu at openjdk.java.net Fri Jun 3 02:09:31 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Fri, 3 Jun 2022 02:09:31 GMT
Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3]
In-Reply-To: <1XRNVIUQjE2jEYRR766gwn2TFc3SXGH6H_XiORuCywk=.b518ba0e-632c-42e0-a0a9-4779221b50da@github.com>
References: <1XRNVIUQjE2jEYRR766gwn2TFc3SXGH6H_XiORuCywk=.b518ba0e-632c-42e0-a0a9-4779221b50da@github.com>
Message-ID:

On Thu, 2 Jun 2022 22:12:40 GMT, Sandhya Viswanathan wrote:

>> @merykitty @sviswa7 Could you help confirm if byte/short shift operations are vectorizable with all AVX versions of x86?
>
> @pfustc They are available either directly or through a series of instructions for the majority of architectures. All are implemented in x86.ad. The match_rule_supported and match_rule_supported_vector will appropriately return true/false.

RShiftVB would fail with `UseSSE < 4`.
But is there an x86 machine running jtreg tests with `UseSSE < 4`?

For `RShiftV{B,S}`, neither `match_rule_supported` nor `match_rule_supported_vector` would return false on x86 with AVX=1.
So the test seems fine on x86, right? @sviswa7 @pfustc

-------------

PR: https://git.openjdk.java.net/jdk/pull/7979

From jiefu at openjdk.java.net Fri Jun 3 02:19:28 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Fri, 3 Jun 2022 02:19:28 GMT
Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3]
In-Reply-To:
References: <1XRNVIUQjE2jEYRR766gwn2TFc3SXGH6H_XiORuCywk=.b518ba0e-632c-42e0-a0a9-4779221b50da@github.com>
Message-ID:

On Fri, 3 Jun 2022 02:05:56 GMT, Jie Fu wrote:

> But is there an x86 machine running jtreg tests with `UseSSE < 4`?
>

To clarify: is there an x86 CPU that only supports the `UseSSE < 4` ISA running jtreg tests?

-------------

PR: https://git.openjdk.java.net/jdk/pull/7979

From kvn at openjdk.java.net Fri Jun 3 02:49:34 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 3 Jun 2022 02:49:34 GMT
Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v4]
In-Reply-To:
References:
Message-ID:

On Thu, 2 Jun 2022 17:49:04 GMT, Sandhya Viswanathan wrote:

>> We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2.
>> The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%.
>> The performance regression is due to auto-vectorization of small loops.
>> We don't have AVX3Threshold consideration in auto-vectorization.
>> The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors.
>>
>> This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on the JVM command line.
>>
>> Please review.
>>
>> Best Regards,
>> Sandhya
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   Review comment resolution

Regression testing results are good. Waiting for performance results.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8877

From kvn at openjdk.java.net Fri Jun 3 03:16:25 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 3 Jun 2022 03:16:25 GMT
Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v6]
In-Reply-To:
References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com>
Message-ID:

On Fri, 3 Jun 2022 01:03:59 GMT, Fei Gao wrote:

>> src/hotspot/share/opto/vectornode.cpp line 258:
>>
>>> 256:       return Op_VectorCastF2X;
>>> 257:     case Op_ConvD2L:
>>> 258:       return Op_VectorCastD2X;
>>
>> Why did you remove these lines?
>
> @vnkozlov @sviswa7, I removed these lines because I now call another API, [VectorCastNode::opcode()](https://github.com/openjdk/jdk/blob/ba9ee8cb286268f1d6a2820508334aaaf3131e15/src/hotspot/share/opto/vectornode.cpp#L1203), to generate vector opcodes and vector nodes for conversion nodes. It is more precise than this one and unified with the Vector API. Because converting short to double and converting int to double use the same scalar opcode `ConvI2D`, we can't differentiate between them based only on the opcode of the current node. I adjusted the callers `VectorCastNode::implemented()` and `SuperWord::output()` to account for the removal.

I see the issue. Calls to `VectorCastNode::opcode()` use the basic type of the Ideal node, which hides sub-integer types. That is why you need `in->bottom_type()->is_vect()->element_basic_type()`.

Maybe we should have an assert here to make sure that we call `VectorCastNode::opcode()` for `Conv*` nodes in all places:

    default:
      assert(!VectorNode::is_convert_opcode(sopc), "Convert node %s should be processed by VectorCastNode::opcode()", NodeClassNames[sopc]);
      return 0; // Unimplemented

-------------

PR: https://git.openjdk.java.net/jdk/pull/7806

From kvn at openjdk.java.net Fri Jun 3 03:22:36 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 3 Jun 2022 03:22:36 GMT
Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3]
In-Reply-To:
References: <1XRNVIUQjE2jEYRR766gwn2TFc3SXGH6H_XiORuCywk=.b518ba0e-632c-42e0-a0a9-4779221b50da@github.com>
Message-ID:

On Fri, 3 Jun 2022 02:15:57 GMT, Jie Fu wrote:

>> RShiftVB would fail with `UseSSE < 4`.
>> But is there an x86 machine running jtreg tests with `UseSSE < 4`?
>>
>> For `RShiftV{B,S}`, neither `match_rule_supported` nor `match_rule_supported_vector` would return false on x86 with AVX=1.
>> So the test seems fine on x86, right? @sviswa7 @pfustc
>
>> But is there an x86 machine running jtreg tests with `UseSSE < 4`?
>>
> To clarify: is there an x86 CPU that only supports the `UseSSE < 4` ISA running jtreg tests?

We do run vectorization (superword, vectorization, VectorAPI) tests in our testing with:
`-XX:UseAVX=0 -XX:UseSSE=3` and `-XX:UseAVX=0 -XX:UseSSE=2`

-------------

PR: https://git.openjdk.java.net/jdk/pull/7979

From fgao at openjdk.java.net Fri Jun 3 03:38:35 2022
From: fgao at openjdk.java.net (Fei Gao)
Date: Fri, 3 Jun 2022 03:38:35 GMT
Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v6]
In-Reply-To:
References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com>
Message-ID:

On Fri, 3 Jun 2022 03:13:02 GMT, Vladimir Kozlov wrote:

> Maybe we should have an assert here to make sure that we call `VectorCastNode::opcode()` for `Conv*` nodes in all places:
>
> ```
> default:
>   assert(!VectorNode::is_convert_opcode(sopc), "Convert node %s should be processed by VectorCastNode::opcode()", NodeClassNames[sopc]);
>   return 0; // Unimplemented
> ```

Makes sense to me. Thanks!

-------------

PR: https://git.openjdk.java.net/jdk/pull/7806

From jiefu at openjdk.java.net Fri Jun 3 03:49:23 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Fri, 3 Jun 2022 03:49:23 GMT
Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3]
In-Reply-To:
References: <1XRNVIUQjE2jEYRR766gwn2TFc3SXGH6H_XiORuCywk=.b518ba0e-632c-42e0-a0a9-4779221b50da@github.com>
Message-ID:

On Fri, 3 Jun 2022 03:20:27 GMT, Vladimir Kozlov wrote:

> We do run vectorization (superword, vectorization, VectorAPI) tests in our testing with: `-XX:UseAVX=0 -XX:UseSSE=3` and `-XX:UseAVX=0 -XX:UseSSE=2`

`compiler/c2/irTests/TestVectorizeURShiftSubword.java` still passed with `-XX:UseAVX=0 -XX:UseSSE=3` and `-XX:UseAVX=0 -XX:UseSSE=2`.

Maybe the IR test framework wouldn't accept VM args like `-XX:UseAVX=? -XX:UseSSE=?`?

-------------

PR: https://git.openjdk.java.net/jdk/pull/7979

From jiefu at openjdk.java.net Fri Jun 3 03:57:32 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Fri, 3 Jun 2022 03:57:32 GMT
Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3]
In-Reply-To:
References: <1XRNVIUQjE2jEYRR766gwn2TFc3SXGH6H_XiORuCywk=.b518ba0e-632c-42e0-a0a9-4779221b50da@github.com>
Message-ID:

On Fri, 3 Jun 2022 03:46:18 GMT, Jie Fu wrote:

>> We do run vectorization (superword, vectorization, VectorAPI) tests in our testing with:
>> `-XX:UseAVX=0 -XX:UseSSE=3` and `-XX:UseAVX=0 -XX:UseSSE=2`
>
>> We do run vectorization (superword, vectorization, VectorAPI) tests in our testing with: `-XX:UseAVX=0 -XX:UseSSE=3` and `-XX:UseAVX=0 -XX:UseSSE=2`
>
>
> `compiler/c2/irTests/TestVectorizeURShiftSubword.java` still passed with `-XX:UseAVX=0 -XX:UseSSE=3` and `-XX:UseAVX=0 -XX:UseSSE=2`.
>
> Maybe the IR test framework wouldn't accept VM args like `-XX:UseAVX=? -XX:UseSSE=?`?

Two questions @fg1417:
1. Is `RShiftVB` generated with `-XX:UseAVX=0 -XX:UseSSE=2` on x86?
2. If the answer to question 1 is no, why does `TestVectorizeURShiftSubword.java` pass?

Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7979

From kvn at openjdk.java.net Fri Jun 3 05:32:25 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 3 Jun 2022 05:32:25 GMT
Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 [v3]
In-Reply-To:
References:
Message-ID:

On Fri, 3 Jun 2022 00:54:35 GMT, Sandhya Viswanathan wrote:

>> Fixed the assertion in load_iota_indices when the length passed is less than 4.
>> Also fixed the missing break in x86.ad match_rule_supported_vector() for PopulateIndex case.
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   Remove requires from test

Testing results are good. You can push.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8961

From rrich at openjdk.java.net Fri Jun 3 07:57:22 2022
From: rrich at openjdk.java.net (Richard Reingruber)
Date: Fri, 3 Jun 2022 07:57:22 GMT
Subject: RFR: 8287205: generate_cont_thaw generates dead code after jump to exception handler [v2]
In-Reply-To:
References:
Message-ID:

> This fix avoids generating unreachable instructions after the jump to the exception handler in `generate_cont_thaw()`
>
> Testing:
>
> jtreg:test/hotspot/jtreg:hotspot_loom
> jtreg:test/jdk:jdk_loom
>
> On linux x86_64 and aarch64.
>
> On aarch64 the test `jdk/java/lang/management/ThreadMXBean/VirtualThreadDeadlocks.java` had a timeout. The aarch64 machine I used is very slow. This might have caused the timeout.

Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:

 - Merge branch 'master' into 8287205_remove_dead_code_from_generate_cont_thaw
 - Merge branch 'master' into 8287205_remove_dead_code_from_generate_cont_thaw
 - Remove dead code from generate_cont_thaw

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8863/files
  - new: https://git.openjdk.java.net/jdk/pull/8863/files/9489400f..c82674a7

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8863&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8863&range=00-01

  Stats: 81814 lines in 880 files changed: 30489 ins; 44462 del; 6863 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8863.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8863/head:pull/8863

PR: https://git.openjdk.java.net/jdk/pull/8863

From chagedorn at openjdk.java.net Fri Jun 3 08:00:41 2022
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Fri, 3 Jun 2022 08:00:41 GMT
Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 [v3]
In-Reply-To:
References:
Message-ID:

On Fri, 3 Jun 2022 00:54:35 GMT, Sandhya Viswanathan wrote:

>> Fixed the assertion in load_iota_indices when the length passed is less than 4.
>> Also fixed the missing break in x86.ad match_rule_supported_vector() for PopulateIndex case.
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   Remove requires from test

Otherwise, looks good! Thanks for adding a test.

test/hotspot/jtreg/compiler/vectorization/cr8287517.java line 34:

> 32: package compiler.vectorization;
> 33:
> 34: public class cr8287517 {

Please rename the test to something that describes the problem you are writing this test for instead of just using the bug number.

-------------

Marked as reviewed by chagedorn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8961

From chagedorn at openjdk.java.net Fri Jun 3 08:07:34 2022
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Fri, 3 Jun 2022 08:07:34 GMT
Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v28]
In-Reply-To: <6fvXf0Lpbpbs0fQXNQLimROBnrAUIfJrUoHv3Fd7AkE=.0375bdec-8c24-4f30-9ccc-17095ed973ec@github.com>
References: <6fvXf0Lpbpbs0fQXNQLimROBnrAUIfJrUoHv3Fd7AkE=.0375bdec-8c24-4f30-9ccc-17095ed973ec@github.com>
Message-ID:

On Thu, 2 Jun 2022 14:48:36 GMT, Emanuel Peter wrote:

>> **What this gives you for the debugger**
>> - BFS traversal (inputs / outputs)
>> - node filtering by category
>> - shortest path between nodes
>> - all paths between nodes
>> - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional)
>> - and more
>>
>> **Some use cases**
>> - more readable `dump`
>> - follow only nodes of some categories (only control, only data, etc)
>> - find which control nodes depend on data node (visit data nodes, include control in boundary)
>> - how two nodes relate (shortest / all paths, following input/output nodes, or both)
>> - find loops (control / memory / data: call all paths with node as start and target)
>>
>> **Description**
>> I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`). To find all paths between two nodes, include the letter `A` in the options string.
>>
>> `void Node::dump_bfs(const int max_distance, Node* target, char const* options)`
>>
>> To get familiar with the many options, run this to get help:
>> `find_node(0)->dump_bfs(0,0,"h")`
>>
>> While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here.
>>
>> Please let me know if you would find this helpful, or if you have any feedback to improve it.
>> Thanks, Emanuel
>>
>> PS: I do plan to refactor the `dump` code in `node.cpp` to use my new infrastructure. I will also remove `Node::related` and `dump_related`, since it has not been properly extended and maintained. But that refactoring would risk messing with tools that depend on `dump`, which I would like to avoid for now, and do that in a second step.
>>
>> **Better dump()**
>> The function is similar to `dump()`; we can also follow input / output edges up to a certain distance. There are a few improvements/added features:
>>
>> 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next.
>> 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`.
>> 3. Node filtering with types taken from `type->category()` helps one limit traversal to control, data or memory flow. The categories are the same as in IGV.
>> 4. Separate visit / boundary filters by node type: traverse graph visiting only some node types (e.g. data). On the boundary, also display but do not traverse nodes allowed by boundary filter (e.g. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for visit filter, and `DCMXO` for boundary filter.
>> 5. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors.
>> 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. Use `@` in options string.
>> 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string.
>> 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string!
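
[Editorial illustration, not part of the quoted PR description.] The visit / boundary distinction in item 4 can be sketched with a small stand-alone BFS. This is only a minimal sketch in Java, assuming a toy node type with a one-letter category and input edges; `BfsFilterSketch`, its `Node` class and `bfs` are hypothetical names and are not the `Node::dump_bfs` implementation from the PR.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BfsFilterSketch {
    /** Hypothetical stand-in for an IR node: an idx, a category letter and input edges. */
    public static final class Node {
        public final int idx;
        public final char category;                  // 'c' control, 'd' data, 'm' memory, ...
        public final List<Node> inputs = new ArrayList<>();
        public Node(int idx, char category) { this.idx = idx; this.category = category; }
    }

    /**
     * BFS over input edges up to maxDistance. Nodes whose category is in `visit`
     * are recorded and expanded; nodes whose category is only in `boundary`
     * are recorded but not expanded. Returns idx -> distance in discovery order.
     */
    public static Map<Integer, Integer> bfs(Node start, int maxDistance,
                                            Set<Character> visit, Set<Character> boundary) {
        Map<Integer, Integer> distance = new LinkedHashMap<>();
        ArrayDeque<Node> worklist = new ArrayDeque<>();
        distance.put(start.idx, 0);                  // the start node is always shown
        worklist.add(start);
        while (!worklist.isEmpty()) {
            Node n = worklist.poll();
            int d = distance.get(n.idx);
            if (d == maxDistance) continue;          // do not go past the distance limit
            for (Node in : n.inputs) {
                if (distance.containsKey(in.idx)) continue;
                if (visit.contains(in.category)) {
                    distance.put(in.idx, d + 1);
                    worklist.add(in);                // visited: keep traversing from here
                } else if (boundary.contains(in.category)) {
                    distance.put(in.idx, d + 1);     // boundary: shown, never expanded
                }
            }
        }
        return distance;
    }

    public static void main(String[] args) {
        Node add = new Node(2, 'd');
        Node phi = new Node(1, 'd');
        Node region = new Node(3, 'c');
        Node ifTrue = new Node(4, 'c');
        add.inputs.add(phi);
        phi.inputs.add(region);
        region.inputs.add(ifTrue);
        // Visit data nodes, show control nodes on the boundary only.
        System.out.println(bfs(add, 5, Set.of('d'), Set.of('c')));
    }
}
```

In this toy graph the control node 3 is reported because it sits on the boundary of the data traversal, while control node 4 behind it is never reached, which is the behaviour item 4 describes for finding the control nodes that a data node depends on.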
>> >> Example (BFS inputs): >> >> (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") >> d dump >> --------------------------------------------- >> 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] >> 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> >> >> Example (BFS control inputs): >> >> (rr) p find_node(163)->dump_bfs(5,0,"c+") >> d dump >> --------------------------------------------- >> 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] >> 4 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 1 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) >> 0 163 OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 >> >> We see the control flow of a strip mined loop. 
>> >> >> Experiment (BFS only data, but display all nodes on boundary) >> >> (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") >> d dump >> --------------------------------------------- >> 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) >> 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) >> 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) >> 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) >> 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] >> >> We see the dependent output nodes of the data-phi 102, we see that a SafePoint and the Return depend on it. Here colors are really helpful, as it makes it easy to separate the data-nodes (blue) from the boundary-nodes (other colors). 
>> >> Example with Mach nodes: >> >> (rr) p find_node(112)->dump_bfs(2,0,"cdmxo+#@B") >> d [head idom d] old dump >> --------------------------------------------- >> 2 534 505 6 o1871 109 addI_rReg_imm === _ 44 [[ 110 102 113 230 327 ]] #-3/0xfffffffd >> 2 536 537 15 o186 139 addI_rReg_imm === _ 137 [[ 140 137 113 144 ]] #4/0x00000004 !jvms: StringLatin1::replace @ bci:13 (line 303) >> 2 537 538 14 o179 114 IfTrue === 115 [[ 536 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 536 537 15 o739 113 compI_rReg === _ 139 109 [[ 112 ]] >> 1 536 537 15 _ 536 Region === 536 114 [[ 536 112 ]] >> 0 536 537 15 o741 112 jmpLoopEnd === 536 113 [[ 134 111 ]] P=0.993611, C=7200.000000 !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> And the query on the old nodes: >> >> (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#") >> d dump >> --------------------------------------------- >> 2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]] >> 2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]] >> 2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000 >> 1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]] >> 1 o740 Bool === _ o739 [[ o741 ]] [lt] >> 1 o179 IfTrue === o178 [[ o741 ]] #1 >> 0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000 >> >> >> **Exploring loop body** >> When I find myself in a loop, I try to locate the loop head and end, and map out at least one path between them. >> `loop_end->dump_bfs(20, loop_head, "c+")` >> This provides us with a shortest control path, provided that this path has a distance of at most 20.
>> >> Example (shortest path over control nodes): >> >> (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+") >> d dump >> --------------------------------------------- >> 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> >> Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lay on a path between the start and target node, with at most the specified path length. 
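One way to answer such an all-paths query is to run two plain BFS passes, one from the start over input edges and one from the target over the reversed edges, and keep every node whose two distances add up to at most the requested length. The Java sketch below shows this under stated assumptions: `AllPathsSketch`, its methods and the small graph are hypothetical, the start and target are assumed to be different nodes, and nothing here is taken from the actual `dump_bfs` code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch; node ids and edges are made up for illustration.
final class AllPathsSketch {
    /** Plain BFS over an adjacency map; returns node -> shortest distance from source. */
    static Map<Integer, Integer> bfs(Map<Integer, List<Integer>> edges, int source) {
        Map<Integer, Integer> dist = new HashMap<>();
        Deque<Integer> worklist = new ArrayDeque<>();
        dist.put(source, 0);
        worklist.add(source);
        while (!worklist.isEmpty()) {
            int n = worklist.poll();
            for (int next : edges.getOrDefault(n, List.of())) {
                if (!dist.containsKey(next)) {
                    dist.put(next, dist.get(n) + 1);
                    worklist.add(next);
                }
            }
        }
        return dist;
    }

    /** Turns an input-edge map (node -> its inputs) into an output-edge map (node -> its users). */
    static Map<Integer, List<Integer>> reverse(Map<Integer, List<Integer>> edges) {
        Map<Integer, List<Integer>> reversed = new HashMap<>();
        edges.forEach((from, targets) -> {
            for (int to : targets) {
                reversed.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
            }
        });
        return reversed;
    }

    /**
     * Returns node -> {distance from start, distance from start + distance to target}
     * for every node whose combined distance is at most maxLength.
     */
    static SortedMap<Integer, int[]> allPaths(Map<Integer, List<Integer>> inputs,
                                              int start, int target, int maxLength) {
        Map<Integer, Integer> fromStart = bfs(inputs, start);
        Map<Integer, Integer> toTarget = bfs(reverse(inputs), target);
        SortedMap<Integer, int[]> result = new TreeMap<>();
        fromStart.forEach((node, d) -> {
            Integer rest = toTarget.get(node); // null: the target is unreachable from this node
            if (rest != null && d + rest <= maxLength) {
                result.put(node, new int[] {d, d + rest});
            }
        });
        return result;
    }

    /** Two routes from 0 to 5: 0-1-3-5 (length 3) and 0-2-4-6-7-5 (length 5); 9 is a dead end. */
    static Map<Integer, List<Integer>> demoInputs() {
        return Map.of(
            0, List.of(1, 2),
            1, List.of(3, 9),
            2, List.of(4),
            3, List.of(5),
            4, List.of(6),
            6, List.of(7),
            7, List.of(5));
    }

    public static void main(String[] args) {
        allPaths(demoInputs(), 0, 5, 5).forEach((node, v) ->
            System.out.println(v[0] + " " + v[1] + " " + node));
    }
}
```

With a maximum length of 4 only the short route survives; raising it to 5 adds the longer route, while the dead-end node 9 is never reported because the target cannot be reached from it.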
>> >> Example (all paths between two nodes): >> >> (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") >> d apd dump >> --------------------------------------------- >> 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) >> 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) >> 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) >> 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) >> 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) >> 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all-paths distance `apd` is the sum of the length of the shortest path from the current node to the start and the length of the shortest path to the target node. Thus, we can easily compute the distance to the target node with `apd - d`. >> >> An alternative way to detect loops quickly is to run an all paths query from a node to itself: >> >> Example (loop detection with all paths): >> >> (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A") >> d apd dump >> --------------------------------------------- >> 6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303) >> 5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> We get the loop control, plus the loop-back `190 IfTrue`. >> >> Example (loop detection with all paths for phi): >> >> (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A") >> d apd dump >> --------------------------------------------- >> 1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],...
!jvms: StringLatin1::replace @ bci:13 (line 303) >> 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) >> >> >> **Color examples** >> Colors are especially useful to see chains between nodes (options character `#`). >> The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. >> Tip: it can be worth it to configure the colors of your terminal to be more appealing. >> >> Example (find control dependency of data node): >> ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) >> We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. >> >> Example (find memory dependency of data node): >> ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) >> >> Example (loop detection): >> ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) >> We find the control and some data loop paths. > > Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: > > - missing style thing from last commit > - another one of Christian's reviews Looks good, thanks for doing the updates. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8468 From jbhateja at openjdk.java.net Fri Jun 3 09:38:04 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 3 Jun 2022 09:38:04 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v9] In-Reply-To: References: Message-ID: > Summary of changes: > > - Patch intrinsifies following newly added Java SE APIs > - Integer.compress > - Integer.expand > - Long.compress > - Long.expand > > - Adds C2 IR nodes and corresponding ideal transformations for new operations. > - We see around ~10x performance speedup due to intrinsification over X86 target. > - Adds an IR framework based test to validate newly introduced IR transformations. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: - 8283894: Adding gtest based constant folding test case. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 - 8283894: Extending new IR value routines with value propagation logic. - 8283894: Disabling sanity test as per review suggestion. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 - 8283894: Removing CompressExpandSanityTest from problem list. - 8283894: Updating test tag spec. - 8283894: Review comments resolved. - 8283894: Add missing -XX:+UnlockDiagnosticVMOptions. - ... 
and 4 more: https://git.openjdk.java.net/jdk/compare/407abf5d...72895639 ------------- Changes: https://git.openjdk.java.net/jdk/pull/8498/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=08 Stats: 1133 lines in 20 files changed: 1103 ins; 18 del; 12 mod Patch: https://git.openjdk.java.net/jdk/pull/8498.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8498/head:pull/8498 PR: https://git.openjdk.java.net/jdk/pull/8498 From fgao at openjdk.java.net Fri Jun 3 10:06:44 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Fri, 3 Jun 2022 10:06:44 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: <1XRNVIUQjE2jEYRR766gwn2TFc3SXGH6H_XiORuCywk=.b518ba0e-632c-42e0-a0a9-4779221b50da@github.com> Message-ID: On Fri, 3 Jun 2022 03:54:15 GMT, Jie Fu wrote: > Two questions @fg1417 : > > 1. Is `RShiftVB` generated with `-XX:UseAVX=0 -XX:UseSSE=2` on x86? > 2. If the answer is no for question 1, why `TestVectorizeURShiftSubword.java` get passed? > Thanks. @DamonFool , no `RShiftB` nodes are generated with `-XX:UseAVX=0 -XX:UseSSE=2` on x86. The test passed because when adding VM options `-XX:UseAVX=0 -XX:UseSSE=2` externally, the IR framework won't do the IR check and just run the java code, I verified and made the guess. But I'm not clear whether all external VM options would stop the IR framework to do the IR check. ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From fgao at openjdk.java.net Fri Jun 3 10:06:44 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Fri, 3 Jun 2022 10:06:44 GMT Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 [v3] In-Reply-To: References: Message-ID: On Fri, 3 Jun 2022 00:54:35 GMT, Sandhya Viswanathan wrote: >> Fixed the assertion in load_iota_indices when the length passed is less than 4. >> Also fixed the missing break in x86.ad match_rule_supported_vector() for PopulateIndex case. 
> > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Remove requires from test Marked as reviewed by fgao (Author). ------------- PR: https://git.openjdk.java.net/jdk/pull/8961 From duke at openjdk.java.net Fri Jun 3 10:25:57 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Fri, 3 Jun 2022 10:25:57 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v8] In-Reply-To: References: Message-ID: <0H14skDFABRCwukZLQ6sqMih5bXVxJZc21u_4hu5rHY=.d286babe-8cc2-44b4-a192-324286471563@github.com> > Hi, this patch improves some operations on x86_64: > > - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible: > + Bounded operands > + Multiple uops both in fused and unfused domains > + May result in flag stall since the operations have unpredictable flag output > > - Flag to general-purpose registers operation currently uses `cmovcc`, which requires set up and 1 more spare register for constant, this could be replaced by set, which transforms the sequence: > > xorl dst, dst > sometest > movl tmp, 0x01 > cmovlcc dst, tmp > > into: > > xorl dst, dst > sometest > setbcc dst > > This sequence does not need a spare register and without any drawbacks. > (Note: `movzx` does not work since move elision only occurs with different registers for input and output) > > - Some small improvements: > + Add memory variances to `tzcnt` and `lzcnt` > + Add memory variances to `rolx` and `rorx` > + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`) > > The speedup can be observed for variable shift instructions > > Before: > Benchmark (size) Mode Cnt Score Error Units > Integers.shiftLeft 500 avgt 5 0.836 ± 0.030 us/op > Integers.shiftRight 500 avgt 5 0.843 ± 0.056 us/op > Integers.shiftURight 500 avgt 5 0.830 ± 0.057 us/op > Longs.shiftLeft 500 avgt 5 0.827 ±
0.026 us/op > Longs.shiftRight 500 avgt 5 0.828 ± 0.018 us/op > Longs.shiftURight 500 avgt 5 0.829 ± 0.038 us/op > > After: > Benchmark (size) Mode Cnt Score Error Units > Integers.shiftLeft 500 avgt 5 0.761 ± 0.016 us/op > Integers.shiftRight 500 avgt 5 0.762 ± 0.071 us/op > Integers.shiftURight 500 avgt 5 0.765 ± 0.056 us/op > Longs.shiftLeft 500 avgt 5 0.755 ± 0.026 us/op > Longs.shiftRight 500 avgt 5 0.753 ± 0.017 us/op > Longs.shiftURight 500 avgt 5 0.759 ± 0.031 us/op > > For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation. > > Thank you very much. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: revert conv2b ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7968/files - new: https://git.openjdk.java.net/jdk/pull/7968/files/337c0bf3..11ebf586 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7968&range=07 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7968&range=06-07 Stats: 12 lines in 1 file changed: 2 ins; 2 del; 8 mod Patch: https://git.openjdk.java.net/jdk/pull/7968.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7968/head:pull/7968 PR: https://git.openjdk.java.net/jdk/pull/7968 From duke at openjdk.java.net Fri Jun 3 10:26:04 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Fri, 3 Jun 2022 10:26:04 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v7] In-Reply-To: References: <4akCq1xQS8yg3EWmE8DCxAFxvTkn-3Jnrl8hH0yqFkc=.969ede12-85ee-4809-a080-8d09d7b59a38@github.com> Message-ID: On Thu, 2 Jun 2022 21:47:04 GMT, Dean Long wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 15 commits: >> >> - Resolve conflict >> - ins_cost >> - movzx is not elided with same input and output >> - fix only the needs >> - fix >> - cisc >> - delete benchmark command >> - pipe >> - fix, benchmarks >> - pipe_class >> - ... and 5 more: https://git.openjdk.java.net/jdk/compare/e5041ae3...337c0bf3 > > src/hotspot/cpu/x86/x86_64.ad line 10766: > >> 10764: format %{ "xorl $dst, $dst\t# ci2b\n\t" >> 10765: "testl $src, $src\n\t" >> 10766: "setnz $dst" %} > > What's the advantage of this change? The disadvantage is a spare TEMP register is needed -- we can't reuse src as dst. Yes you are right, the change reduces 1 cycle of latency but may require another register so I reverted it. Thanks a lot. ------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From roland at openjdk.java.net Fri Jun 3 11:59:14 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 3 Jun 2022 11:59:14 GMT Subject: RFR: 8287227: Shenandoah: A couple of virtual thread tests failed with iu mode even without Loom enabled. [v2] In-Reply-To: References: Message-ID: > With JDK-8277654, the load barrier slow path call doesn't produce raw > memory anymore but the IU barrier call still does. I propose removing > raw memory for that call too which also causes the assert that fails > to be removed. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - new fix - Merge branch 'master' into JDK-8287227 - Revert "fix" This reverts commit aa6f80a7883ee7032f81dbffac5d0257491d7118. 
- fix ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8958/files - new: https://git.openjdk.java.net/jdk/pull/8958/files/aa6f80a7..5699e042 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8958&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8958&range=00-01 Stats: 54148 lines in 735 files changed: 28730 ins; 18849 del; 6569 mod Patch: https://git.openjdk.java.net/jdk/pull/8958.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8958/head:pull/8958 PR: https://git.openjdk.java.net/jdk/pull/8958 From shade at openjdk.java.net Fri Jun 3 11:59:14 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Fri, 3 Jun 2022 11:59:14 GMT Subject: RFR: 8287227: Shenandoah: A couple of virtual thread tests failed with iu mode even without Loom enabled. [v2] In-Reply-To: References: Message-ID: On Fri, 3 Jun 2022 11:55:08 GMT, Roland Westrelin wrote: >> With JDK-8277654, the load barrier slow path call doesn't produce raw >> memory anymore but the IU barrier call still does. I propose removing >> raw memory for that call too which also causes the assert that fails >> to be removed. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - new fix > - Merge branch 'master' into JDK-8287227 > - Revert "fix" > > This reverts commit aa6f80a7883ee7032f81dbffac5d0257491d7118. > - fix Looks reasonable to me. Please link up JDK-8277654 to this bug. ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8958 From roland at openjdk.java.net Fri Jun 3 11:59:14 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 3 Jun 2022 11:59:14 GMT Subject: RFR: 8287227: Shenandoah: A couple of virtual thread tests failed with iu mode even without Loom enabled. 
In-Reply-To: References: Message-ID: On Thu, 2 Jun 2022 14:39:10 GMT, Roman Kennke wrote: > Is it correct, though? I seem to remember that without the memory edges, we may get reordering of the 'SATB' buffer and index accesses between IU-barriers, which would cause trouble? Ok. I propose a different fix then. The state of the MemoryGraphFixer needs to be updated when LRBs are expanded. It used to be the case before JDK-8277654. ------------- PR: https://git.openjdk.java.net/jdk/pull/8958 From chagedorn at openjdk.java.net Fri Jun 3 12:03:43 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Fri, 3 Jun 2022 12:03:43 GMT Subject: RFR: 8287525: Extend IR annotation with new options to test specific target feature. In-Reply-To: References: Message-ID: On Thu, 2 Jun 2022 17:17:21 GMT, Swati Sharma wrote: > Hi All, > > Currently test invocations are guarded by @requires vm.cpu.feature tags which are specified as part of the test tag specifications. This results in generating multiple test cases if some test points in a test file need to be guarded by specific features while others should still be executed in the absence of the target feature. > > This is especially important for validation based on IR checks, since the creation of C2 IR nodes may rely heavily on the existence of a specific target feature. Also, the test harness executes test points only if all the constraints specified in the tag specifications are met, so imposing OR semantics between @requires tag based CPU features becomes tricky. > > The patch extends the existing @IR annotation with the following two new options: > > - applyIfTargetFeatureAnd: > Accepts a list of feature pairs where each pair is composed of a target feature string followed by a true/false value, where a true value requires the existence of the target feature and vice-versa. IR verification checks are enforced only if all the specified feature constraints are met.
> - applyIfTargetFeatureOr: Accepts similar arguments as above option but IR verifications checks are enforced only when at least one of the specified feature constraints are met. > > Example usage: > @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureOr = {"avx512bw", "true", "avx512f", "true"}) > @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureAnd = {"avx512bw", "true", "avx512f", "true"}) > > Please review and share your feedback. > > Thanks, > Swati That's a good feature to have. Thanks for adding support for it! This will definitely simplify the vector IR tests. test/hotspot/jtreg/compiler/lib/ir_framework/IR.java line 113: > 111: * IR verifications checks are enforced if any of the specified feature constraint is met. > 112: */ > 113: String[] applyIfTargetFeatureOr() default {}; I'm not sure if we should follow the existing scheme to also have at least `applyIfTargetFeature` for a single constraint or not. Back there when I've introduced these constraints I was not happy with writing `applyIfAnd/Or` for many tests where I actually did not care about `AND` and `OR`. I guess you can leave it like that and we can come back to this and maybe clean these things up with [JDK-8280120](https://bugs.openjdk.java.net/browse/JDK-8280120) which wants to introduce another attribute to filter based on the architecture. test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 122: > 120: boolean check = hasRequiredFeaturesAnd(irAnno.applyIfTargetFeatureAnd(), "applyIfTargetFeatureAnd"); > 121: if (!check) { > 122: System.out.println("Disabling IR validation for " + m + ", all feature constraints not met."); Note that this message will be printed inside the test VM together with a lot of other messages (`-XX:+PrintCompilation` etc.). This output will not be shown by default. So, it will be hard to find this message again and it might not provide an additional value. 
For printing log messages inside the test VM, you can use the `TestFrameworkSocket` which pipes messages to the JTreg driver VM to print them there. Since the JTreg driver VM only does a minimal printing, it can easily be found again towards the end of the output under `"Messages from Test VM"`. Specify a tag and then you can use it like that: TestFrameworkSocket.write("Disabling IR matching for " + m + ": Not all feature constraints met.", "[IREncodingPrinter]", true); JTreg Output: STDOUT: Run Flag VM: [...] Messages from Test VM --------------------- [IREncodingPrinter] Disabling IR matching for test2: Could not match all feature constraints. [...] I guess we could use the same kind of messages for the other `applyIf*` methods above. If you want to add them as well, feel free to do so. Otherwise, this could also be done separately in an RFE at some point. test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 159: > 157: if (irAnno.applyIfTargetFeatureAnd().length != 0) { > 158: applyRules++; > 159: TestFormat.checkNoThrow((irAnno.applyIfTargetFeatureAnd().length & 1) == 0, I suggest to use: Suggestion: TestFormat.checkNoThrow((irAnno.applyIfTargetFeatureAnd().length % 2) == 0, instead which seems cleaner. Same for the other check below. test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 206: > 204: } > 205: > 206: private boolean hasRequiredFeaturesAnd(String[] andRules, String ruleType) { `ruleType` is always `applyIfTargetFeatureAnd` and can be replaced as such. Suggestion for method name: `hasAllRequiredTargetFeatures()` to follow the existing naming scheme. test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 210: > 208: String feature = andRules[i]; > 209: i++; > 210: String value = andRules[i]; You should trim the user defined strings with `trim()` - just in case. Same for `hasRequiredFeaturesOr()` below. 
test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 213: > 211: TestFormat.check((value.contains("true") || value.contains("false")), "Incorrect value in " + ruleType + failAt()); > 212: if (!checkTargetFeature(feature, value)) { > 213: // Rule will not be applied but keep processing the other flags to verify that they are same. Typo: Suggestion: // Rule will not be applied but keep processing the other target features to verify that they are sane. test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 214: > 212: if (!checkTargetFeature(feature, value)) { > 213: // Rule will not be applied but keep processing the other flags to verify that they are same. > 214: return false; You should cache the return value to keep processing the remaining target feature user strings to check for format errors (similar to what you are doing in `hasRequiredFeaturesOr()`). We should probably separate the format checking and the actual evaluation of the values at some point. But that would exceed the scope of this RFE. test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 220: > 218: } > 219: > 220: private boolean hasRequiredFeaturesOr(String[] orRules, String ruleType) { `ruleType` is always `applyIfTargetFeatureOr` and can be replaced as such. Suggestion for method name: `hasAnyRequiredTargetFeature()` (I think the related existing `hasNoRequiredFlags()` method should be renamed/refactored to follow that convention as well but that's also for another day). 
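The review points above (trim the user strings, check that the array holds pairs, and keep processing after a failed constraint so later pairs are still format checked) can be put together in a small standalone sketch. It is not the code from the PR: `TargetFeatureRuleSketch` is a hypothetical class, the CPU features are passed in as a lower-case `Set<String>` instead of being read from `WHITE_BOX.getCPUFeatures()`, and format errors are reported with `IllegalArgumentException` instead of `TestFormat`.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of AND/OR feature constraints; not the IREncodingPrinter code.
final class TargetFeatureRuleSketch {
    /** Parses "true"/"false" after trimming; anything else is a format error. */
    private static boolean parseValue(String raw) {
        String value = raw.trim();
        if (!value.equals("true") && !value.equals("false")) {
            throw new IllegalArgumentException("Incorrect value: " + raw);
        }
        return value.equals("true");
    }

    /** True if the (feature, value) pair is satisfied by the given lower-case feature set. */
    static boolean checkTargetFeature(String feature, String value, Set<String> cpuFeatures) {
        boolean wanted = parseValue(value);
        boolean present = cpuFeatures.contains(feature.trim().toLowerCase(Locale.ROOT));
        return wanted == present;
    }

    private static void checkPairs(String[] rules) {
        if (rules.length % 2 != 0) {
            throw new IllegalArgumentException("Expected feature/value pairs");
        }
    }

    /** AND semantics: every pair must hold. */
    static boolean hasAllRequiredTargetFeatures(String[] andRules, Set<String> cpuFeatures) {
        checkPairs(andRules);
        boolean allMet = true;
        for (int i = 0; i < andRules.length; i += 2) {
            // No early return: the remaining pairs are still parsed, so a malformed
            // value later in the list is reported even if an earlier pair failed.
            if (!checkTargetFeature(andRules[i], andRules[i + 1], cpuFeatures)) {
                allMet = false;
            }
        }
        return allMet;
    }

    /** OR semantics: at least one pair must hold. */
    static boolean hasAnyRequiredTargetFeature(String[] orRules, Set<String> cpuFeatures) {
        checkPairs(orRules);
        boolean anyMet = false;
        for (int i = 0; i < orRules.length; i += 2) {
            if (checkTargetFeature(orRules[i], orRules[i + 1], cpuFeatures)) {
                anyMet = true;
            }
        }
        return anyMet;
    }

    public static void main(String[] args) {
        Set<String> cpu = Set.of("avx", "avx2", "avx512f"); // made-up feature set
        String[] rules = {"avx512bw", "true", "avx512f", "true"};
        System.out.println("AND: " + hasAllRequiredTargetFeatures(rules, cpu));
        System.out.println("OR:  " + hasAnyRequiredTargetFeature(rules, cpu));
    }
}
```

Using an exact match against a set avoids the substring matches that `String.contains` on the raw feature string would allow, which is a design choice of this sketch and not something the review asked for.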
test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 233: > 231: > 232: private boolean checkTargetFeature(String feature, String value) { > 233: String s = WHITE_BOX.getCPUFeatures(); I suggest naming it `cpuFeatures` instead: Suggestion: String cpuFeatures = WHITE_BOX.getCPUFeatures(); test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 236: > 234: // Following feature list is in sync with suppressed feature list for KNL target. > 235: // Please refer vm_version_x86.cpp for details. > 236: HashSet<String> knlFeatureSet = new HashSet<String>(); The explicit type argument can be removed: Suggestion: HashSet<String> knlFeatureSet = new HashSet<>(); test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 250: > 248: knlFeatureSet.add("GFNI"); > 249: knlFeatureSet.add("AVX512_BITALG"); > 250: Boolean isKNLFlagEnabled = (Boolean)WHITE_BOX.getBooleanVMFlag("UseKNLSettings"); The cast is not necessary: Suggestion: Boolean isKNLFlagEnabled = WHITE_BOX.getBooleanVMFlag("UseKNLSettings"); test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 257: > 255: (value.contains("false") && !s.contains(feature))) { > 256: return true; > 257: } Could be simplified to: Suggestion: return (value.contains("true") && s.contains(feature)) || (value.contains("false") && !s.contains(feature)); test/hotspot/jtreg/testlibrary_tests/ir_framework/examples/TargetFeatureCheckExample.java line 59: > 57: > 58: @Test > 59: @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureAnd = {"avx512bw", "false"}) As this file represents a usage example, please consider using `IRNode.AddVI` by adding a new IR regex to `IRNode`. But it looks like this test is rather a correctness test for `AddVI` and could be moved to the other vector IR tests. The tests in the `ir_framework.package` are more for providing information about the usage rather than actually testing something meaningful.
But I think it's good to have an example how to use `applyIfTargetFeature*` as well. But maybe such an example would better fit into the existing `IRExample.java` file where we have existing IR examples and descriptions for `applyIf*`. ------------- Changes requested by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8999 From jiefu at openjdk.java.net Fri Jun 3 12:05:23 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 3 Jun 2022 12:05:23 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: <1XRNVIUQjE2jEYRR766gwn2TFc3SXGH6H_XiORuCywk=.b518ba0e-632c-42e0-a0a9-4779221b50da@github.com> Message-ID: On Fri, 3 Jun 2022 10:02:48 GMT, Fei Gao wrote: > > @DamonFool , no `RShiftB` nodes are generated with `-XX:UseAVX=0 -XX:UseSSE=2` on x86. The test passed because when adding VM options `-XX:UseAVX=0 -XX:UseSSE=2` externally, the IR framework won't do the IR check and just run the java code, I verified and made the guess. But I'm not clear whether all external VM options would stop the IR framework to do the IR check. Thanks for the verification. Then, I think the test is fine. What do you think? 
@vnkozlov ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From dlong at openjdk.java.net Fri Jun 3 15:16:26 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 3 Jun 2022 15:16:26 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v8] In-Reply-To: <0H14skDFABRCwukZLQ6sqMih5bXVxJZc21u_4hu5rHY=.d286babe-8cc2-44b4-a192-324286471563@github.com> References: <0H14skDFABRCwukZLQ6sqMih5bXVxJZc21u_4hu5rHY=.d286babe-8cc2-44b4-a192-324286471563@github.com> Message-ID: On Fri, 3 Jun 2022 10:25:57 GMT, Quan Anh Mai wrote: >> Hi, this patch improves some operations on x86_64: >> >> - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible: >> + Bounded operands >> + Multiple uops both in fused and unfused domains >> + May result in flag stall since the operations have unpredictable flag output >> >> - Flag to general-purpose registers operation currently uses `cmovcc`, which requires set up and 1 more spare register for constant, this could be replaced by set, which transforms the sequence: >> >> xorl dst, dst >> sometest >> movl tmp, 0x01 >> cmovlcc dst, tmp >> >> into: >> >> xorl dst, dst >> sometest >> setbcc dst >> >> This sequence does not need a spare register and without any drawbacks. >> (Note: `movzx` does not work since move elision only occurs with different registers for input and output) >> >> - Some small improvements: >> + Add memory variances to `tzcnt` and `lzcnt` >> + Add memory variances to `rolx` and `rorx` >> + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`) >> >> The speedup can be observed for variable shift instructions >> >> Before: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.836 ? 0.030 us/op >> Integers.shiftRight 500 avgt 5 0.843 ? 0.056 us/op >> Integers.shiftURight 500 avgt 5 0.830 ? 
0.057 us/op >> Longs.shiftLeft 500 avgt 5 0.827 ? 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.828 ? 0.018 us/op >> Longs.shiftURight 500 avgt 5 0.829 ? 0.038 us/op >> >> After: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.761 ? 0.016 us/op >> Integers.shiftRight 500 avgt 5 0.762 ? 0.071 us/op >> Integers.shiftURight 500 avgt 5 0.765 ? 0.056 us/op >> Longs.shiftLeft 500 avgt 5 0.755 ? 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.753 ? 0.017 us/op >> Longs.shiftURight 500 avgt 5 0.759 ? 0.031 us/op >> >> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > revert conv2b Marked as reviewed by dlong (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From duke at openjdk.java.net Fri Jun 3 15:25:37 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Fri, 3 Jun 2022 15:25:37 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v8] In-Reply-To: <0H14skDFABRCwukZLQ6sqMih5bXVxJZc21u_4hu5rHY=.d286babe-8cc2-44b4-a192-324286471563@github.com> References: <0H14skDFABRCwukZLQ6sqMih5bXVxJZc21u_4hu5rHY=.d286babe-8cc2-44b4-a192-324286471563@github.com> Message-ID: <7LhOVN4DQ5EoiWHDnJ_ifLh9TKlQBVgigRjIv74zH84=.62840516-127c-4df3-9748-2803e87810b8@github.com> On Fri, 3 Jun 2022 10:25:57 GMT, Quan Anh Mai wrote: >> Hi, this patch improves some operations on x86_64: >> >> - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible: >> + Bounded operands >> + Multiple uops both in fused and unfused domains >> + May result in flag stall since the operations have unpredictable flag output >> >> - Flag to general-purpose registers operation currently uses `cmovcc`, which requires set 
up and 1 more spare register for constant, this could be replaced by set, which transforms the sequence: >> >> xorl dst, dst >> sometest >> movl tmp, 0x01 >> cmovlcc dst, tmp >> >> into: >> >> xorl dst, dst >> sometest >> setbcc dst >> >> This sequence does not need a spare register and without any drawbacks. >> (Note: `movzx` does not work since move elision only occurs with different registers for input and output) >> >> - Some small improvements: >> + Add memory variances to `tzcnt` and `lzcnt` >> + Add memory variances to `rolx` and `rorx` >> + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`) >> >> The speedup can be observed for variable shift instructions >> >> Before: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.836 ? 0.030 us/op >> Integers.shiftRight 500 avgt 5 0.843 ? 0.056 us/op >> Integers.shiftURight 500 avgt 5 0.830 ? 0.057 us/op >> Longs.shiftLeft 500 avgt 5 0.827 ? 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.828 ? 0.018 us/op >> Longs.shiftURight 500 avgt 5 0.829 ? 0.038 us/op >> >> After: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.761 ? 0.016 us/op >> Integers.shiftRight 500 avgt 5 0.762 ? 0.071 us/op >> Integers.shiftURight 500 avgt 5 0.765 ? 0.056 us/op >> Longs.shiftLeft 500 avgt 5 0.755 ? 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.753 ? 0.017 us/op >> Longs.shiftURight 500 avgt 5 0.759 ? 0.031 us/op >> >> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > revert conv2b Thank you very much for the reviews. 
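The quoted note that `rolx dst, imm` is actually `rorx dst, size - imm` rests on a rotate identity that can be checked at the Java level. The following is only a minimal sketch of that identity, not code from the patch; the class name `RotateIdentitySketch` and its helper methods are made up for illustration.

```java
// Sketch only: checks the rotate identity behind "rolx dst, imm == rorx dst, size - imm".
// RotateIdentitySketch and its methods are hypothetical names, not part of the patch.
class RotateIdentitySketch {
    // Rotating an int left by k gives the same bits as rotating right by (32 - k).
    static boolean holdsForInt(int x, int k) {
        return Integer.rotateLeft(x, k) == Integer.rotateRight(x, Integer.SIZE - k);
    }

    // Same identity for long, with size 64.
    static boolean holdsForLong(long x, int k) {
        return Long.rotateLeft(x, k) == Long.rotateRight(x, Long.SIZE - k);
    }

    public static void main(String[] args) {
        int[] intSamples = {0, 1, -1, 0x12345678, Integer.MIN_VALUE};
        long[] longSamples = {0L, 1L, -1L, 0x123456789ABCDEFL, Long.MIN_VALUE};
        for (int x : intSamples) {
            for (int k = 0; k < Integer.SIZE; k++) {
                if (!holdsForInt(x, k)) throw new AssertionError("int " + x + " k=" + k);
            }
        }
        for (long x : longSamples) {
            for (int k = 0; k < Long.SIZE; k++) {
                if (!holdsForLong(x, k)) throw new AssertionError("long " + x + " k=" + k);
            }
        }
        System.out.println("rotate identity holds for all sampled values");
    }
}
```

Because the identity holds for every rotate distance, a constant left rotate can be expressed with a right-rotate instruction alone, which appears to be what the added `rolx` rules rely on.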
------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From dlong at openjdk.java.net Fri Jun 3 15:40:38 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 3 Jun 2022 15:40:38 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v7] In-Reply-To: References: <4akCq1xQS8yg3EWmE8DCxAFxvTkn-3Jnrl8hH0yqFkc=.969ede12-85ee-4809-a080-8d09d7b59a38@github.com> Message-ID: On Fri, 3 Jun 2022 10:21:49 GMT, Quan Anh Mai wrote: >> src/hotspot/cpu/x86/x86_64.ad line 10766: >> >>> 10764: format %{ "xorl $dst, $dst\t# ci2b\n\t" >>> 10765: "testl $src, $src\n\t" >>> 10766: "setnz $dst" %} >> >> What's the advantage of this change? The disadvantage is a spare TEMP register is needed -- we can't reuse src as dst. > > Yes you are right, the change reduces 1 cycle of latency but may require another register so I reverted it. Thanks a lot. Does this pattern give any improvement? testl $src, $src movl $dst, 0 setnz $dst ------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From duke at openjdk.java.net Fri Jun 3 15:45:37 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Fri, 3 Jun 2022 15:45:37 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v7] In-Reply-To: References: <4akCq1xQS8yg3EWmE8DCxAFxvTkn-3Jnrl8hH0yqFkc=.969ede12-85ee-4809-a080-8d09d7b59a38@github.com> Message-ID: On Fri, 3 Jun 2022 15:37:07 GMT, Dean Long wrote: >> Yes you are right, the change reduces 1 cycle of latency but may require another register so I reverted it. Thanks a lot. > > Does this pattern give any improvement? > > testl $src, $src > movl $dst, 0 > setnz $dst `movl r, 0` is not a zero idiom, so a partial register write later of `setcc` would lead to register stall when the value is read. 
------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From dlong at openjdk.java.net Fri Jun 3 15:52:36 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 3 Jun 2022 15:52:36 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v7] In-Reply-To: References: <4akCq1xQS8yg3EWmE8DCxAFxvTkn-3Jnrl8hH0yqFkc=.969ede12-85ee-4809-a080-8d09d7b59a38@github.com> Message-ID: On Fri, 3 Jun 2022 15:41:50 GMT, Quan Anh Mai wrote: >> Does this pattern give any improvement? >> >> testl $src, $src >> movl $dst, 0 >> setnz $dst > > `movl r, 0` is not a zero idiom, so a partial register write later of `setcc` would lead to register stall when the value is read. OK, thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From kvn at openjdk.java.net Fri Jun 3 15:56:44 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 3 Jun 2022 15:56:44 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v8] In-Reply-To: <0H14skDFABRCwukZLQ6sqMih5bXVxJZc21u_4hu5rHY=.d286babe-8cc2-44b4-a192-324286471563@github.com> References: <0H14skDFABRCwukZLQ6sqMih5bXVxJZc21u_4hu5rHY=.d286babe-8cc2-44b4-a192-324286471563@github.com> Message-ID: On Fri, 3 Jun 2022 10:25:57 GMT, Quan Anh Mai wrote: >> Hi, this patch improves some operations on x86_64: >> >> - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible: >> + Bounded operands >> + Multiple uops both in fused and unfused domains >> + May result in flag stall since the operations have unpredictable flag output >> >> - Flag to general-purpose registers operation currently uses `cmovcc`, which requires set up and 1 more spare register for constant, this could be replaced by set, which transforms the sequence: >> >> xorl dst, dst >> sometest >> movl tmp, 0x01 >> cmovlcc dst, tmp >> >> into: >> >> xorl dst, dst >> sometest >> setbcc dst >> >> This 
sequence does not need a spare register and without any drawbacks. >> (Note: `movzx` does not work since move elision only occurs with different registers for input and output) >> >> - Some small improvements: >> + Add memory variances to `tzcnt` and `lzcnt` >> + Add memory variances to `rolx` and `rorx` >> + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`) >> >> The speedup can be observed for variable shift instructions >> >> Before: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.836 ± 0.030 us/op >> Integers.shiftRight 500 avgt 5 0.843 ± 0.056 us/op >> Integers.shiftURight 500 avgt 5 0.830 ± 0.057 us/op >> Longs.shiftLeft 500 avgt 5 0.827 ± 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.828 ± 0.018 us/op >> Longs.shiftURight 500 avgt 5 0.829 ± 0.038 us/op >> >> After: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.761 ± 0.016 us/op >> Integers.shiftRight 500 avgt 5 0.762 ± 0.071 us/op >> Integers.shiftURight 500 avgt 5 0.765 ± 0.056 us/op >> Longs.shiftLeft 500 avgt 5 0.755 ± 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.753 ± 0.017 us/op >> Longs.shiftURight 500 avgt 5 0.759 ± 0.031 us/op >> >> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > revert conv2b I will run testing of the latest version before sponsoring.
------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From kvn at openjdk.java.net Fri Jun 3 16:02:38 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 3 Jun 2022 16:02:38 GMT Subject: RFR: 8287205: generate_cont_thaw generates dead code after jump to exception handler [v2] In-Reply-To: References: Message-ID: On Fri, 3 Jun 2022 07:57:22 GMT, Richard Reingruber wrote: >> This fix avoids generating unreachable instructions after the jump to the exception handler in `generate_cont_thaw()` >> >> Testing: >> >> jtreg:test/hotspot/jtreg:hotspot_loom >> jtreg:test/jdk:jdk_loom >> >> On linux x86_64 and aarch64. >> >> On aarch64 the test `jdk/java/lang/management/ThreadMXBean/VirtualThreadDeadlocks.java` had a timeout. The aaarch64 machine I used is very slow. This might have caused the timeout. > > Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into 8287205_remove_dead_code_from_generate_cont_thaw > - Merge branch 'master' into 8287205_remove_dead_code_from_generate_cont_thaw > - Remove dead code from generate_cont_thaw Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8863 From kvn at openjdk.java.net Fri Jun 3 16:08:36 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 3 Jun 2022 16:08:36 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v28] In-Reply-To: <6fvXf0Lpbpbs0fQXNQLimROBnrAUIfJrUoHv3Fd7AkE=.0375bdec-8c24-4f30-9ccc-17095ed973ec@github.com> References: <6fvXf0Lpbpbs0fQXNQLimROBnrAUIfJrUoHv3Fd7AkE=.0375bdec-8c24-4f30-9ccc-17095ed973ec@github.com> Message-ID: On Thu, 2 Jun 2022 14:48:36 GMT, Emanuel Peter wrote: >> **What this gives you for the debugger** >> - BFS traversal (inputs / outputs) >> - node filtering by category >> - shortest path between nodes >> - all paths between nodes >> - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional) >> - and more >> >> **Some usecases** >> - more readable `dump` >> - follow only nodes of some categories (only control, only data, etc) >> - find which control nodes depend on data node (visit data nodes, include control in boundary) >> - how two nodes relate (shortest / all paths, following input/output nodes, or both) >> - find loops (control / memory / data: call all paths with node as start and target) >> >> **Description** >> I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`). To find all paths between two nodes, include the letter `A` in the options string. 
>> >> `void Node::dump_bfs(const int max_distance, Node* target, char const* options)` >> >> To get familiar with the many options, run this to get help: >> `find_node(0)->dump_bfs(0,0,"h")` >> >> While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. >> >> Please let me know if you would find this helpful, or if you have any feedback to improve it. >> Thanks, Emanuel >> >> PS: I do plan to refactor the `dump` code in `node.cpp` to use my new infrastructure. I will also remove `Node::related` and `dump_related,` since it has not been properly extended and maintained. But that refactoring would risk messing with tools that depend on `dump`, which I would like to avoid for now, and do that in a second step. >> >> **Better dump()** >> The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: >> >> 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. >> 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`. >> 3. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. >> 4. Separate visit / boundary filters by node type: traverse graph visiting only some node types (eg. data). On the boundary, also display but do not traverse nodes allowed by boundary filter (eg. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for visit filter, and `DCMXO` for boundary filter. >> 5. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. 
But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors. >> 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. Use `@` in options string. >> 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string. >> 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string! >> >> Example (BFS inputs): >> >> (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") >> d dump >> --------------------------------------------- >> 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] >> 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> >> >> Example (BFS control inputs): >> >> (rr) p find_node(163)->dump_bfs(5,0,"c+") >> d dump >> --------------------------------------------- >> 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] >> 4 166 
CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 1 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) >> 0 163 OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 >> >> We see the control flow of a strip mined loop. >> >> >> Experiment (BFS only data, but display all nodes on boundary) >> >> (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") >> d dump >> --------------------------------------------- >> 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) >> 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) >> 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) >> 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) >> 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] >> >> We see the dependent output nodes of the data-phi 102, we see that a SafePoint and the Return depend on it. Here colors are really helpful, as it makes it easy to separate the data-nodes (blue) from the boundary-nodes (other colors). 
>> >> Example with Mach nodes: >> >> (rr) p find_node(112)->dump_bfs(2,0,"cdmxo+#@B") >> d [head idom d] old dump >> --------------------------------------------- >> 2 534 505 6 o1871 109 addI_rReg_imm === _ 44 [[ 110 102 113 230 327 ]] #-3/0xfffffffd >> 2 536 537 15 o186 139 addI_rReg_imm === _ 137 [[ 140 137 113 144 ]] #4/0x00000004 !jvms: StringLatin1::replace @ bci:13 (line 303) >> 2 537 538 14 o179 114 IfTrue === 115 [[ 536 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 536 537 15 o739 113 compI_rReg === _ 139 109 [[ 112 ]] >> 1 536 537 15 _ 536 Region === 536 114 [[ 536 112 ]] >> 0 536 537 15 o741 112 jmpLoopEnd === 536 113 [[ 134 111 ]] P=0.993611, C=7200.000000 !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> And the query on the old nodes: >> >> (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#") >> d dump >> --------------------------------------------- >> 2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]] >> 2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]] >> 2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000 >> 1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]] >> 1 o740 Bool === _ o739 [[ o741 ]] [lt] >> 1 o179 IfTrue === o178 [[ o741 ]] #1 >> 0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000 >> >> >> **Exploring loop body** >> When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. >> `loop_end->print_bfs(20, loop_head, "c+")` >> This provides us with a shortest control path, given this path has a distance of at most 20. 
>> >> Example (shortest path over control nodes): >> >> (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+") >> d dump >> --------------------------------------------- >> 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> >> Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lay on a path between the start and target node, with at most the specified path length. 
>> >> Example (all paths between two nodes): >> >> (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") >> d apd dump >> --------------------------------------------- >> 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) >> 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) >> 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) >> 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) >> 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) >> 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all paths distance `apd` computes the sum of the shortest path from the current node to the start plus the shortest path to the target node. Thus, we can easily compute the distance to the target node with `apd - d`. >> >> An alternative way to detect loops quickly is to run an all paths query from a node to itself: >> >> Example (loop detection with all paths): >> >> (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A") >> d apd dump >> --------------------------------------------- >> 6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303) >> 5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> We get the loop control, plus the loop-back `190 IfTrue`. >> >> Example (loop detection with all paths for phi): >> >> (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A") >> d apd dump >> --------------------------------------------- >> 1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],...
!jvms: StringLatin1::replace @ bci:13 (line 303) >> 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) >> >> >> **Color examples** >> Colors are especially useful to see chains between nodes (options character `#`). >> The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. >> Tip: it can be worth it to configure the colors of your terminal to be more appealing. >> >> Example (find control dependency of data node): >> ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) >> We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. >> >> Example (find memory dependency of data node): >> ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) >> >> Example (loop detection): >> ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) >> We find the control and some data loop paths. > > Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: > > - missing style thing from last commit > - another one of Christian's reviews Update looks good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8468 From kvn at openjdk.java.net Fri Jun 3 17:45:36 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 3 Jun 2022 17:45:36 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: <1XRNVIUQjE2jEYRR766gwn2TFc3SXGH6H_XiORuCywk=.b518ba0e-632c-42e0-a0a9-4779221b50da@github.com> Message-ID: On Fri, 3 Jun 2022 10:02:48 GMT, Fei Gao wrote: >> Two questions @fg1417 : >> 1. Is `RShiftVB` generated with `-XX:UseAVX=0 -XX:UseSSE=2` on x86? >> 2. 
If the answer is no for question 1, why `TestVectorizeURShiftSubword.java` get passed? >> Thanks. > >> Two questions @fg1417 : >> >> 1. Is `RShiftVB` generated with `-XX:UseAVX=0 -XX:UseSSE=2` on x86? >> 2. If the answer is no for question 1, why `TestVectorizeURShiftSubword.java` get passed? >> Thanks. > > @DamonFool , no `RShiftB` nodes are generated with `-XX:UseAVX=0 -XX:UseSSE=2` on x86. The test passed because when adding VM options `-XX:UseAVX=0 -XX:UseSSE=2` externally, the IR framework won't do the IR check and just run the java code, I verified and made the guess. But I'm not clear whether all external VM options would stop the IR framework to do the IR check. As @fg1417 pointed, IR framework will not do any IR verification testing (it does not run test) when some flags are passed to test and not whitelisted (`UseAVX` and `UseSSE` are not on list): https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/TestFramework.java#L105 I hacked VM to set `UseSSE=2 UseAVX=0` and test failed as expected because it did not find `RShiftVB` as you said: One or more @IR rules failed: Failed IR Rules (2) of Methods (2) With `UseSSE=4 UseAVX=0` combination test passed. Our testing systems don't have anything less than SSE4 CPUs. That is why it passed our testing. I assume Github actions don't have such old systems too. **I don't see these changes passed pre-submit GitHub testing. Make sure they passed.** Currently there are changes in review #8999 which allow to specify CPU features as condition for testing method. I suggest to leave test as it is (if pre-submit GitHub testing passed). If we hit any issues later we will use #8999 API to fix it. 
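For context on what `TestVectorizeURShiftSubword.java` is meant to exercise: the PR under discussion argues that an unsigned right shift on a `short` or `byte`, narrowed back to the subword type, gives the same result as a signed shift when the constant shift amount does not exceed the number of sign-extended bits. The sketch below checks that claim exhaustively in plain Java. It is not the jtreg test itself; the class name `SubwordShiftSketch` is hypothetical, and the bound of 24 for `byte` is my own reading of "in a similar way" from the PR description.

```java
// Sketch only: plain-Java check of the equivalence the vectorization relies on.
// SubwordShiftSketch is a hypothetical name, not the jtreg test from the PR.
class SubwordShiftSketch {
    // True if the unsigned and signed shifts agree after narrowing back to short.
    static boolean sameForShort(short s, int k) {
        return (short) (s >>> k) == (short) (s >> k);
    }

    // True if the unsigned and signed shifts agree after narrowing back to byte.
    static boolean sameForByte(byte b, int k) {
        return (byte) (b >>> k) == (byte) (b >> k);
    }

    public static void main(String[] args) {
        // A short is sign-extended by 16 bits, so shift amounts 0..16 must agree.
        for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++) {
            for (int k = 0; k <= 16; k++) {
                if (!sameForShort((short) v, k)) throw new AssertionError("short " + v + " k=" + k);
            }
        }
        // A byte is sign-extended by 24 bits, so shift amounts 0..24 must agree.
        for (int v = Byte.MIN_VALUE; v <= Byte.MAX_VALUE; v++) {
            for (int k = 0; k <= 24; k++) {
                if (!sameForByte((byte) v, k)) throw new AssertionError("byte " + v + " k=" + k);
            }
        }
        // Beyond the bound the two shifts differ for negative inputs.
        System.out.println("short -1, k=17 agree? " + sameForShort((short) -1, 17));
    }
}
```

The check only covers the scalar semantics. Whether C2 actually emits the vector nodes still depends on the CPU features and flags discussed in this thread.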
------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From sviswanathan at openjdk.java.net Fri Jun 3 17:52:34 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 3 Jun 2022 17:52:34 GMT Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 [v3] In-Reply-To: References: Message-ID: On Fri, 3 Jun 2022 05:29:19 GMT, Vladimir Kozlov wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove requires from test > > Testing results are good. You can push. Thanks a lot @vnkozlov. ------------- PR: https://git.openjdk.java.net/jdk/pull/8961 From sviswanathan at openjdk.java.net Fri Jun 3 18:01:45 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 3 Jun 2022 18:01:45 GMT Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 [v4] In-Reply-To: References: Message-ID: <7s4mEki-BTmyaZkniKkQIMC_0G_W0iTDCz0uh1HjjC4=.60a07a52-4a38-4433-83d9-ecd1d8aebc11@github.com> > Fixed the assertion in load_iota_indices when the length passed is less than 4. > Also fixed the missing break in x86.ad match_rule_supported_vector() for PopulateIndex case. 
Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: Change test name ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8961/files - new: https://git.openjdk.java.net/jdk/pull/8961/files/37639f5f..dd5adaef Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8961&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8961&range=02-03 Stats: 126 lines in 2 files changed: 63 ins; 63 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8961.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8961/head:pull/8961 PR: https://git.openjdk.java.net/jdk/pull/8961 From sviswanathan at openjdk.java.net Fri Jun 3 18:01:48 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 3 Jun 2022 18:01:48 GMT Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 [v3] In-Reply-To: References: Message-ID: On Fri, 3 Jun 2022 07:54:34 GMT, Christian Hagedorn wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove requires from test > > test/hotspot/jtreg/compiler/vectorization/cr8287517.java line 34: > >> 32: package compiler.vectorization; >> 33: >> 34: public class cr8287517 { > > Please rename the test to something that describes the problem you are writing this test for instead of just using the bug number. @chhagedorn Thanks, changed test name to TestSmallVectorPopIndex.java. ------------- PR: https://git.openjdk.java.net/jdk/pull/8961 From sviswanathan at openjdk.java.net Fri Jun 3 18:03:09 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 3 Jun 2022 18:03:09 GMT Subject: Integrated: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 In-Reply-To: References: Message-ID: On Tue, 31 May 2022 23:02:18 GMT, Sandhya Viswanathan wrote: > Fixed the assertion in load_iota_indices when the length passed is less than 4. 
> Also fixed the missing break in x86.ad match_rule_supported_vector() for PopulateIndex case. This pull request has now been integrated. Changeset: a0219da9 Author: Sandhya Viswanathan URL: https://git.openjdk.java.net/jdk/commit/a0219da966f3a1cd12d402a816bdd79be778085e Stats: 65 lines in 3 files changed: 64 ins; 0 del; 1 mod 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 Reviewed-by: kvn, jiefu, chagedorn, fgao ------------- PR: https://git.openjdk.java.net/jdk/pull/8961 From jbhateja at openjdk.java.net Fri Jun 3 18:19:54 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 3 Jun 2022 18:19:54 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v8] In-Reply-To: References: Message-ID: <6ATVagAVNCJuGhnwJ3YTLkl7gs3VyZLzpSzq6pc4TfM=.ba3eaa33-429e-4fb3-b12f-1565bb626b75@github.com> On Tue, 31 May 2022 15:36:06 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 12 additional commits since the last revision: >> >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 >> - 8283894: Extending new IR value routines with value propagation logic. >> - 8283894: Disabling sanity test as per review suggestion. >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 >> - 8283894: Removing CompressExpandSanityTest from problem list. >> - 8283894: Updating test tag spec. >> - 8283894: Review comments resolved. >> - 8283894: Add missing -XX:+UnlockDiagnosticVMOptions. >> - 8283894: Review comments resolutions. >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 >> - ... and 2 more: https://git.openjdk.java.net/jdk/compare/2611d47f...a36dba2e > > Marked as reviewed by psandoz (Reviewer). Hi @PaulSandoz , @vnkozlov ; Can you kindly run this through Oracle test framework. All comments are addressed. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From kvn at openjdk.java.net Fri Jun 3 18:36:41 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 3 Jun 2022 18:36:41 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v10] In-Reply-To: References: Message-ID: On Thu, 2 Jun 2022 19:16:45 GMT, Xin Liu wrote: >> I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. >> >> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. >> >> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. >> >> Before: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op >> >> After: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op >> ``` >> >> Testing >> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet.
> > Xin Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 25 additional commits since the last revision: > > - Merge branch 'master' into JDK-8286104 > - Remame all methods to _unstable_if_trap(s) and group them. > - move preprocess() after remove Useless. > - Refactor per reviewer's feedback. > - Remove useless flag. if jdwp is on, liveness_at_bci() marks all local > variables live. > - support option AggressiveLivessForUnstableIf > - Merge branch 'master' into JDK-8286104 > - update comments. > - Merge branch 'master' into JDK-8286104 > - reimplement process_unstable_ifs > - ... and 15 more: https://git.openjdk.java.net/jdk/compare/b9bf362c...4130cd10 I submitted new testing. ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From kvn at openjdk.java.net Fri Jun 3 20:22:44 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 3 Jun 2022 20:22:44 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v8] In-Reply-To: <0H14skDFABRCwukZLQ6sqMih5bXVxJZc21u_4hu5rHY=.d286babe-8cc2-44b4-a192-324286471563@github.com> References: <0H14skDFABRCwukZLQ6sqMih5bXVxJZc21u_4hu5rHY=.d286babe-8cc2-44b4-a192-324286471563@github.com> Message-ID: On Fri, 3 Jun 2022 10:25:57 GMT, Quan Anh Mai wrote: >> Hi, this patch improves some operations on x86_64: >> >> - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible: >> + Bounded operands >> + Multiple uops both in fused and unfused domains >> + May result in flag stall since the operations have unpredictable flag output >> >> - Flag to general-purpose registers operation currently uses `cmovcc`, which requires set up and 1 more spare register for constant, this could be replaced by set, which transforms the sequence: >> >> xorl dst, dst >> sometest >> 
movl tmp, 0x01 >> cmovlcc dst, tmp >> >> into: >> >> xorl dst, dst >> sometest >> setbcc dst >> >> This sequence does not need a spare register and without any drawbacks. >> (Note: `movzx` does not work since move elision only occurs with different registers for input and output) >> >> - Some small improvements: >> + Add memory variances to `tzcnt` and `lzcnt` >> + Add memory variances to `rolx` and `rorx` >> + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`) >> >> The speedup can be observed for variable shift instructions >> >> Before: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.836 ± 0.030 us/op >> Integers.shiftRight 500 avgt 5 0.843 ± 0.056 us/op >> Integers.shiftURight 500 avgt 5 0.830 ± 0.057 us/op >> Longs.shiftLeft 500 avgt 5 0.827 ± 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.828 ± 0.018 us/op >> Longs.shiftURight 500 avgt 5 0.829 ± 0.038 us/op >> >> After: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.761 ± 0.016 us/op >> Integers.shiftRight 500 avgt 5 0.762 ± 0.071 us/op >> Integers.shiftURight 500 avgt 5 0.765 ± 0.056 us/op >> Longs.shiftLeft 500 avgt 5 0.755 ± 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.753 ± 0.017 us/op >> Longs.shiftURight 500 avgt 5 0.759 ± 0.031 us/op >> >> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > revert conv2b Testing results are good.
------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From duke at openjdk.java.net Fri Jun 3 20:22:45 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Fri, 3 Jun 2022 20:22:45 GMT Subject: Integrated: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 In-Reply-To: References: Message-ID: <7YmjtUPa4_ywiVOzrl4SB9IZCSpgcqmQBa-fxlYCz0g=.0d8851e9-b90d-4ee4-82cf-9993767c03d5@github.com> On Sat, 26 Mar 2022 06:14:29 GMT, Quan Anh Mai wrote: > Hi, this patch improves some operations on x86_64: > > - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible: > + Bounded operands > + Multiple uops both in fused and unfused domains > + May result in flag stall since the operations have unpredictable flag output > > - Flag to general-purpose registers operation currently uses `cmovcc`, which requires set up and 1 more spare register for constant, this could be replaced by set, which transforms the sequence: > > xorl dst, dst > sometest > movl tmp, 0x01 > cmovlcc dst, tmp > > into: > > xorl dst, dst > sometest > setbcc dst > > This sequence does not need a spare register and without any drawbacks. > (Note: `movzx` does not work since move elision only occurs with different registers for input and output) > > - Some small improvements: > + Add memory variances to `tzcnt` and `lzcnt` > + Add memory variances to `rolx` and `rorx` > + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`) > > The speedup can be observed for variable shift instructions > > Before: > Benchmark (size) Mode Cnt Score Error Units > Integers.shiftLeft 500 avgt 5 0.836 ? 0.030 us/op > Integers.shiftRight 500 avgt 5 0.843 ? 0.056 us/op > Integers.shiftURight 500 avgt 5 0.830 ? 0.057 us/op > Longs.shiftLeft 500 avgt 5 0.827 ? 0.026 us/op > Longs.shiftRight 500 avgt 5 0.828 ? 0.018 us/op > Longs.shiftURight 500 avgt 5 0.829 ? 
0.038 us/op > > After: > Benchmark (size) Mode Cnt Score Error Units > Integers.shiftLeft 500 avgt 5 0.761 ? 0.016 us/op > Integers.shiftRight 500 avgt 5 0.762 ? 0.071 us/op > Integers.shiftURight 500 avgt 5 0.765 ? 0.056 us/op > Longs.shiftLeft 500 avgt 5 0.755 ? 0.026 us/op > Longs.shiftRight 500 avgt 5 0.753 ? 0.017 us/op > Longs.shiftURight 500 avgt 5 0.759 ? 0.031 us/op > > For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation. > > Thank you very much. This pull request has now been integrated. Changeset: 0b35460f Author: Quan Anh Mai Committer: Vladimir Kozlov URL: https://git.openjdk.java.net/jdk/commit/0b35460fa00bfdca63a311a7379819cf102dee86 Stats: 602 lines in 8 files changed: 563 ins; 4 del; 35 mod 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 Reviewed-by: kvn, dlong ------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From rrich at openjdk.java.net Fri Jun 3 20:37:22 2022 From: rrich at openjdk.java.net (Richard Reingruber) Date: Fri, 3 Jun 2022 20:37:22 GMT Subject: RFR: 8287205: generate_cont_thaw generates dead code after jump to exception handler [v2] In-Reply-To: References: Message-ID: On Fri, 3 Jun 2022 16:00:31 GMT, Vladimir Kozlov wrote: > Good. Thank's for the reviews Aleksey and Vladimir. I've got 2 weeks off now. Will integrate when back. Richard. ------------- PR: https://git.openjdk.java.net/jdk/pull/8863 From kvn at openjdk.java.net Fri Jun 3 20:52:44 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 3 Jun 2022 20:52:44 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v10] In-Reply-To: References: Message-ID: On Thu, 2 Jun 2022 19:16:45 GMT, Xin Liu wrote: >> I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. 
This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. >> >> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. >> >> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. >> >> Before: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 32.776 ? 0.075 us/op >> >> After: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 3.656 ? 0.006 us/op >> ``` >> >> Testing >> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. > > Xin Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 25 additional commits since the last revision: > > - Merge branch 'master' into JDK-8286104 > - Remame all methods to _unstable_if_trap(s) and group them. > - move preprocess() after remove Useless. > - Refactor per reviewer's feedback. > - Remove useless flag. if jdwp is on, liveness_at_bci() marks all local > variables live. > - support option AggressiveLivessForUnstableIf > - Merge branch 'master' into JDK-8286104 > - update comments. 
> - Merge branch 'master' into JDK-8286104 > - reimplement process_unstable_ifs > - ... and 15 more: https://git.openjdk.java.net/jdk/compare/8199747e...4130cd10 2 tests failed so far. I put information into RFE. ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From sviswanathan at openjdk.java.net Fri Jun 3 21:52:52 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 3 Jun 2022 21:52:52 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 11:09:09 GMT, Fei Gao wrote: >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> >> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) >> >> when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: >> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) >> >> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. 
The generated assembly code for one iteration on aarch64 is like: >> >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision: > > - Rewrite the scalar calculation to avoid inline > > Change-Id: I5959d035278097de26ab3dfe6f667d6f7476c723 > - Merge branch 'master' into fg8283307 > > Change-Id: Id3ec8594da49fb4e6c6dcad888bcb1dfc0aac303 > - Remove related comments in some test files > > Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8 > - Merge branch 'master' into fg8283307 > > Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a > - 8283307: Vectorize unsigned shift right on signed subword types > > ``` > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > ``` > In C2's SLP, vectorization of unsigned shift right on signed > subword types (byte/short) like the case above is intentionally > disabled[1]. Because the vector unsigned shift on signed > subword types behaves differently from the Java spec. It's > worthy to vectorize more cases in quite low cost. Also, > unsigned shift right on signed subword is not uncommon and we > may find similar cases in Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > > Short: > | <- 16 bits -> | <- 16 bits -> | > | 1 1 1 ... 1 1 | data | > > when the shift amount is a constant not greater than the number > of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be > transformed into a signed shift and hence becomes vectorizable. > Here is the transformation: > > For T_SHORT (shift <= 16): > src RShiftCntV shift src RShiftCntV shift > \ / ==> \ / > URShiftVS RShiftVS > > This patch does the transformation in SuperWord::implemented() and > SuperWord::output(). It helps vectorize the short cases above. We > can handle unsigned right shift on byte type in a similar way. The > generated assembly code for one iteration on aarch64 is like: > ``` > ... 
> sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > ``` > > Here is the performance data for micro-benchmark before and after > this patch on both AArch64 and x64 machines. We can observe about > ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161 Very nice work! Patch looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/7979 From kvn at openjdk.java.net Fri Jun 3 23:35:35 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 3 Jun 2022 23:35:35 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v9] In-Reply-To: References: Message-ID: <_IV3WC5R0KF8o3T_ckyUuWx2TS7QKliPdlAU-QranyQ=.2b365206-81f4-4418-90f0-f17d7d13e6d8@github.com> On Fri, 3 Jun 2022 09:38:04 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> - Patch intrinsifies following newly added Java SE APIs >> - Integer.compress >> - Integer.expand >> - Long.compress >> - Long.expand >> >> - Adds C2 IR nodes and corresponding ideal transformations for new operations. >> - We see around ~10x performance speedup due to intrinsification over X86 target. >> - Adds an IR framework based test to validate newly introduced IR transformations. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: > > - 8283894: Adding gtest based constant folding test case. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - 8283894: Extending new IR value routines with value propagation logic. > - 8283894: Disabling sanity test as per review suggestion. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - 8283894: Removing CompressExpandSanityTest from problem list. > - 8283894: Updating test tag spec. > - 8283894: Review comments resolved. > - 8283894: Add missing -XX:+UnlockDiagnosticVMOptions. > - ... and 4 more: https://git.openjdk.java.net/jdk/compare/407abf5d...72895639 I submitted testing. 
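For readers unfamiliar with the operations being intrinsified in 8283894, here is a minimal sketch of what compress and expand compute, written as a plain bit loop. It is only an illustration of the semantics, not the C2 or x86 code from the PR; the class name `CompressExpandSketch` and its helper methods are hypothetical, and only the `int` variants are shown (the `long` variants would loop over 64 bits).

```java
// Illustrative reference semantics only; names are hypothetical and this is
// not the intrinsic implementation discussed in the PR.
final class CompressExpandSketch {

    // Gather the bits of i selected by mask into the low-order bits of the result.
    static int compress(int i, int mask) {
        int result = 0;
        int outBit = 0; // next free low-order position in the result
        for (int bit = 0; bit < 32; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((i >>> bit) & 1) << outBit;
                outBit++;
            }
        }
        return result;
    }

    // Scatter the low-order bits of i to the positions selected by mask.
    static int expand(int i, int mask) {
        int result = 0;
        int inBit = 0; // next low-order bit of i to consume
        for (int bit = 0; bit < 32; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((i >>> inBit) & 1) << bit;
                inBit++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "cabab"
        System.out.println(Integer.toHexString(compress(0xCAFEBABE, 0xFF00FFF0)));
        // prints "ca00bab0"
        System.out.println(Integer.toHexString(expand(0x000CABAB, 0xFF00FFF0)));
    }
}
```

A loop like this costs up to 32 iterations per call, which gives some intuition for why a hardware-backed intrinsic can show a large speedup such as the one reported in the summary above.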
------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From kvn at openjdk.java.net Fri Jun 3 23:53:38 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 3 Jun 2022 23:53:38 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v8] In-Reply-To: <6ATVagAVNCJuGhnwJ3YTLkl7gs3VyZLzpSzq6pc4TfM=.ba3eaa33-429e-4fb3-b12f-1565bb626b75@github.com> References: <6ATVagAVNCJuGhnwJ3YTLkl7gs3VyZLzpSzq6pc4TfM=.ba3eaa33-429e-4fb3-b12f-1565bb626b75@github.com> Message-ID: On Fri, 3 Jun 2022 18:17:34 GMT, Jatin Bhateja wrote: >> Marked as reviewed by psandoz (Reviewer). > > Hi @PaulSandoz , @vnkozlov ; Can you kindly run this through Oracle test framework. All comments are addressed. @jatin-bhateja you will need to merge again. Patching failed for assembler_x86.cpp due to recent (3 hours ago) changes. I manually fixed the patch to submit testing. ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From jiefu at openjdk.java.net Sat Jun 4 01:06:32 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Sat, 4 Jun 2022 01:06:32 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: <1XRNVIUQjE2jEYRR766gwn2TFc3SXGH6H_XiORuCywk=.b518ba0e-632c-42e0-a0a9-4779221b50da@github.com> Message-ID: On Fri, 3 Jun 2022 17:42:24 GMT, Vladimir Kozlov wrote: > I don't see these changes passed pre-submit GitHub testing. Make sure they passed. I don't know why the pre-submit GitHub testing didn't run in this PR. So I create a draft PR for this patch to run the pre-submit testing: https://github.com/openjdk/jdk/pull/9026 . 
------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From jiefu at openjdk.java.net Sat Jun 4 01:06:34 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Sat, 4 Jun 2022 01:06:34 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: <1XRNVIUQjE2jEYRR766gwn2TFc3SXGH6H_XiORuCywk=.b518ba0e-632c-42e0-a0a9-4779221b50da@github.com> Message-ID: <45dJRZ3oJI-HA9BSUfr1BQaOVkMbq9iLpVqnmKfBCRk=.7105e9cd-05d0-481e-921f-09344374a915@github.com> On Fri, 3 Jun 2022 10:02:48 GMT, Fei Gao wrote: >> Two questions @fg1417 : >> 1. Is `RShiftVB` generated with `-XX:UseAVX=0 -XX:UseSSE=2` on x86? >> 2. If the answer is no for question 1, why `TestVectorizeURShiftSubword.java` get passed? >> Thanks. > >> Two questions @fg1417 : >> >> 1. Is `RShiftVB` generated with `-XX:UseAVX=0 -XX:UseSSE=2` on x86? >> 2. If the answer is no for question 1, why `TestVectorizeURShiftSubword.java` get passed? >> Thanks. > > @DamonFool , no `RShiftB` nodes are generated with `-XX:UseAVX=0 -XX:UseSSE=2` on x86. The test passed because when adding VM options `-XX:UseAVX=0 -XX:UseSSE=2` externally, the IR framework won't do the IR check and just run the java code, I verified and made the guess. But I'm not clear whether all external VM options would stop the IR framework to do the IR check. > As @fg1417 pointed, IR framework will not do any IR verification testing (it does not run test) when some flags are passed to test and not whitelisted (`UseAVX` and `UseSSE` are not on list): > https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/TestFramework.java#L105 Ah, I see. Thanks @vnkozlov for your explanation. 
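As a side note on the transformation quoted in this thread, the scalar equivalence it relies on can be checked exhaustively in plain Java: for a `short` operand and a constant shift amount of at most 16, narrowing the result of `>>>` gives the same value as narrowing the result of `>>`. The sketch below is only an illustration of that claim; the class and method names are hypothetical and nothing here is taken from the patch or its test.

```java
// Illustrative check only; names are hypothetical, not from the patch.
final class UnsignedShiftSketch {

    static short viaUnsignedShift(short s, int n) {
        return (short) (s >>> n); // s is sign-extended to int before shifting
    }

    static short viaSignedShift(short s, int n) {
        return (short) (s >> n);
    }

    // True if both forms agree for every possible short value at shift amount n.
    static boolean agreeForAllShorts(int n) {
        for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++) {
            short s = (short) v;
            if (viaUnsignedShift(s, n) != viaSignedShift(s, n)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("3 " + agreeForAllShorts(3));   // prints "3 true"
        System.out.println("16 " + agreeForAllShorts(16)); // prints "16 true"
        System.out.println("17 " + agreeForAllShorts(17)); // prints "17 false"
    }
}
```

The two forms diverge at 17 because the narrowed result then includes a bit that was shifted in from above bit 31: zero for `>>>`, the sign bit for `>>`. That matches the "not greater than the number of sign extended bits" condition in the PR description.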
------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From dlong at openjdk.java.net Sat Jun 4 02:05:39 2022 From: dlong at openjdk.java.net (Dean Long) Date: Sat, 4 Jun 2022 02:05:39 GMT Subject: RFR: 8263377: Store method handle linkers in the 'non-nmethods' heap In-Reply-To: References: Message-ID: <9C9P4NROYxVuWTJejFnYwQOGPovUstzWACIboIQWTDw=.2977b16b-c175-4774-97af-60071b805f46@github.com> On Tue, 17 May 2022 23:19:54 GMT, Yi-Fan Tsai wrote: > 8263377: Store method handle linkers in the 'non-nmethods' heap src/hotspot/share/ci/ciMethod.cpp line 1146: > 1144: CodeBlob* code = get_Method()->code(); > 1145: if (code != NULL && code->is_compiled()) { > 1146: code->as_compiled_method()->log_identity(log); Doesn't this change the log output? ------------- PR: https://git.openjdk.java.net/jdk/pull/8760 From kvn at openjdk.java.net Sat Jun 4 05:18:41 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sat, 4 Jun 2022 05:18:41 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v9] In-Reply-To: References: Message-ID: On Fri, 3 Jun 2022 09:38:04 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> - Patch intrinsifies following newly added Java SE APIs >> - Integer.compress >> - Integer.expand >> - Long.compress >> - Long.expand >> >> - Adds C2 IR nodes and corresponding ideal transformations for new operations. >> - We see around ~10x performance speedup due to intrinsification over X86 target. >> - Adds an IR framework based test to validate newly introduced IR transformations. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: > > - 8283894: Adding gtest based constant folding test case. 
> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - 8283894: Extending new IR value routines with value propagation logic. > - 8283894: Disabling sanity test as per review suggestion. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - 8283894: Removing CompressExpandSanityTest from problem list. > - 8283894: Updating test tag spec. > - 8283894: Review comments resolved. > - 8283894: Add missing -XX:+UnlockDiagnosticVMOptions. > - ... and 4 more: https://git.openjdk.java.net/jdk/compare/407abf5d...72895639 In tier1 new test test_compress_expand_bits.cpp failed. I added comment to RFE. ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From jiefu at openjdk.java.net Sat Jun 4 05:20:32 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Sat, 4 Jun 2022 05:20:32 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: <45dJRZ3oJI-HA9BSUfr1BQaOVkMbq9iLpVqnmKfBCRk=.7105e9cd-05d0-481e-921f-09344374a915@github.com> References: <1XRNVIUQjE2jEYRR766gwn2TFc3SXGH6H_XiORuCywk=.b518ba0e-632c-42e0-a0a9-4779221b50da@github.com> <45dJRZ3oJI-HA9BSUfr1BQaOVkMbq9iLpVqnmKfBCRk=.7105e9cd-05d0-481e-921f-09344374a915@github.com> Message-ID: On Sat, 4 Jun 2022 01:02:30 GMT, Jie Fu wrote: >>> Two questions @fg1417 : >>> >>> 1. Is `RShiftVB` generated with `-XX:UseAVX=0 -XX:UseSSE=2` on x86? >>> 2. If the answer is no for question 1, why `TestVectorizeURShiftSubword.java` get passed? >>> Thanks. >> >> @DamonFool , no `RShiftB` nodes are generated with `-XX:UseAVX=0 -XX:UseSSE=2` on x86. The test passed because when adding VM options `-XX:UseAVX=0 -XX:UseSSE=2` externally, the IR framework won't do the IR check and just run the java code, I verified and made the guess. But I'm not clear whether all external VM options would stop the IR framework to do the IR check. 
> >> As @fg1417 pointed, IR framework will not do any IR verification testing (it does not run test) when some flags are passed to test and not whitelisted (`UseAVX` and `UseSSE` are not on list): >> https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/TestFramework.java#L105 > > Ah, I see. > Thanks @vnkozlov for your explanation. > > I don't see these changes passed pre-submit GitHub testing. Make sure they passed. > > I don't know why the pre-submit GitHub testing didn't run in this PR. So I create a draft PR for this patch to run the pre-submit testing: #9026 . The pre-submit testing finished without regression. @vnkozlov @fg1417 Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From shade at openjdk.java.net Sat Jun 4 05:52:24 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Sat, 4 Jun 2022 05:52:24 GMT Subject: RFR: 8287205: generate_cont_thaw generates dead code after jump to exception handler [v2] In-Reply-To: References: Message-ID: <-qrcsbvKZutU1iJKMk6OjVzV_-z16X3U5MaUX6y5Jhw=.9776ca28-e680-4b74-a049-0e92ea242f9e@github.com> On Fri, 3 Jun 2022 20:34:03 GMT, Richard Reingruber wrote: > Thank's for the reviews Aleksey and Vladimir. I've got 2 weeks off now. Will integrate when back. In two weeks, we are going to be in RDP1, so this issue might not be integratable then. I think this issue is simple enough to push now, and the risk for it is small. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8863 From rrich at openjdk.java.net Sat Jun 4 13:37:25 2022 From: rrich at openjdk.java.net (Richard Reingruber) Date: Sat, 4 Jun 2022 13:37:25 GMT Subject: RFR: 8287205: generate_cont_thaw generates dead code after jump to exception handler [v2] In-Reply-To: <-qrcsbvKZutU1iJKMk6OjVzV_-z16X3U5MaUX6y5Jhw=.9776ca28-e680-4b74-a049-0e92ea242f9e@github.com> References: <-qrcsbvKZutU1iJKMk6OjVzV_-z16X3U5MaUX6y5Jhw=.9776ca28-e680-4b74-a049-0e92ea242f9e@github.com> Message-ID: <587IXzqFThelQpPNx4qb3BhvkWrt4LwFAfoZKr7gXFc=.02317788-f81d-422b-b3fd-89663e07acee@github.com> On Sat, 4 Jun 2022 05:49:19 GMT, Aleksey Shipilev wrote: > > Thank's for the reviews Aleksey and Vladimir. I've got 2 weeks off now. Will integrate when back. > > > > In two weeks, we are going to be in RDP1, so this issue might not be integratable then. I think this issue is simple enough to push now, and the risk for it is small. I agree, the risk is small (as the gain is). But maybe it blocks your changes? Do you want me to push now? You could also include this small enhancement in your work and I'd close this PR. ------------- PR: https://git.openjdk.java.net/jdk/pull/8863 From kvn at openjdk.java.net Sat Jun 4 16:12:39 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sat, 4 Jun 2022 16:12:39 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 11:09:09 GMT, Fei Gao wrote: >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> >> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. 
It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) >> >> when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: >> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) >> >> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: >> >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ? 
3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Rewrite the scalar calculation to avoid inline > > Change-Id: I5959d035278097de26ab3dfe6f667d6f7476c723 > - Merge branch 'master' into fg8283307 > > Change-Id: Id3ec8594da49fb4e6c6dcad888bcb1dfc0aac303 > - Remove related comments in some test files > > Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8 > - Merge branch 'master' into fg8283307 > > Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a > - 8283307: Vectorize unsigned shift right on signed subword types > > ``` > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > ``` > In C2's SLP, vectorization of unsigned shift right on signed > subword types (byte/short) like the case above is intentionally > disabled[1]. Because the vector unsigned shift on signed > subword types behaves differently from the Java spec. It's > worthy to vectorize more cases in quite low cost. Also, > unsigned shift right on signed subword is not uncommon and we > may find similar cases in Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > > Short: > | <- 16 bits -> | <- 16 bits -> | > | 1 1 1 ... 
1 1 | data | > > when the shift amount is a constant not greater than the number > of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be > transformed into a signed shift and hence becomes vectorizable. > Here is the transformation: > > For T_SHORT (shift <= 16): > src RShiftCntV shift src RShiftCntV shift > \ / ==> \ / > URShiftVS RShiftVS > > This patch does the transformation in SuperWord::implemented() and > SuperWord::output(). It helps vectorize the short cases above. We > can handle unsigned right shift on byte type in a similar way. The > generated assembly code for one iteration on aarch64 is like: > ``` > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > ``` > > Here is the performance data for micro-benchmark before and after > this patch on both AArch64 and x64 machines. We can observe about > ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ? 
0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161 Thank you for verifying with GHA testing. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/7979 From kvn at openjdk.java.net Sat Jun 4 16:20:25 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sat, 4 Jun 2022 16:20:25 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v10] In-Reply-To: References: Message-ID: <9o8fXgQUo5J0LvKlWkLq-xmR16XInT_xWCV8ruauD30=.4a6ad1af-ad96-4d34-aca1-4bb68cc96782@github.com> On Fri, 3 Jun 2022 20:48:30 GMT, Vladimir Kozlov wrote: > 2 tests failed so far. I put information into RFE. No other new failures in my tier1-7 testing. I think after you address found issue it will be ready to integrate (after second review by other Reviewer). But I would suggest to push it into JDK 20 after 19 is forked in one week to get more testing before release. ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From sviswanathan at openjdk.java.net Sat Jun 4 22:40:01 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Sat, 4 Jun 2022 22:40:01 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 Message-ID: Currently the C2 JIT only supports float -> int and double -> long conversion for x86. This PR adds the support for following conversions in the c2 JIT: float -> long, short, byte double -> int, short, byte The performance gain is as follows. Before the patch: Benchmark Mode Cnt Score Error Units VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 32367.971 ? 6161.118 ops/ms VectorFPtoIntCastOperations.microDouble2Int thrpt 3 25825.251 ? 5417.104 ops/ms VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59641.958 ? 
17307.177 ops/ms VectorFPtoIntCastOperations.microDouble2Short thrpt 3 29641.505 ? 12023.015 ops/ms VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 16271.224 ? 1523.083 ops/ms VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59199.994 ? 14357.959 ops/ms VectorFPtoIntCastOperations.microFloat2Long thrpt 3 17169.197 ? 1738.273 ops/ms VectorFPtoIntCastOperations.microFloat2Short thrpt 3 14934.139 ? 2329.253 ops/ms After the patch: Benchmark Mode Cnt Score Error Units VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 115436.659 ? 21282.364 ops/ms VectorFPtoIntCastOperations.microDouble2Int thrpt 3 87194.395 ? 9443.106 ops/ms VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59652.356 ? 7240.721 ops/ms VectorFPtoIntCastOperations.microDouble2Short thrpt 3 110570.719 ? 10401.620 ops/ms VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ? 11113.137 ops/ms VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ? 18272.495 ops/ms VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ? 7249.268 ops/ms VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ? 8253.232 ops/ms Please review. 
Best Regards, Sandhya ------------- Commit messages: - fix condition - 8287835: Add support for float/double to integral conversion for x86 Changes: https://git.openjdk.java.net/jdk/pull/9032/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9032&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287835 Stats: 252 lines in 6 files changed: 235 ins; 0 del; 17 mod Patch: https://git.openjdk.java.net/jdk/pull/9032.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9032/head:pull/9032 PR: https://git.openjdk.java.net/jdk/pull/9032 From kvn at openjdk.java.net Sun Jun 5 01:48:32 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sun, 5 Jun 2022 01:48:32 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 In-Reply-To: References: Message-ID: On Sat, 4 Jun 2022 22:13:32 GMT, Sandhya Viswanathan wrote: > Currently the C2 JIT only supports float -> int and double -> long conversion for x86. > This PR adds the support for following conversions in the c2 JIT: > float -> long, short, byte > double -> int, short, byte > > The performance gain is as follows. > Before the patch: > Benchmark Mode Cnt Score Error Units > VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 32367.971 ? 6161.118 ops/ms > VectorFPtoIntCastOperations.microDouble2Int thrpt 3 25825.251 ? 5417.104 ops/ms > VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59641.958 ? 17307.177 ops/ms > VectorFPtoIntCastOperations.microDouble2Short thrpt 3 29641.505 ? 12023.015 ops/ms > VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 16271.224 ? 1523.083 ops/ms > VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59199.994 ? 14357.959 ops/ms > VectorFPtoIntCastOperations.microFloat2Long thrpt 3 17169.197 ? 1738.273 ops/ms > VectorFPtoIntCastOperations.microFloat2Short thrpt 3 14934.139 ? 2329.253 ops/ms > > After the patch: > Benchmark Mode Cnt Score Error Units > VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 115436.659 ? 
21282.364 ops/ms > VectorFPtoIntCastOperations.microDouble2Int thrpt 3 87194.395 ? 9443.106 ops/ms > VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59652.356 ? 7240.721 ops/ms > VectorFPtoIntCastOperations.microDouble2Short thrpt 3 110570.719 ? 10401.620 ops/ms > VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ? 11113.137 ops/ms > VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ? 18272.495 ops/ms > VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ? 7249.268 ops/ms > VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ? 8253.232 ops/ms > > Please review. > > Best Regards, > Sandhya I assume it is support for "vector conversion". Please, add IR framework test. src/hotspot/cpu/x86/x86.ad line 1877: > 1875: if (is_integral_type(bt) && !VM_Version::supports_avx512dq()) { > 1876: return false; > 1877: } Overlapping conditions for the same types are confusing. src/hotspot/cpu/x86/x86.ad line 1889: > 1887: return false; > 1888: } > 1889: if ((bt == T_LONG) && !VM_Version::supports_avx512dq()) { Again overlapping conditions. So T_LONG requires both: AVX512, avx512vl and avx512dq? What about T_INT? src/hotspot/cpu/x86/x86.ad line 7298: > 7296: predicate(((VM_Version::supports_avx512vl() || > 7297: Matcher::vector_length_in_bytes(n) == 64)) && > 7298: is_integral_type(Matcher::vector_element_basic_type(n))); Do we need some of these conditions since you have them already in `match_rule_supported_vector()`? ------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From fgao at openjdk.java.net Sun Jun 5 13:05:36 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Sun, 5 Jun 2022 13:05:36 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: Message-ID: On Sat, 4 Jun 2022 16:09:23 GMT, Vladimir Kozlov wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: >> >> - Rewrite the scalar calculation to avoid inline >> >> Change-Id: I5959d035278097de26ab3dfe6f667d6f7476c723 >> - Merge branch 'master' into fg8283307 >> >> Change-Id: Id3ec8594da49fb4e6c6dcad888bcb1dfc0aac303 >> - Remove related comments in some test files >> >> Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8 >> - Merge branch 'master' into fg8283307 >> >> Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a >> - 8283307: Vectorize unsigned shift right on signed subword types >> >> ``` >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> ``` >> In C2's SLP, vectorization of unsigned shift right on signed >> subword types (byte/short) like the case above is intentionally >> disabled[1]. Because the vector unsigned shift on signed >> subword types behaves differently from the Java spec. It's >> worthy to vectorize more cases in quite low cost. Also, >> unsigned shift right on signed subword is not uncommon and we >> may find similar cases in Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> >> Short: >> | <- 16 bits -> | <- 16 bits -> | >> | 1 1 1 ... 1 1 | data | >> >> when the shift amount is a constant not greater than the number >> of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be >> transformed into a signed shift and hence becomes vectorizable. >> Here is the transformation: >> >> For T_SHORT (shift <= 16): >> src RShiftCntV shift src RShiftCntV shift >> \ / ==> \ / >> URShiftVS RShiftVS >> >> This patch does the transformation in SuperWord::implemented() and >> SuperWord::output(). It helps vectorize the short cases above. 
We >> can handle unsigned right shift on byte type in a similar way. The >> generated assembly code for one iteration on aarch64 is like: >> ``` >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> ``` >> >> Here is the performance data for micro-benchmark before and after >> this patch on both AArch64 and x64 machines. We can observe about >> ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ >> >> Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161 > > Thank you for verifying with GHA testing. Thanks for all your kind review, @vnkozlov @pfustc @sviswa7 . 
------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From fgao at openjdk.java.net Sun Jun 5 13:05:38 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Sun, 5 Jun 2022 13:05:38 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: <1XRNVIUQjE2jEYRR766gwn2TFc3SXGH6H_XiORuCywk=.b518ba0e-632c-42e0-a0a9-4779221b50da@github.com> <45dJRZ3oJI-HA9BSUfr1BQaOVkMbq9iLpVqnmKfBCRk=.7105e9cd-05d0-481e-921f-09344374a915@github.com> Message-ID: On Sat, 4 Jun 2022 05:16:54 GMT, Jie Fu wrote: >>> As @fg1417 pointed out, the IR framework will not do any IR verification testing (it does not run the test) when flags that are not whitelisted are passed to the test (`UseAVX` and `UseSSE` are not on the list): >>> https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/TestFramework.java#L105 >> >> Ah, I see. >> Thanks @vnkozlov for your explanation. > >> > I don't see that these changes passed pre-submit GitHub testing. Make sure they passed. >> >> I don't know why the pre-submit GitHub testing didn't run in this PR. So I created a draft PR for this patch to run the pre-submit testing: #9026. > > The pre-submit testing finished without regression. @vnkozlov @fg1417 > Thanks. Thanks for your help on GHA testing @DamonFool.
------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From shade at openjdk.java.net Sun Jun 5 13:41:24 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Sun, 5 Jun 2022 13:41:24 GMT Subject: RFR: 8287205: generate_cont_thaw generates dead code after jump to exception handler [v2] In-Reply-To: <587IXzqFThelQpPNx4qb3BhvkWrt4LwFAfoZKr7gXFc=.02317788-f81d-422b-b3fd-89663e07acee@github.com> References: <-qrcsbvKZutU1iJKMk6OjVzV_-z16X3U5MaUX6y5Jhw=.9776ca28-e680-4b74-a049-0e92ea242f9e@github.com> <587IXzqFThelQpPNx4qb3BhvkWrt4LwFAfoZKr7gXFc=.02317788-f81d-422b-b3fd-89663e07acee@github.com> Message-ID: On Sat, 4 Jun 2022 13:33:57 GMT, Richard Reingruber wrote: > > > Thanks for the reviews Aleksey and Vladimir. I've got 2 weeks off now. Will integrate when back. > > > > > > In two weeks, we are going to be in RDP1, so this issue might not be integratable then. I think this issue is simple enough to push now, and the risk for it is small. > > I agree, the risk is small (as is the gain). But maybe it blocks your changes? Do you want me to push now? You could also include this small enhancement in your work and I'd close this PR. Just push it now. I will handle the fallout, if any, while you are away. Other arches are likely to use x86_64 code as the template, so minor differences would accumulate if we delay this PR. ------------- PR: https://git.openjdk.java.net/jdk/pull/8863 From jbhateja at openjdk.java.net Sun Jun 5 14:39:18 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Sun, 5 Jun 2022 14:39:18 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v10] In-Reply-To: References: Message-ID: <8mSOwbZG68y-QN6Q1MIAy7ejUHZwpX9fqeZiH3DM0B0=.d5f32abb-92cc-4b9b-9890-2cf40753f1e1@github.com> > Summary of changes: > > - Patch intrinsifies the following newly added Java SE APIs > - Integer.compress > - Integer.expand > - Long.compress > - Long.expand > > - Adds C2 IR nodes and corresponding ideal transformations for new operations.
> - We see around ~10x performance speedup due to intrinsification over X86 target. > - Adds an IR framework based test to validate newly introduced IR transformations. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits: - 8283894: Updating literal suffixes for gtest failure on Windows. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 - 8283894: Adding gtest based constant folding test case. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 - 8283894: Extending new IR value routines with value propagation logic. - 8283894: Disabling sanity test as per review suggestion. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 - 8283894: Removing CompressExpandSanityTest from problem list. - 8283894: Updating test tag spec. - ... and 6 more: https://git.openjdk.java.net/jdk/compare/308c068b...1ad86e01 ------------- Changes: https://git.openjdk.java.net/jdk/pull/8498/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=09 Stats: 1133 lines in 20 files changed: 1103 ins; 18 del; 12 mod Patch: https://git.openjdk.java.net/jdk/pull/8498.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8498/head:pull/8498 PR: https://git.openjdk.java.net/jdk/pull/8498 From jbhateja at openjdk.java.net Sun Jun 5 14:39:19 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Sun, 5 Jun 2022 14:39:19 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v9] In-Reply-To: References: Message-ID: On Sat, 4 Jun 2022 05:14:41 GMT, Vladimir Kozlov wrote: > In tier1 new test test_compress_expand_bits.cpp failed. I added comment to RFE. Hi @vnkozlov , Its fixed now, kindly re-run though you test framework to verify. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From kvn at openjdk.java.net Sun Jun 5 16:52:38 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sun, 5 Jun 2022 16:52:38 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v10] In-Reply-To: <8mSOwbZG68y-QN6Q1MIAy7ejUHZwpX9fqeZiH3DM0B0=.d5f32abb-92cc-4b9b-9890-2cf40753f1e1@github.com> References: <8mSOwbZG68y-QN6Q1MIAy7ejUHZwpX9fqeZiH3DM0B0=.d5f32abb-92cc-4b9b-9890-2cf40753f1e1@github.com> Message-ID: On Sun, 5 Jun 2022 14:39:18 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> - Patch intrinsifies following newly added Java SE APIs >> - Integer.compress >> - Integer.expand >> - Long.compress >> - Long.expand >> >> - Adds C2 IR nodes and corresponding ideal transformations for new operations. >> - We see around ~10x performance speedup due to intrinsification over X86 target. >> - Adds an IR framework based test to validate newly introduced IR transformations. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits: > > - 8283894: Updating literal suffixes for gtest failure on Windows. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - 8283894: Adding gtest based constant folding test case. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - 8283894: Extending new IR value routines with value propagation logic. > - 8283894: Disabling sanity test as per review suggestion. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - 8283894: Removing CompressExpandSanityTest from problem list. > - 8283894: Updating test tag spec. > - ... and 6 more: https://git.openjdk.java.net/jdk/compare/308c068b...1ad86e01 Marked as reviewed by kvn (Reviewer). 
Tier1 testing passed clean. ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From rrich at openjdk.java.net Sun Jun 5 19:32:53 2022 From: rrich at openjdk.java.net (Richard Reingruber) Date: Sun, 5 Jun 2022 19:32:53 GMT Subject: RFR: 8287205: generate_cont_thaw generates dead code after jump to exception handler [v2] In-Reply-To: References: Message-ID: On Fri, 3 Jun 2022 07:57:22 GMT, Richard Reingruber wrote: >> This fix avoids generating unreachable instructions after the jump to the exception handler in `generate_cont_thaw()`. >> >> Testing: >> >> jtreg:test/hotspot/jtreg:hotspot_loom >> jtreg:test/jdk:jdk_loom >> >> On Linux x86_64 and aarch64. >> >> On aarch64 the test `jdk/java/lang/management/ThreadMXBean/VirtualThreadDeadlocks.java` had a timeout. The aarch64 machine I used is very slow. This might have caused the timeout. > > Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into 8287205_remove_dead_code_from_generate_cont_thaw > - Merge branch 'master' into 8287205_remove_dead_code_from_generate_cont_thaw > - Remove dead code from generate_cont_thaw Sure. Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/8863 From rrich at openjdk.java.net Sun Jun 5 19:32:54 2022 From: rrich at openjdk.java.net (Richard Reingruber) Date: Sun, 5 Jun 2022 19:32:54 GMT Subject: Integrated: 8287205: generate_cont_thaw generates dead code after jump to exception handler In-Reply-To: References: Message-ID: On Tue, 24 May 2022 07:33:02 GMT, Richard Reingruber wrote: > This fix avoids generating unreachable instructions after the jump to the exception handler in `generate_cont_thaw()`. > > Testing: > > jtreg:test/hotspot/jtreg:hotspot_loom > jtreg:test/jdk:jdk_loom > > On Linux x86_64 and aarch64.
> > On aarch64 the test `jdk/java/lang/management/ThreadMXBean/VirtualThreadDeadlocks.java` had a timeout. The aarch64 machine I used is very slow. This might have caused the timeout. This pull request has now been integrated. Changeset: ebc012ec Author: Richard Reingruber URL: https://git.openjdk.java.net/jdk/commit/ebc012ece28ea731c4756cab2374ebecfa5ac1a3 Stats: 16 lines in 2 files changed: 8 ins; 8 del; 0 mod 8287205: generate_cont_thaw generates dead code after jump to exception handler Reviewed-by: shade, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8863 From xliu at openjdk.java.net Sun Jun 5 22:59:13 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Sun, 5 Jun 2022 22:59:13 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v11] In-Reply-To: References: Message-ID: <9LEaNeG2c7dOaFkKn63VjFWt9N_T0wD90hUNt7e3M2E=.734048e7-03b1-4c53-a75c-db8bdc947656@github.com> > I found that some phi nodes are useful only because their uses are uncommon_trap nodes. In the worst case, this would hinder the elimination of boxing objects and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows the C2 parser to collect liveness based on the next bci for unstable_if traps. In most cases, the next bci is closer to exits, so fewer locals are live. It helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap (JDK-8261137). The two patches are orthogonal. I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of the credit goes to the original author. I found that scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in the original test even if it can't be removed by the later inliner. I tweaked the profiling data of Integer.valueOf() to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137.
I ran the microbench and got a 9x speedup on x86_64. It's almost the same as JDK-8261137. > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op > ``` > > Testing > I ran tier1 tests with and without `-XX:+DeoptimizeALot`. No regression has been found yet. Xin Liu has updated the pull request incrementally with one additional commit since the last revision: Bail out if fold-compares sees that a unstable_if trap has modified. Also add a regression test ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8545/files - new: https://git.openjdk.java.net/jdk/pull/8545/files/4130cd10..9c917371 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=10 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=09-10 Stats: 117 lines in 5 files changed: 105 ins; 1 del; 11 mod Patch: https://git.openjdk.java.net/jdk/pull/8545.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8545/head:pull/8545 PR: https://git.openjdk.java.net/jdk/pull/8545 From jbhateja at openjdk.java.net Mon Jun 6 00:41:45 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 6 Jun 2022 00:41:45 GMT Subject: Integrated: 8283894: Intrinsify compress and expand bits on x86 In-Reply-To: References: Message-ID: On Mon, 2 May 2022 08:19:53 GMT, Jatin Bhateja wrote: > Summary of changes: > > - Patch intrinsifies the following newly added Java SE APIs > - Integer.compress > - Integer.expand > - Long.compress > - Long.expand > > - Adds C2 IR nodes and corresponding ideal transformations for new operations. > - We see around 10x performance speedup due to intrinsification on the x86 target. > - Adds an IR framework based test to validate newly introduced IR transformations. > > Kindly review and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated.
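For readers unfamiliar with the bit compress and expand operations named in the summary above, the following is a minimal scalar sketch of their semantics. It is not the intrinsic and not the JDK implementation; the class name `CompressExpandSketch` and the helpers `compressBits` and `expandBits` are made up for illustration, and plain loops are used so the sketch also runs on JDKs that predate `Integer.compress` and `Integer.expand`.

```java
public class CompressExpandSketch {

    // Gathers the bits of 'value' selected by 'mask' and packs them
    // contiguously into the low-order bits of the result.
    static int compressBits(int value, int mask) {
        int result = 0;
        int outBit = 0; // next free position in the packed result
        for (int bit = 0; bit < 32; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((value >>> bit) & 1) << outBit;
                outBit++;
            }
        }
        return result;
    }

    // Inverse direction: scatters the low-order bits of 'value' to the
    // positions selected by 'mask'; unselected positions stay zero.
    static int expandBits(int value, int mask) {
        int result = 0;
        int inBit = 0; // next low-order bit of 'value' to consume
        for (int bit = 0; bit < 32; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((value >>> inBit) & 1) << bit;
                inBit++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int mask = 0xFF00FFF0;
        int packed = compressBits(0xCAFEBABE, mask);
        System.out.println(Integer.toHexString(packed));                   // cabab
        System.out.println(Integer.toHexString(expandBits(packed, mask))); // ca00bab0
    }
}
```

On a JDK that has the new APIs, `Integer.compress(0xCAFEBABE, 0xFF00FFF0)` should agree with `compressBits` for the same arguments; the loop version only serves to make the bit movement explicit.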
Changeset: f347ff99 Author: Jatin Bhateja URL: https://git.openjdk.java.net/jdk/commit/f347ff9986afbc578aca8784be658d3629904786 Stats: 1133 lines in 20 files changed: 1103 ins; 18 del; 12 mod 8283894: Intrinsify compress and expand bits on x86 Reviewed-by: psandoz, sviswanathan, jrose, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From xliu at openjdk.java.net Mon Jun 6 01:23:38 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Mon, 6 Jun 2022 01:23:38 GMT Subject: RFR: 8287840: Dead copy region node blocks IfNode's fold-compares Message-ID: 8287840: Dead copy region node blocks IfNode's fold-compares ------------- Commit messages: - Update copyright year. - 8287840: Dead copy region node blocks IfNode's fold-compares Changes: https://git.openjdk.java.net/jdk/pull/9035/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9035&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287840 Stats: 4 lines in 1 file changed: 3 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/9035.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9035/head:pull/9035 PR: https://git.openjdk.java.net/jdk/pull/9035 From xgong at openjdk.java.net Mon Jun 6 01:38:29 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Mon, 6 Jun 2022 01:38:29 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: <7BACvqeUZFJbVq36mElnVBWg2vXyN6kVUXYNKvJ7cuA=.a04e6924-006b-43f3-adec-97132d5a719d@github.com> References: <_c_QPZQIL-ZxBs9TaKmrh7_1WcbEDH1pUwhTpOc6PD8=.75e4a61b-ebb6-491c-9c5b-9a035f0b9eaf@github.com> <7BACvqeUZFJbVq36mElnVBWg2vXyN6kVUXYNKvJ7cuA=.a04e6924-006b-43f3-adec-97132d5a719d@github.com> Message-ID: On Thu, 2 Jun 2022 03:24:07 GMT, Xiaohong Gong wrote: >>> @XiaohongGong Could you please rebase the branch and resolve conflicts? >> >> Sure, I'm working on this now. The patch will be updated soon. Thanks. 
> >> > @XiaohongGong Could you please rebase the branch and resolve conflicts? >> >> Sure, I'm working on this now. The patch will be updated soon. Thanks. > > Resolved the conflicts. Thanks! > @XiaohongGong You need one more review approval. Sure! Thanks a lot for the review! ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From xgong at openjdk.java.net Mon Jun 6 01:38:31 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Mon, 6 Jun 2022 01:38:31 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: Message-ID: On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Rename "use_predicate" to "needs_predicate" > > Yes, the tests were run in debug mode. The reporting of the missing constant occurs for the compiled method that is called from the method where the constants are declared e.g.: > > 719 240 b jdk.incubator.vector.Int256Vector::fromArray0 (15 bytes) > ** Rejected vector op (LoadVectorMasked,int,8) because architecture does not support it > ** missing constant: offsetInRange=Parm > @ 11 jdk.incubator.vector.IntVector::fromArray0Template (22 bytes) force inline by annotation > > > So it appears to be working as expected. A similar pattern occurs at a lower-level for the passing of the mask class. `Int256Vector::fromArray0` passes a constant class to `IntVector::fromArray0Template` (the compilation of which bails out before checking that the `offsetInRange` is constant). Hi @PaulSandoz , could you please take a look at this PR again? Thanks so much! 
------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From fgao at openjdk.java.net Mon Jun 6 02:06:39 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Mon, 6 Jun 2022 02:06:39 GMT Subject: Integrated: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: <8X54E6tvA8atXXxzPzfnaRmDGkWDFF2m6-55asR5nz4=.3e89b180-845a-4343-8af6-a4463df10f7d@github.com> On Mon, 28 Mar 2022 02:11:55 GMT, Fei Gao wrote: > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > > In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) > > when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: > ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) > > This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: > > ... 
>     sbfiz x13, x10, #1, #32
>     add x15, x11, x13
>     ldr q16, [x15, #16]
>     sshr v16.8h, v16.8h, #3
>     add x13, x17, x13
>     str q16, [x13, #16]
>     ...
>
>
> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch.
>
> The perf data on AArch64:
> Before the patch:
> Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
> urShiftImmByte     1024             3  avgt    5  295.711 ± 0.117  ns/op
> urShiftImmShort    1024             3  avgt    5  284.559 ± 0.148  ns/op
>
> after the patch:
> Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
> urShiftImmByte     1024             3  avgt    5   45.111 ± 0.047  ns/op
> urShiftImmShort    1024             3  avgt    5   55.294 ± 0.072  ns/op
>
> The perf data on X86:
> Before the patch:
> Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
> urShiftImmByte     1024             3  avgt    5  361.374 ± 4.621  ns/op
> urShiftImmShort    1024             3  avgt    5  365.390 ± 3.595  ns/op
>
> After the patch:
> Benchmark        (SIZE)  (shiftCount)  Mode  Cnt    Score   Error  Units
> urShiftImmByte     1024             3  avgt    5  105.489 ± 0.488  ns/op
> urShiftImmShort    1024             3  avgt    5   43.400 ± 0.394  ns/op
>
> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
> [2] https://github.com/jpountz/decode-128-ints-benchmark/

This pull request has now been integrated.
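
The scalar equivalence that the quoted description relies on can be checked exhaustively in plain Java. The sketch below is not taken from the patch; the class and method names are made up for illustration. It only assumes the Java rule that a `short` is sign-extended to `int` before shifting, and compares the narrowed results of `>>>` and `>>` for every `short` value.

```java
public class UnsignedShiftOnShortCheck {
    // Hypothetical helper: true when (short)(s >>> shift) equals (short)(s >> shift)
    // for every possible short value s.
    static boolean sameForAllShorts(int shift) {
        for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++) {
            short s = (short) v;
            // s is promoted to a sign-extended int before either shift is applied.
            short viaUnsigned = (short) (s >>> shift);
            short viaSigned = (short) (s >> shift);
            if (viaUnsigned != viaSigned) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Shift amounts up to 16 only move sign-extension bits into the low 16 bits,
        // so both shifts agree after narrowing; at 17 a zero bit reaches bit 15.
        for (int shift = 0; shift <= 17; shift++) {
            System.out.println("shift " + shift + ": " + sameForAllShorts(shift));
        }
    }
}
```

Under these assumptions the check holds for shift amounts 0 through 16 and fails at 17, which matches the "not greater than the number of sign extended bits" condition in the description.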
Changeset: 24fe8ad7 Author: Fei Gao Committer: Pengfei Li URL: https://git.openjdk.java.net/jdk/commit/24fe8ad74cc481d18bed6896ca54a8d91c651d4a Stats: 217 lines in 9 files changed: 211 ins; 6 del; 0 mod 8283307: Vectorize unsigned shift right on signed subword types Reviewed-by: jiefu, pli, sviswanathan, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From xgong at openjdk.java.net Mon Jun 6 09:48:08 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Mon, 6 Jun 2022 09:48:08 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE Message-ID: VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. Here is an example for vector load and add reduction inside a loop: ptrue p0.s, vl8 ; mask generation ld1w {z16.s}, p0/z, [x14] ; load vector ptrue p0.s, vl8 ; mask generation uaddv d17, p0, z16.s ; add reduction smov x14, v17.s[0] As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. 
The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop.

Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out.

Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system:

  Benchmark                  size   Gain
  Byte256Vector.ADDLanes     1024   0.999
  Byte256Vector.ANDLanes     1024   1.065
  Byte256Vector.MAXLanes     1024   1.064
  Byte256Vector.MINLanes     1024   1.062
  Byte256Vector.ORLanes      1024   1.072
  Byte256Vector.XORLanes     1024   1.041
  Short256Vector.ADDLanes    1024   1.017
  Short256Vector.ANDLanes    1024   1.044
  Short256Vector.MAXLanes    1024   1.049
  Short256Vector.MINLanes    1024   1.049
  Short256Vector.ORLanes     1024   1.089
  Short256Vector.XORLanes    1024   1.047
  Int256Vector.ADDLanes      1024   1.045
  Int256Vector.ANDLanes      1024   1.078
  Int256Vector.MAXLanes      1024   1.123
  Int256Vector.MINLanes      1024   1.129
  Int256Vector.ORLanes       1024   1.078
  Int256Vector.XORLanes      1024   1.072
  Long256Vector.ADDLanes     1024   1.059
  Long256Vector.ANDLanes     1024   1.101
  Long256Vector.MAXLanes     1024   1.079
  Long256Vector.MINLanes     1024   1.099
  Long256Vector.ORLanes      1024   1.098
  Long256Vector.XORLanes     1024   1.110
  Float256Vector.ADDLanes    1024   1.033
  Float256Vector.MAXLanes    1024   1.156
  Float256Vector.MINLanes    1024   1.151
  Double256Vector.ADDLanes   1024   1.062
  Double256Vector.MAXLanes   1024   1.145
  Double256Vector.MINLanes   1024   1.140

This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`.
So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: sxtw x14, w14 whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 ------------- Commit messages: - 8286941: Add mask IR for partial vector operations for ARM SVE Changes: https://git.openjdk.java.net/jdk/pull/9037/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9037&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286941 Stats: 2228 lines in 19 files changed: 811 ins; 920 del; 497 mod Patch: https://git.openjdk.java.net/jdk/pull/9037.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9037/head:pull/9037 PR: https://git.openjdk.java.net/jdk/pull/9037 From jbhateja at openjdk.java.net Mon Jun 6 10:44:50 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 6 Jun 2022 10:44:50 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5] In-Reply-To: References: Message-ID: On Thu, 2 Jun 2022 03:27:59 GMT, Xiaohong Gong wrote: >> Currently the vector load with mask when the given index happens out of the array boundary is implemented with pure java scalar code to avoid the IOOBE (IndexOutOfBoundaryException). This is necessary for architectures that do not support the predicate feature. Because the masked load is implemented with a full vector load and a vector blend applied on it. And a full vector load will definitely cause the IOOBE which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise exception. >> >> This patch adds the vectorization support for the masked load with IOOBE part. 
Please see the original java implementation (FIXME: optimize): >> >> >> @ForceInline >> public static >> ByteVector fromArray(VectorSpecies species, >> byte[] a, int offset, >> VectorMask m) { >> ByteSpecies vsp = (ByteSpecies) species; >> if (offset >= 0 && offset <= (a.length - species.length())) { >> return vsp.dummyVector().fromArray0(a, offset, m); >> } >> >> // FIXME: optimize >> checkMaskFromIndexSize(offset, vsp, m, 1, a.length); >> return vsp.vOp(m, i -> a[offset + i]); >> } >> >> Since it can only be vectorized with the predicate load, the hotspot must check whether the current backend supports it and falls back to the java scalar version if not. This is different from the normal masked vector load that the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler make the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part while "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that normal java path will be executed. 
>> >> Also adds the same vectorization support for masked: >> - fromByteArray/fromByteBuffer >> - fromBooleanArray >> - fromCharArray >> >> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` on the x86 AVX-512 system: >> >> Benchmark before After Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms >> >> Similar performance gain can also be observed on 512-bit SVE system. > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: > > - Merge branch 'jdk:master' into JDK-8283667 > - Use integer constant for offsetInRange all the way through > - Rename "use_predicate" to "needs_predicate" > - Rename the "usePred" to "offsetInRange" > - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java line 97: > 95: public void byteLoadArrayMaskIOOBE() { > 96: for (int i = 0; i < inSize; i += bspecies.length()) { > 97: VectorMask mask = VectorMask.fromArray(bspecies, m, i); For other case "if (offset >= 0 && offset <= (a.length - species.length())) )" we are anyways intrinsifying, should we limit this micro to work only for newly optimized case. 
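
For reference, the out-of-range path quoted above (`checkMaskFromIndexSize` followed by a lane-wise `vOp`) has roughly the following scalar meaning. This is a plain-Java sketch and does not use the Vector API; `maskedLoad` and the `boolean[]` mask are hypothetical stand-ins chosen for illustration. Only lanes whose mask bit is set are read and bounds-checked, and unset lanes are left as zero.

```java
import java.util.Arrays;

public class MaskedLoadSketch {
    // Hypothetical helper, not part of jdk.incubator.vector: loads mask.length lanes
    // starting at a[offset], touching only the lanes whose mask bit is set.
    static byte[] maskedLoad(byte[] a, int offset, boolean[] mask) {
        byte[] lanes = new byte[mask.length];
        for (int i = 0; i < mask.length; i++) {
            if (!mask[i]) {
                continue; // unset lanes stay zero and the array is never read for them
            }
            int idx = offset + i;
            if (idx < 0 || idx >= a.length) {
                // A set lane outside the array is an error, as in the scalar fallback.
                throw new IndexOutOfBoundsException("lane " + i + " reads index " + idx);
            }
            lanes[i] = a[idx];
        }
        return lanes;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        // The 4-lane window starting at offset 1 crosses the end of the array,
        // but the lanes that would fall outside are masked off.
        boolean[] tailMask = {true, true, false, false};
        System.out.println(Arrays.toString(maskedLoad(a, 1, tailMask))); // prints [2, 3, 0, 0]
    }
}
```

A benchmark restricted to the newly optimized case would use offsets like the one in `main`, where the full vector window crosses the array end while every set lane stays in bounds.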
------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From fgao at openjdk.java.net Mon Jun 6 13:29:30 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Mon, 6 Jun 2022 13:29:30 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v7] In-Reply-To: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: > After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like: > int <-> double > float <-> long > int <-> long > float <-> double > > A typical test case: > > int[] a; > double[] b; > for (int i = start; i < limit; i++) { > b[i] = (double) a[i]; > } > > Our expected OptoAssembly code for one iteration is like below: > > add R12, R2, R11, LShiftL #2 > vector_load V16,[R12, #16] > vectorcast_i2d V16, V16 # convert I to D vector > add R11, R1, R11, LShiftL #3 # ptr > add R13, R11, #16 # ptr > vector_store [R13], V16 > > To enable the vectorization, the patch solves the following problems in the SLP. > > There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain > and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. 
In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with new type.
>
> After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation.
>
> In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 are just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's potential to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well.
>
> Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use().
>
> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.
>
> Here is the test data (-XX:+UseSuperWord) on NEON:
>
> Before the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  216.431 ±  0.131  ns/op
> convertD2I       523  avgt   15  220.522 ±  0.311  ns/op
> convertF2D       523  avgt   15  217.034 ±  0.292  ns/op
> convertF2L       523  avgt   15  231.634 ±  1.881  ns/op
> convertI2D       523  avgt   15  229.538 ±  0.095  ns/op
> convertI2L       523  avgt   15  214.822 ±  0.131  ns/op
> convertL2F       523  avgt   15  230.188 ±  0.217  ns/op
> convertL2I       523  avgt   15  162.234 ±  0.235  ns/op
>
> After the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  124.352 ±  1.079  ns/op
> convertD2I       523  avgt   15  557.388 ±  8.166  ns/op
> convertF2D       523  avgt   15  118.082 ±  4.026  ns/op
> convertF2L       523  avgt   15  225.810 ± 11.180  ns/op
> convertI2D       523  avgt   15  166.247 ±  0.120  ns/op
> convertI2L       523  avgt   15  119.699 ±  2.925  ns/op
> convertL2F       523  avgt   15  220.847 ±  0.053  ns/op
> convertL2I       523  avgt   15  122.339 ±  2.738  ns/op
>
> perf data on X86:
> Before the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  279.466 ±  0.069  ns/op
> convertD2I       523  avgt   15  551.009 ±  7.459  ns/op
> convertF2D       523  avgt   15  276.066 ±  0.117  ns/op
> convertF2L       523  avgt   15  545.108 ±  5.697  ns/op
> convertI2D       523  avgt   15  745.303 ±  0.185  ns/op
> convertI2L       523  avgt   15  260.878 ±  0.044  ns/op
> convertL2F       523  avgt   15  502.016 ±  0.172  ns/op
> convertL2I       523  avgt   15  261.654 ±  3.326  ns/op
>
> After the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  106.975 ±  0.045  ns/op
> convertD2I       523  avgt   15  546.866 ±  9.287  ns/op
> convertF2D       523  avgt   15   82.414 ±  0.340  ns/op
> convertF2L       523  avgt   15  542.235 ±  2.785  ns/op
> convertI2D       523  avgt   15   92.966 ±  1.400  ns/op
> convertI2L       523  avgt   15   79.960 ±  0.528  ns/op
> convertL2F       523  avgt   15  504.712 ±  4.794  ns/op
> convertL2I       523  avgt   15  129.753 ±  0.094  ns/op
>
> perf data on AVX512:
> Before the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  282.984 ±  4.022  ns/op
> convertD2I       523  avgt   15  543.080 ±  3.873  ns/op
> convertF2D       523  avgt   15  273.950 ±  0.131  ns/op
> convertF2L       523  avgt   15  539.568 ±  2.747  ns/op
> convertI2D       523  avgt   15  745.238 ±  0.069  ns/op
> convertI2L       523  avgt   15  260.935 ±  0.169  ns/op
> convertL2F       523  avgt   15  501.870 ±  0.359  ns/op
> convertL2I       523  avgt   15  257.508 ±  0.174  ns/op
>
> After the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15   76.687 ±  0.530  ns/op
> convertD2I       523  avgt   15  545.408 ±  4.657  ns/op
> convertF2D       523  avgt   15  273.935 ±  0.099  ns/op
> convertF2L       523  avgt   15  540.534 ±  3.032  ns/op
> convertI2D       523  avgt   15  745.234 ±  0.053  ns/op
> convertI2L       523  avgt   15  260.865 ±  0.104  ns/op
> convertL2F       523  avgt   15   63.834 ±  4.777  ns/op
> convertL2I       523  avgt   15   48.183 ±  0.990  ns/op

Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits:

 - Add assertion line for opcode() and withdraw some common code as a function

   Change-Id: I7b5dbe60fec6979de454f347d074e6fc01126dfe
 - Merge branch 'master' into fg8283091

   Change-Id: I42bec08da55e86fb1f049bb691138f3fcf6dbed6
 - Implement an interface for auto-vectorization to consult supported match rules

   Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701
 - Merge branch 'master' into fg8283091

   Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd
 - Merge branch 'master' into fg8283091

   Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f
 - Merge branch 'master' into fg8283091

   Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83
 - Add micro-benchmark cases

   Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8
 - Merge branch 'master' into fg8283091

   Change-Id: I674581135fd0844accc65520574fcef161eededa
 - 8283091: Support type conversion between different data sizes in SLP

   After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size.
We can also support conversions between different data sizes like: int <-> double float <-> long int <-> long float <-> double A typical test case: int[] a; double[] b; for (int i = start; i < limit; i++) { b[i] = (double) a[i]; } Our expected OptoAssembly code for one iteration is like below: add R12, R2, R11, LShiftL #2 vector_load V16,[R12, #16] vectorcast_i2d V16, V16 # convert I to D vector add R11, R1, R11, LShiftL #3 # ptr add R13, R11, #16 # ptr vector_store [R13], V16 To enable the vectorization, the patch solves the following problems in the SLP. There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with new type. After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. 
Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 are just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's potential to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well. Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use. After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes. Here is the test data on NEON: Before the patch: Benchmark (length) Mode Cnt Score Error Units VectorLoop.convertD2F 523 avgt 15 216.431 ? 0.131 ns/op VectorLoop.convertD2I 523 avgt 15 220.522 ? 0.311 ns/op VectorLoop.convertF2D 523 avgt 15 217.034 ? 0.292 ns/op VectorLoop.convertF2L 523 avgt 15 231.634 ? 1.881 ns/op VectorLoop.convertI2D 523 avgt 15 229.538 ? 0.095 ns/op VectorLoop.convertI2L 523 avgt 15 214.822 ? 0.131 ns/op VectorLoop.convertL2F 523 avgt 15 230.188 ? 0.217 ns/op VectorLoop.convertL2I 523 avgt 15 162.234 ? 0.235 ns/op After the patch: Benchmark (length) Mode Cnt Score Error Units VectorLoop.convertD2F 523 avgt 15 124.352 ? 1.079 ns/op VectorLoop.convertD2I 523 avgt 15 557.388 ? 8.166 ns/op VectorLoop.convertF2D 523 avgt 15 118.082 ? 
4.026 ns/op VectorLoop.convertF2L 523 avgt 15 225.810 ? 11.180 ns/op VectorLoop.convertI2D 523 avgt 15 166.247 ? 0.120 ns/op VectorLoop.convertI2L 523 avgt 15 119.699 ? 2.925 ns/op VectorLoop.convertL2F 523 avgt 15 220.847 ? 0.053 ns/op VectorLoop.convertL2I 523 avgt 15 122.339 ? 2.738 ns/op perf data on X86: Before the patch: Benchmark (length) Mode Cnt Score Error Units VectorLoop.convertD2F 523 avgt 15 279.466 ? 0.069 ns/op VectorLoop.convertD2I 523 avgt 15 551.009 ? 7.459 ns/op VectorLoop.convertF2D 523 avgt 15 276.066 ? 0.117 ns/op VectorLoop.convertF2L 523 avgt 15 545.108 ? 5.697 ns/op VectorLoop.convertI2D 523 avgt 15 745.303 ? 0.185 ns/op VectorLoop.convertI2L 523 avgt 15 260.878 ? 0.044 ns/op VectorLoop.convertL2F 523 avgt 15 502.016 ? 0.172 ns/op VectorLoop.convertL2I 523 avgt 15 261.654 ? 3.326 ns/op After the patch: Benchmark (length) Mode Cnt Score Error Units VectorLoop.convertD2F 523 avgt 15 106.975 ? 0.045 ns/op VectorLoop.convertD2I 523 avgt 15 546.866 ? 9.287 ns/op VectorLoop.convertF2D 523 avgt 15 82.414 ? 0.340 ns/op VectorLoop.convertF2L 523 avgt 15 542.235 ? 2.785 ns/op VectorLoop.convertI2D 523 avgt 15 92.966 ? 1.400 ns/op VectorLoop.convertI2L 523 avgt 15 79.960 ? 0.528 ns/op VectorLoop.convertL2F 523 avgt 15 504.712 ? 4.794 ns/op VectorLoop.convertL2I 523 avgt 15 129.753 ? 0.094 ns/op perf data on AVX512: Before the patch: Benchmark (length) Mode Cnt Score Error Units VectorLoop.convertD2F 523 avgt 15 282.984 ? 4.022 ns/op VectorLoop.convertD2I 523 avgt 15 543.080 ? 3.873 ns/op VectorLoop.convertF2D 523 avgt 15 273.950 ? 0.131 ns/op VectorLoop.convertF2L 523 avgt 15 539.568 ? 2.747 ns/op VectorLoop.convertI2D 523 avgt 15 745.238 ? 0.069 ns/op VectorLoop.convertI2L 523 avgt 15 260.935 ? 0.169 ns/op VectorLoop.convertL2F 523 avgt 15 501.870 ? 0.359 ns/op VectorLoop.convertL2I 523 avgt 15 257.508 ? 0.174 ns/op After the patch: Benchmark (length) Mode Cnt Score Error Units VectorLoop.convertD2F 523 avgt 15 76.687 ? 
0.530 ns/op VectorLoop.convertD2I 523 avgt 15 545.408 ? 4.657 ns/op VectorLoop.convertF2D 523 avgt 15 273.935 ? 0.099 ns/op VectorLoop.convertF2L 523 avgt 15 540.534 ? 3.032 ns/op VectorLoop.convertI2D 523 avgt 15 745.234 ? 0.053 ns/op VectorLoop.convertI2L 523 avgt 15 260.865 ? 0.104 ns/op VectorLoop.convertL2F 523 avgt 15 63.834 ? 4.777 ns/op VectorLoop.convertL2I 523 avgt 15 48.183 ? 0.990 ns/op Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef ------------- Changes: https://git.openjdk.java.net/jdk/pull/7806/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7806&range=06 Stats: 1282 lines in 22 files changed: 1223 ins; 13 del; 46 mod Patch: https://git.openjdk.java.net/jdk/pull/7806.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7806/head:pull/7806 PR: https://git.openjdk.java.net/jdk/pull/7806 From duke at openjdk.java.net Mon Jun 6 13:33:08 2022 From: duke at openjdk.java.net (openjdk-notifier[bot]) Date: Mon, 6 Jun 2022 13:33:08 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v3] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <3_-2N1Kf4WIryx7eFIrXomabZJTeVNvSJ10joWdzN4s=.a16c8b8e-0834-48f8-9eac-6aaf07822ad5@github.com> Message-ID: On Thu, 2 Jun 2022 23:24:32 GMT, Fei Gao wrote: >>> . When we set `MaxVectorSize=16`, the case here would fail. All jtreg tests passed except that one. >> >> Do you mean it fail without #8961 or it fail always? > >> Do you mean it fail without #8961 or it fail always? > > @vnkozlov ,after [JDK-8286972](https://bugs.openjdk.java.net/browse/JDK-8286972), [the case](https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/test/hotspot/jtreg/compiler/c2/cr6340864/TestDoubleVect.java#L723) would fail without https://github.com/openjdk/jdk/pull/8961. With it, the case passes. @fg1417 Please do not rebase or force-push to an active PR as it invalidates existing review comments. 
All changes will be squashed into a single commit automatically when integrating. See [OpenJDK Developers' Guide](https://openjdk.java.net/guide/#working-with-pull-requests) for more information.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7806

From fgao at openjdk.java.net Mon Jun 6 13:37:51 2022
From: fgao at openjdk.java.net (Fei Gao)
Date: Mon, 6 Jun 2022 13:37:51 GMT
Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v7]
In-Reply-To:
References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com>
Message-ID: <2pr7XCDhYeN9HZbLbn2P99IcEkfh6T5nZdg3ho-jFxI=.3cb07cf2-b557-4f07-94bb-d7ca18044931@github.com>

On Mon, 6 Jun 2022 13:29:30 GMT, Fei Gao wrote:

>> After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like:
>> int <-> double
>> float <-> long
>> int <-> long
>> float <-> double
>>
>> A typical test case:
>>
>> int[] a;
>> double[] b;
>> for (int i = start; i < limit; i++) {
>>     b[i] = (double) a[i];
>> }
>>
>> Our expected OptoAssembly code for one iteration is like below:
>>
>> add R12, R2, R11, LShiftL #2
>> vector_load V16,[R12, #16]
>> vectorcast_i2d V16, V16 # convert I to D vector
>> add R11, R1, R11, LShiftL #3 # ptr
>> add R13, R11, #16 # ptr
>> vector_store [R13], V16
>>
>> To enable the vectorization, the patch solves the following problems in the SLP.
>>
>> There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes.
However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with new type.
>>
>> After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation.
>>
>> In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 are just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's potential to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well.
>>
>> Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use().
>>
>> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.
>>
>> Here is the test data (-XX:+UseSuperWord) on NEON:
>>
>> Before the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15  216.431 ±  0.131  ns/op
>> convertD2I       523  avgt   15  220.522 ±  0.311  ns/op
>> convertF2D       523  avgt   15  217.034 ±  0.292  ns/op
>> convertF2L       523  avgt   15  231.634 ±  1.881  ns/op
>> convertI2D       523  avgt   15  229.538 ±  0.095  ns/op
>> convertI2L       523  avgt   15  214.822 ±  0.131  ns/op
>> convertL2F       523  avgt   15  230.188 ±  0.217  ns/op
>> convertL2I       523  avgt   15  162.234 ±  0.235  ns/op
>>
>> After the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15  124.352 ±  1.079  ns/op
>> convertD2I       523  avgt   15  557.388 ±  8.166  ns/op
>> convertF2D       523  avgt   15  118.082 ±  4.026  ns/op
>> convertF2L       523  avgt   15  225.810 ± 11.180  ns/op
>> convertI2D       523  avgt   15  166.247 ±  0.120  ns/op
>> convertI2L       523  avgt   15  119.699 ±  2.925  ns/op
>> convertL2F       523  avgt   15  220.847 ±  0.053  ns/op
>> convertL2I       523  avgt   15  122.339 ±  2.738  ns/op
>>
>> perf data on X86:
>> Before the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15  279.466 ±  0.069  ns/op
>> convertD2I       523  avgt   15  551.009 ±  7.459  ns/op
>> convertF2D       523  avgt   15  276.066 ±  0.117  ns/op
>> convertF2L       523  avgt   15  545.108 ±  5.697  ns/op
>> convertI2D       523  avgt   15  745.303 ±  0.185  ns/op
>> convertI2L       523  avgt   15  260.878 ±  0.044  ns/op
>> convertL2F       523  avgt   15  502.016 ±  0.172  ns/op
>> convertL2I       523  avgt   15  261.654 ±  3.326  ns/op
>>
>> After the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15  106.975 ±  0.045  ns/op
>> convertD2I       523  avgt   15  546.866 ±  9.287  ns/op
>> convertF2D       523  avgt   15   82.414 ±  0.340  ns/op
>> convertF2L       523  avgt   15  542.235 ±  2.785  ns/op
>> convertI2D       523  avgt   15   92.966 ±  1.400  ns/op
>> convertI2L       523  avgt   15   79.960 ±  0.528  ns/op
>> convertL2F       523  avgt   15  504.712 ±  4.794  ns/op
>> convertL2I       523  avgt   15  129.753 ±  0.094  ns/op
>>
>> perf data on AVX512:
>> Before the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15  282.984 ±  4.022  ns/op
>> convertD2I       523  avgt   15  543.080 ±  3.873  ns/op
>> convertF2D       523  avgt   15  273.950 ±  0.131  ns/op
>> convertF2L       523  avgt   15  539.568 ±  2.747  ns/op
>> convertI2D       523  avgt   15  745.238 ±  0.069  ns/op
>> convertI2L       523  avgt   15  260.935 ±  0.169  ns/op
>> convertL2F       523  avgt   15  501.870 ±  0.359  ns/op
>> convertL2I       523  avgt   15  257.508 ±  0.174  ns/op
>>
>> After the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15   76.687 ±  0.530  ns/op
>> convertD2I       523  avgt   15  545.408 ±  4.657  ns/op
>> convertF2D       523  avgt   15  273.935 ±  0.099  ns/op
>> convertF2L       523  avgt   15  540.534 ±  3.032  ns/op
>> convertI2D       523  avgt   15  745.234 ±  0.053  ns/op
>> convertI2L       523  avgt   15  260.865 ±  0.104  ns/op
>> convertL2F       523  avgt   15   63.834 ±  4.777  ns/op
>> convertL2I       523  avgt   15   48.183 ±  0.990  ns/op
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains ten commits: > > - Add assertion line for opcode() and withdraw some common code as a function > > Change-Id: I7b5dbe60fec6979de454f347d074e6fc01126dfe > - Merge branch 'master' into fg8283091 > > Change-Id: I42bec08da55e86fb1f049bb691138f3fcf6dbed6 > - Implement an interface for auto-vectorization to consult supported match rules > > Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701 > - Merge branch 'master' into fg8283091 > > Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd > - Merge branch 'master' into fg8283091 > > Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f > - Merge branch 'master' into fg8283091 > > Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 > - Add micro-benchmark cases > > Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8 > - Merge branch 'master' into fg8283091 > > Change-Id: I674581135fd0844accc65520574fcef161eededa > - 8283091: Support type conversion between different data sizes in SLP > > After JDK-8275317, C2's SLP vectorizer has supported type conversion > between the same data size. We can also support conversions between > different data sizes like: > int <-> double > float <-> long > int <-> long > float <-> double > > A typical test case: > > int[] a; > double[] b; > for (int i = start; i < limit; i++) { > b[i] = (double) a[i]; > } > > Our expected OptoAssembly code for one iteration is like below: > > add R12, R2, R11, LShiftL #2 > vector_load V16,[R12, #16] > vectorcast_i2d V16, V16 # convert I to D vector > add R11, R1, R11, LShiftL #3 # ptr > add R13, R11, #16 # ptr > vector_store [R13], V16 > > To enable the vectorization, the patch solves the following problems > in the SLP. > > There are three main operations in the case above, LoadI, ConvI2D and > StoreD. Assuming that the vector length is 128 bits, how many scalar > nodes should be packed together to a vector? 
If we decide it > separately for each operation node, like what we did before the patch > in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI > or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes > in a vector node sequence, like loading 4 elements to a vector, then > typecasting 2 elements and lastly storing these 2 elements, they become > invalid. As a result, we should look through the whole def-use chain > and then pick up the minimum of these element sizes, like function > SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. > In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then > generate valid vector node sequence, like loading 2 elements, > converting the 2 elements to another type and storing the 2 elements > with new type. > > After this, LoadI nodes don't make full use of the whole vector and > only occupy part of it. So we adapt the code in > SuperWord::get_vw_bytes_special() to the situation. > > In SLP, we calculate a kind of alignment as position trace for each > scalar node in the whole vector. In this case, the alignments for 2 > LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. > Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which > mark that this node is the second node in the whole vector, while the > difference between 4 and 8 are just because of their own data sizes. In > this situation, we should try to remove the impact caused by different > data size in SLP. For example, in the stage of > SuperWord::extend_packlist(), while determining if it's potential to > pack a pair of def nodes in the function SuperWord::follow_use_defs(), > we remove the side effect of different data size by transforming the > target alignment from the use node. 
Because we believe that, assuming > that the vector length is 512 bits, if the ConvI2D use nodes have > alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, > these two LoadI nodes should be packed as a pair as well. > > Similarly, when determining if the vectorization is profitable, type > conversion between different data size takes a type of one size and > produces a type of another size, hence the special checks on alignment > and size should be applied, like what we do in SuperWord::is_vector_use. > > After solving these problems, we successfully implemented the > vectorization of type conversion between different data sizes. > > Here is the test data on NEON: > > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 216.431 ? 0.131 ns/op > VectorLoop.convertD2I 523 avgt 15 220.522 ? 0.311 ns/op > VectorLoop.convertF2D 523 avgt 15 217.034 ? 0.292 ns/op > VectorLoop.convertF2L 523 avgt 15 231.634 ? 1.881 ns/op > VectorLoop.convertI2D 523 avgt 15 229.538 ? 0.095 ns/op > VectorLoop.convertI2L 523 avgt 15 214.822 ? 0.131 ns/op > VectorLoop.convertL2F 523 avgt 15 230.188 ? 0.217 ns/op > VectorLoop.convertL2I 523 avgt 15 162.234 ? 0.235 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 124.352 ? 1.079 ns/op > VectorLoop.convertD2I 523 avgt 15 557.388 ? 8.166 ns/op > VectorLoop.convertF2D 523 avgt 15 118.082 ? 4.026 ns/op > VectorLoop.convertF2L 523 avgt 15 225.810 ? 11.180 ns/op > VectorLoop.convertI2D 523 avgt 15 166.247 ? 0.120 ns/op > VectorLoop.convertI2L 523 avgt 15 119.699 ? 2.925 ns/op > VectorLoop.convertL2F 523 avgt 15 220.847 ? 0.053 ns/op > VectorLoop.convertL2I 523 avgt 15 122.339 ? 2.738 ns/op > > perf data on X86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 279.466 ? 0.069 ns/op > VectorLoop.convertD2I 523 avgt 15 551.009 ? 
7.459 ns/op > VectorLoop.convertF2D 523 avgt 15 276.066 ? 0.117 ns/op > VectorLoop.convertF2L 523 avgt 15 545.108 ? 5.697 ns/op > VectorLoop.convertI2D 523 avgt 15 745.303 ? 0.185 ns/op > VectorLoop.convertI2L 523 avgt 15 260.878 ? 0.044 ns/op > VectorLoop.convertL2F 523 avgt 15 502.016 ? 0.172 ns/op > VectorLoop.convertL2I 523 avgt 15 261.654 ? 3.326 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 106.975 ? 0.045 ns/op > VectorLoop.convertD2I 523 avgt 15 546.866 ? 9.287 ns/op > VectorLoop.convertF2D 523 avgt 15 82.414 ? 0.340 ns/op > VectorLoop.convertF2L 523 avgt 15 542.235 ? 2.785 ns/op > VectorLoop.convertI2D 523 avgt 15 92.966 ? 1.400 ns/op > VectorLoop.convertI2L 523 avgt 15 79.960 ? 0.528 ns/op > VectorLoop.convertL2F 523 avgt 15 504.712 ? 4.794 ns/op > VectorLoop.convertL2I 523 avgt 15 129.753 ? 0.094 ns/op > > perf data on AVX512: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 282.984 ? 4.022 ns/op > VectorLoop.convertD2I 523 avgt 15 543.080 ? 3.873 ns/op > VectorLoop.convertF2D 523 avgt 15 273.950 ? 0.131 ns/op > VectorLoop.convertF2L 523 avgt 15 539.568 ? 2.747 ns/op > VectorLoop.convertI2D 523 avgt 15 745.238 ? 0.069 ns/op > VectorLoop.convertI2L 523 avgt 15 260.935 ? 0.169 ns/op > VectorLoop.convertL2F 523 avgt 15 501.870 ? 0.359 ns/op > VectorLoop.convertL2I 523 avgt 15 257.508 ? 0.174 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 76.687 ? 0.530 ns/op > VectorLoop.convertD2I 523 avgt 15 545.408 ? 4.657 ns/op > VectorLoop.convertF2D 523 avgt 15 273.935 ? 0.099 ns/op > VectorLoop.convertF2L 523 avgt 15 540.534 ? 3.032 ns/op > VectorLoop.convertI2D 523 avgt 15 745.234 ? 0.053 ns/op > VectorLoop.convertI2L 523 avgt 15 260.865 ? 0.104 ns/op > VectorLoop.convertL2F 523 avgt 15 63.834 ? 4.777 ns/op > VectorLoop.convertL2I 523 avgt 15 48.183 ? 
0.990 ns/op > > Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef > // And exclude cases which are not profitable to auto-vectorize. Done. > Put it into a separate function because this code pattern is used 2 times. Done. > May be we should have assert here to make sure that in all places we call `VectorCastNode::opcode()` for `Conv*` nodes Done. Fixed the comments above and rebased to the latest JDK. All jtreg tests passed. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From fgao at openjdk.java.net Mon Jun 6 14:02:49 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Mon, 6 Jun 2022 14:02:49 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v7] In-Reply-To: <2pr7XCDhYeN9HZbLbn2P99IcEkfh6T5nZdg3ho-jFxI=.3cb07cf2-b557-4f07-94bb-d7ca18044931@github.com> References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <2pr7XCDhYeN9HZbLbn2P99IcEkfh6T5nZdg3ho-jFxI=.3cb07cf2-b557-4f07-94bb-d7ca18044931@github.com> Message-ID: <7J_YrNCpwrbWXqXRpjdlLjosOlh1DlL06FytAxdR-E8=.20049f03-f580-41b4-96ea-a50f8d4f23fd@github.com> On Mon, 6 Jun 2022 13:32:57 GMT, Fei Gao wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains ten commits: >> >> - Add assertion line for opcode() and withdraw some common code as a function >> >> Change-Id: I7b5dbe60fec6979de454f347d074e6fc01126dfe >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I42bec08da55e86fb1f049bb691138f3fcf6dbed6 >> - Implement an interface for auto-vectorization to consult supported match rules >> >> Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701 >> - Merge branch 'master' into fg8283091 >> >> Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 >> - Add micro-benchmark cases >> >> Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8 >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I674581135fd0844accc65520574fcef161eededa >> - 8283091: Support type conversion between different data sizes in SLP >> >> After JDK-8275317, C2's SLP vectorizer has supported type conversion >> between the same data size. We can also support conversions between >> different data sizes like: >> int <-> double >> float <-> long >> int <-> long >> float <-> double >> >> A typical test case: >> >> int[] a; >> double[] b; >> for (int i = start; i < limit; i++) { >> b[i] = (double) a[i]; >> } >> >> Our expected OptoAssembly code for one iteration is like below: >> >> add R12, R2, R11, LShiftL #2 >> vector_load V16,[R12, #16] >> vectorcast_i2d V16, V16 # convert I to D vector >> add R11, R1, R11, LShiftL #3 # ptr >> add R13, R11, #16 # ptr >> vector_store [R13], V16 >> >> To enable the vectorization, the patch solves the following problems >> in the SLP. >> >> There are three main operations in the case above, LoadI, ConvI2D and >> StoreD. Assuming that the vector length is 128 bits, how many scalar >> nodes should be packed together to a vector? 
If we decide it >> separately for each operation node, like what we did before the patch >> in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI >> or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes >> in a vector node sequence, like loading 4 elements to a vector, then >> typecasting 2 elements and lastly storing these 2 elements, they become >> invalid. As a result, we should look through the whole def-use chain >> and then pick up the minimum of these element sizes, like function >> SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. >> In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then >> generate valid vector node sequence, like loading 2 elements, >> converting the 2 elements to another type and storing the 2 elements >> with new type. >> >> After this, LoadI nodes don't make full use of the whole vector and >> only occupy part of it. So we adapt the code in >> SuperWord::get_vw_bytes_special() to the situation. >> >> In SLP, we calculate a kind of alignment as position trace for each >> scalar node in the whole vector. In this case, the alignments for 2 >> LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. >> Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which >> mark that this node is the second node in the whole vector, while the >> difference between 4 and 8 are just because of their own data sizes. In >> this situation, we should try to remove the impact caused by different >> data size in SLP. For example, in the stage of >> SuperWord::extend_packlist(), while determining if it's potential to >> pack a pair of def nodes in the function SuperWord::follow_use_defs(), >> we remove the side effect of different data size by transforming the >> target alignment from the use node. 
Because we believe that, assuming >> that the vector length is 512 bits, if the ConvI2D use nodes have >> alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, >> these two LoadI nodes should be packed as a pair as well. >> >> Similarly, when determining if the vectorization is profitable, type >> conversion between different data size takes a type of one size and >> produces a type of another size, hence the special checks on alignment >> and size should be applied, like what we do in SuperWord::is_vector_use. >> >> After solving these problems, we successfully implemented the >> vectorization of type conversion between different data sizes. >> >> Here is the test data on NEON: >> >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> VectorLoop.convertD2F 523 avgt 15 216.431 ? 0.131 ns/op >> VectorLoop.convertD2I 523 avgt 15 220.522 ? 0.311 ns/op >> VectorLoop.convertF2D 523 avgt 15 217.034 ? 0.292 ns/op >> VectorLoop.convertF2L 523 avgt 15 231.634 ? 1.881 ns/op >> VectorLoop.convertI2D 523 avgt 15 229.538 ? 0.095 ns/op >> VectorLoop.convertI2L 523 avgt 15 214.822 ? 0.131 ns/op >> VectorLoop.convertL2F 523 avgt 15 230.188 ? 0.217 ns/op >> VectorLoop.convertL2I 523 avgt 15 162.234 ? 0.235 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> VectorLoop.convertD2F 523 avgt 15 124.352 ? 1.079 ns/op >> VectorLoop.convertD2I 523 avgt 15 557.388 ? 8.166 ns/op >> VectorLoop.convertF2D 523 avgt 15 118.082 ? 4.026 ns/op >> VectorLoop.convertF2L 523 avgt 15 225.810 ? 11.180 ns/op >> VectorLoop.convertI2D 523 avgt 15 166.247 ? 0.120 ns/op >> VectorLoop.convertI2L 523 avgt 15 119.699 ? 2.925 ns/op >> VectorLoop.convertL2F 523 avgt 15 220.847 ? 0.053 ns/op >> VectorLoop.convertL2I 523 avgt 15 122.339 ? 2.738 ns/op >> >> perf data on X86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> VectorLoop.convertD2F 523 avgt 15 279.466 ? 0.069 ns/op >> VectorLoop.convertD2I 523 avgt 15 551.009 ? 
7.459 ns/op >> VectorLoop.convertF2D 523 avgt 15 276.066 ? 0.117 ns/op >> VectorLoop.convertF2L 523 avgt 15 545.108 ? 5.697 ns/op >> VectorLoop.convertI2D 523 avgt 15 745.303 ? 0.185 ns/op >> VectorLoop.convertI2L 523 avgt 15 260.878 ? 0.044 ns/op >> VectorLoop.convertL2F 523 avgt 15 502.016 ? 0.172 ns/op >> VectorLoop.convertL2I 523 avgt 15 261.654 ? 3.326 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> VectorLoop.convertD2F 523 avgt 15 106.975 ? 0.045 ns/op >> VectorLoop.convertD2I 523 avgt 15 546.866 ? 9.287 ns/op >> VectorLoop.convertF2D 523 avgt 15 82.414 ? 0.340 ns/op >> VectorLoop.convertF2L 523 avgt 15 542.235 ? 2.785 ns/op >> VectorLoop.convertI2D 523 avgt 15 92.966 ? 1.400 ns/op >> VectorLoop.convertI2L 523 avgt 15 79.960 ? 0.528 ns/op >> VectorLoop.convertL2F 523 avgt 15 504.712 ? 4.794 ns/op >> VectorLoop.convertL2I 523 avgt 15 129.753 ? 0.094 ns/op >> >> perf data on AVX512: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> VectorLoop.convertD2F 523 avgt 15 282.984 ? 4.022 ns/op >> VectorLoop.convertD2I 523 avgt 15 543.080 ? 3.873 ns/op >> VectorLoop.convertF2D 523 avgt 15 273.950 ? 0.131 ns/op >> VectorLoop.convertF2L 523 avgt 15 539.568 ? 2.747 ns/op >> VectorLoop.convertI2D 523 avgt 15 745.238 ? 0.069 ns/op >> VectorLoop.convertI2L 523 avgt 15 260.935 ? 0.169 ns/op >> VectorLoop.convertL2F 523 avgt 15 501.870 ? 0.359 ns/op >> VectorLoop.convertL2I 523 avgt 15 257.508 ? 0.174 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> VectorLoop.convertD2F 523 avgt 15 76.687 ? 0.530 ns/op >> VectorLoop.convertD2I 523 avgt 15 545.408 ? 4.657 ns/op >> VectorLoop.convertF2D 523 avgt 15 273.935 ? 0.099 ns/op >> VectorLoop.convertF2L 523 avgt 15 540.534 ? 3.032 ns/op >> VectorLoop.convertI2D 523 avgt 15 745.234 ? 0.053 ns/op >> VectorLoop.convertI2L 523 avgt 15 260.865 ? 0.104 ns/op >> VectorLoop.convertL2F 523 avgt 15 63.834 ? 
4.777 ns/op >> VectorLoop.convertL2I 523 avgt 15 48.183 ? 0.990 ns/op >> >> Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef > >> // And exclude cases which are not profitable to auto-vectorize. > > Done. > >> Put it into a separate function because this code pattern is used 2 times. > > Done. > >> May be we should have assert here to make sure that in all places we call `VectorCastNode::opcode()` for `Conv*` nodes > > Done. > > Fixed the comments above and rebased to the latest JDK. All jtreg tests passed. > > Thanks. > @fg1417 Please do not rebase or force-push to an active PR as it invalidates existing review comments. All changes will be squashed into a single commit automatically when integrating. See [OpenJDK Developers' Guide](https://openjdk.java.net/guide/#working-with-pull-requests) for more information. May I ask if I did anything wrong? I just rebased on master, resolved the conflict and pushed a new commit as the guide says... and did not do any force-push... Why did I get the notification this time? ------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From sviswanathan at openjdk.java.net Mon Jun 6 14:36:44 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Mon, 6 Jun 2022 14:36:44 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 In-Reply-To: References: Message-ID: On Sun, 5 Jun 2022 01:42:40 GMT, Vladimir Kozlov wrote: >> Currently the C2 JIT only supports float -> int and double -> long conversion for x86. >> This PR adds the support for following conversions in the c2 JIT: >> float -> long, short, byte >> double -> int, short, byte >> >> The performance gain is as follows. >> Before the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 32367.971 ? 6161.118 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 25825.251 ? 5417.104 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59641.958 ? 
17307.177 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 29641.505 ? 12023.015 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 16271.224 ? 1523.083 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59199.994 ? 14357.959 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 17169.197 ? 1738.273 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 14934.139 ? 2329.253 ops/ms >> >> After the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 115436.659 ? 21282.364 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 87194.395 ? 9443.106 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59652.356 ? 7240.721 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 110570.719 ? 10401.620 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ? 11113.137 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ? 18272.495 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ? 7249.268 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ? 8253.232 ops/ms >> >> Please review. >> >> Best Regards, >> Sandhya > > src/hotspot/cpu/x86/x86.ad line 1877: > >> 1875: if (is_integral_type(bt) && !VM_Version::supports_avx512dq()) { >> 1876: return false; >> 1877: } > > Overlapping conditions for the same types are confusing. I will add comments and rephrase the checks to make it clearer. > src/hotspot/cpu/x86/x86.ad line 1889: > >> 1887: return false; >> 1888: } >> 1889: if ((bt == T_LONG) && !VM_Version::supports_avx512dq()) { > > Again overlapping conditions. So T_LONG requires both: AVX512, avx512vl and avx512dq? > > What about T_INT? T_INT doesn't need AVX512dq. Float to long conversion (T_LONG) uses evcvttps2qq, which needs AVX512dq. 
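A simplified model of the point made in this reply, written in Java rather than the `.ad` file's C++: only the float to long cast depends on AVX512DQ, while float to int does not. The names (`CastF2XSketch`, `Target`, `needsAvx512dq`, `supported`) are hypothetical, and the sketch deliberately ignores every other condition discussed in the review, such as the avx512vl checks.

```java
public class CastF2XSketch {
    // Hypothetical stand-in for the element type of the cast result.
    enum Target { BYTE, SHORT, INT, LONG }

    // Per the reply above: the float -> long cast uses evcvttps2qq,
    // which needs AVX512DQ; float -> int does not need it.
    static boolean needsAvx512dq(Target to) {
        return to == Target.LONG;
    }

    // Simplified check modeling only the AVX512DQ requirement.
    static boolean supported(Target to, boolean cpuHasAvx512dq) {
        return !needsAvx512dq(to) || cpuHasAvx512dq;
    }

    public static void main(String[] args) {
        System.out.println(supported(Target.INT, false));   // prints true
        System.out.println(supported(Target.LONG, false));  // prints false
        System.out.println(supported(Target.LONG, true));   // prints true
    }
}
```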
------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From psandoz at openjdk.java.net Mon Jun 6 15:44:50 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Mon, 6 Jun 2022 15:44:50 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5] In-Reply-To: References: Message-ID: On Thu, 2 Jun 2022 03:27:59 GMT, Xiaohong Gong wrote: >> Currently, a masked vector load whose index range falls outside the array bounds is implemented with pure Java scalar code to avoid the IOOBE (IndexOutOfBoundsException). This is necessary for architectures that do not support the predicate feature, because the masked load is implemented with a full vector load and a vector blend applied on it, and a full vector load will definitely cause the IOOBE, which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise an exception. >> >> This patch adds the vectorization support for the masked load with IOOBE part. Please see the original Java implementation (FIXME: optimize): >> >> >> @ForceInline >> public static >> ByteVector fromArray(VectorSpecies<Byte> species, >> byte[] a, int offset, >> VectorMask<Byte> m) { >> ByteSpecies vsp = (ByteSpecies) species; >> if (offset >= 0 && offset <= (a.length - species.length())) { >> return vsp.dummyVector().fromArray0(a, offset, m); >> } >> >> // FIXME: optimize >> checkMaskFromIndexSize(offset, vsp, m, 1, a.length); >> return vsp.vOp(m, i -> a[offset + i]); >> } >> >> Since it can only be vectorized with the predicate load, HotSpot must check whether the current backend supports it and fall back to the Java scalar version if not. 
This is different from the normal masked vector load, for which the compiler generates a full vector load and a vector blend if the predicate load is not supported. So to let the compiler take the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part and "false" for the normal load. The compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that the normal Java path will be executed. >> >> Also adds the same vectorization support for masked: >> - fromByteArray/fromByteBuffer >> - fromBooleanArray >> - fromCharArray >> >> The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` on the x86 AVX-512 system: >> >> Benchmark before After Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms >> >> A similar performance gain can also be observed on a 512-bit SVE system. > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: > > - Merge branch 'jdk:master' into JDK-8283667 > - Use integer constant for offsetInRange all the way through > - Rename "use_predicate" to "needs_predicate" > - Rename the "usePred" to "offsetInRange" > - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature Looks good. 
As a follow-on PR I think it would be useful to add constants `OFFSET_IN_RANGE` and `OFFSET_OUT_OF_RANGE`; the source then becomes much clearer and you can drop the `/* offsetInRange */` comment on the argument. ------------- Marked as reviewed by psandoz (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8035 From kvn at openjdk.java.net Mon Jun 6 20:35:52 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 6 Jun 2022 20:35:52 GMT Subject: RFR: 8287840: Dead copy region node blocks IfNode's fold-compares In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 01:15:35 GMT, Xin Liu wrote: > IfNode::fold_compares() requires that ctrl has a single output. I found that some fold-compares cases are postponed to IterGVN2. The reason is that a dead region prevents IfNode::fold_compares() from transforming code. The dead node is removed in IterGVN, but it's too late. > > This PR extends Node::has_special_unique_user() so `PhaseIterGVN::remove_globally_dead_node()` puts the IfNode back on the worklist. The following attempt will carry out fold_compares(). Looks good. Let me test it. ------------- PR: https://git.openjdk.java.net/jdk/pull/9035 From kvn at openjdk.java.net Mon Jun 6 20:35:56 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 6 Jun 2022 20:35:56 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 09:42:02 GMT, Xiaohong Gong wrote: > The VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. 
> > For example, if the user defines an IntVector with 256-bit species, and runs it on SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. A mask in which all lanes beyond the real vector length are set to 0 is generated for some ops. > > Currently the mask is generated in the backend, together with the code generation for each op in the match rule. This generates many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant and could be hoisted outside of the loop. > > Here is an example for vector load and add reduction inside a loop: > > ptrue p0.s, vl8 ; mask generation > ld1w {z16.s}, p0/z, [x14] ; load vector > > ptrue p0.s, vl8 ; mask generation > uaddv d17, p0, z16.s ; add reduction > smov x14, v17.s[0] > > As we can see, the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by GVN and hoisted outside of the loop. > > Note that for masked vector operations, there is no need to generate an additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. 
> > Here is the performance gain for the 256-bit vector reductions on an SVE 512-bit system: > > Benchmark size Gain > Byte256Vector.ADDLanes 1024 0.999 > Byte256Vector.ANDLanes 1024 1.065 > Byte256Vector.MAXLanes 1024 1.064 > Byte256Vector.MINLanes 1024 1.062 > Byte256Vector.ORLanes 1024 1.072 > Byte256Vector.XORLanes 1024 1.041 > Short256Vector.ADDLanes 1024 1.017 > Short256Vector.ANDLanes 1024 1.044 > Short256Vector.MAXLanes 1024 1.049 > Short256Vector.MINLanes 1024 1.049 > Short256Vector.ORLanes 1024 1.089 > Short256Vector.XORLanes 1024 1.047 > Int256Vector.ADDLanes 1024 1.045 > Int256Vector.ANDLanes 1024 1.078 > Int256Vector.MAXLanes 1024 1.123 > Int256Vector.MINLanes 1024 1.129 > Int256Vector.ORLanes 1024 1.078 > Int256Vector.XORLanes 1024 1.072 > Long256Vector.ADDLanes 1024 1.059 > Long256Vector.ANDLanes 1024 1.101 > Long256Vector.MAXLanes 1024 1.079 > Long256Vector.MINLanes 1024 1.099 > Long256Vector.ORLanes 1024 1.098 > Long256Vector.XORLanes 1024 1.110 > Float256Vector.ADDLanes 1024 1.033 > Float256Vector.MAXLanes 1024 1.156 > Float256Vector.MINLanes 1024 1.151 > Double256Vector.ADDLanes 1024 1.062 > Double256Vector.MAXLanes 1024 1.145 > Double256Vector.MINLanes 1024 1.140 > > This patch also adds 32-bit variants of the SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: > > sxtw x14, w14 > whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 Changes are significant. I suggest waiting for JDK 20 (next week). 
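For readers following the quoted description, the mask used by a partial vector operation can be pictured with a scalar Java model. This is only an illustration under stated assumptions, not the C2 mask IR or SVE code: the class and method names are hypothetical, and the mask is modeled as a `boolean[]` over all hardware lanes, with only the lanes covered by the species set, analogous to `ptrue p0.s, vl8` for a 256-bit int species on 512-bit hardware.

```java
public class PartialMaskSketch {
    // Hypothetical model: one flag per hardware lane; lanes beyond the
    // species length stay false (cleared).
    static boolean[] partialMask(int hardwareBits, int speciesBits, int laneBits) {
        boolean[] mask = new boolean[hardwareBits / laneBits];
        int activeLanes = speciesBits / laneBits;
        for (int i = 0; i < activeLanes; i++) {
            mask[i] = true;
        }
        return mask;
    }

    // Add reduction that only looks at active lanes, as a predicated
    // reduction would.
    static long maskedAddReduce(int[] lanes, boolean[] mask) {
        long sum = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                sum += lanes[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        boolean[] mask = partialMask(512, 256, 32); // 16 lanes, first 8 active
        int[] register = new int[mask.length];
        for (int i = 0; i < register.length; i++) {
            register[i] = i + 1; // upper lanes hold values that must be ignored
        }
        System.out.println(maskedAddReduce(register, mask)); // prints 36
    }
}
```

Because the mask depends only on the hardware width, species width and lane size, it is the same value on every loop iteration, which is why the description expects it to be shared between operations and hoisted out of the loop once it is expressed in IR.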
-------------

PR: https://git.openjdk.java.net/jdk/pull/9037

From kvn at openjdk.java.net Mon Jun 6 20:35:55 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 6 Jun 2022 20:35:55 GMT
Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86
In-Reply-To:
References:
Message-ID:

On Sat, 4 Jun 2022 22:13:32 GMT, Sandhya Viswanathan wrote:

> Currently the C2 JIT only supports float -> int and double -> long conversion for x86.
> This PR adds the support for following conversions in the c2 JIT:
> float -> long, short, byte
> double -> int, short, byte
>
> The performance gain is as follows.
> Before the patch:
> Benchmark                                       Mode   Cnt       Score        Error   Units
> VectorFPtoIntCastOperations.microDouble2Byte    thrpt    3   32367.971 ±  6161.118  ops/ms
> VectorFPtoIntCastOperations.microDouble2Int     thrpt    3   25825.251 ±  5417.104  ops/ms
> VectorFPtoIntCastOperations.microDouble2Long    thrpt    3   59641.958 ± 17307.177  ops/ms
> VectorFPtoIntCastOperations.microDouble2Short   thrpt    3   29641.505 ± 12023.015  ops/ms
> VectorFPtoIntCastOperations.microFloat2Byte     thrpt    3   16271.224 ±  1523.083  ops/ms
> VectorFPtoIntCastOperations.microFloat2Int      thrpt    3   59199.994 ± 14357.959  ops/ms
> VectorFPtoIntCastOperations.microFloat2Long     thrpt    3   17169.197 ±  1738.273  ops/ms
> VectorFPtoIntCastOperations.microFloat2Short    thrpt    3   14934.139 ±  2329.253  ops/ms
>
> After the patch:
> Benchmark                                       Mode   Cnt       Score        Error   Units
> VectorFPtoIntCastOperations.microDouble2Byte    thrpt    3  115436.659 ± 21282.364  ops/ms
> VectorFPtoIntCastOperations.microDouble2Int     thrpt    3   87194.395 ±  9443.106  ops/ms
> VectorFPtoIntCastOperations.microDouble2Long    thrpt    3   59652.356 ±  7240.721  ops/ms
> VectorFPtoIntCastOperations.microDouble2Short   thrpt    3  110570.719 ± 10401.620  ops/ms
> VectorFPtoIntCastOperations.microFloat2Byte     thrpt    3  110028.539 ± 11113.137  ops/ms
> VectorFPtoIntCastOperations.microFloat2Int      thrpt    3   59469.193 ± 18272.495  ops/ms
> VectorFPtoIntCastOperations.microFloat2Long     thrpt    3   59897.101 ±  7249.268  ops/ms
> VectorFPtoIntCastOperations.microFloat2Short    thrpt    3   86167.554 ±  8253.232  ops/ms
>
> Please review.
>
> Best Regards,
> Sandhya

src/hotspot/cpu/x86/x86.ad line 1871:

> 1869:         break;
> 1870:       case Op_VectorCastD2X:
> 1871:         if (((UseAVX <= 2) || (!VM_Version::supports_avx512vl())) &&

Which asm instructions require avx512vl? I don't see asserts in `assembler_x86.cpp`

-------------

PR: https://git.openjdk.java.net/jdk/pull/9032

From kvn at openjdk.java.net Mon Jun 6 20:36:01 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 6 Jun 2022 20:36:01 GMT
Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86
In-Reply-To:
References:
Message-ID:

On Mon, 6 Jun 2022 14:32:43 GMT, Sandhya Viswanathan wrote:

>> src/hotspot/cpu/x86/x86.ad line 1889:
>>
>>> 1887:       return false;
>>> 1888:     }
>>> 1889:     if ((bt == T_LONG) && !VM_Version::supports_avx512dq()) {
>>
>> Again overlapping conditions. So T_LONG requires both: AVX512, avx512vl and avx512dq?
>>
>> What about T_INT?
>
> T_INT doesn't need AVX512dq. Float to long conversion (T_LONG) uses evcvttps2qq, which needs AVX512dq.

Okay. I see that there are 2 instructions to support F2I, using either AVX or EVEX encoding. They cover all cases. Now you are introducing sub_integer and long types only for the EVEX encoding. You need a comment saying that F2I is supported in all cases. For the other integral types you need 512vl and, additionally, 512dq for T_LONG.
Note, you don't need to check (UseAVX <= 2) because the avx512vl bit is cleared in that case. It is the same for the VectorCastD2X code.
In such case I suggest:

    if (is_subword_type(bt) && !VM_Version::supports_avx512vl() ||
        (bt == T_LONG) && !VM_Version::supports_avx512vldq()) {
      return false;
    }

-------------

PR: https://git.openjdk.java.net/jdk/pull/9032

From kvn at openjdk.java.net Mon Jun 6 20:36:03 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 6 Jun 2022 20:36:03 GMT
Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86
In-Reply-To:
References:
Message-ID:

On Sun, 5 Jun 2022 01:41:02 GMT, Vladimir Kozlov wrote:

>> Currently the C2 JIT only supports float -> int and double -> long conversion for x86.
>> This PR adds the support for following conversions in the c2 JIT:
>> float -> long, short, byte
>> double -> int, short, byte
>>
>> The performance gain is as follows.
>> Before the patch:
>> Benchmark                                       Mode   Cnt       Score        Error   Units
>> VectorFPtoIntCastOperations.microDouble2Byte    thrpt    3   32367.971 ±  6161.118  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Int     thrpt    3   25825.251 ±  5417.104  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Long    thrpt    3   59641.958 ± 17307.177  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Short   thrpt    3   29641.505 ± 12023.015  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Byte     thrpt    3   16271.224 ±  1523.083  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Int      thrpt    3   59199.994 ± 14357.959  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Long     thrpt    3   17169.197 ±  1738.273  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Short    thrpt    3   14934.139 ±  2329.253  ops/ms
>>
>> After the patch:
>> Benchmark                                       Mode   Cnt       Score        Error   Units
>> VectorFPtoIntCastOperations.microDouble2Byte    thrpt    3  115436.659 ± 21282.364  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Int     thrpt    3   87194.395 ±  9443.106  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Long    thrpt    3   59652.356 ±  7240.721  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Short   thrpt    3  110570.719 ±
10401.620 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ? 11113.137 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ? 18272.495 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ? 7249.268 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ? 8253.232 ops/ms >> >> Please review. >> >> Best Regards, >> Sandhya > > src/hotspot/cpu/x86/x86.ad line 7298: > >> 7296: predicate(((VM_Version::supports_avx512vl() || >> 7297: Matcher::vector_length_in_bytes(n) == 64)) && >> 7298: is_integral_type(Matcher::vector_element_basic_type(n))); > > Do we need some of these conditions since you have them already in `match_rule_supported_vector()`? The predicate is not correct for all types this instruction is used now: it says that if size is 64 bytes you don't need avx512vl support for all types. Is it true? All this is very confusing. I suggest to keep original `castFtoI_reg_evex()` instruction as it was and use new `castFtoX_reg_evex()` only for T_LONG and sub_integer with new predicate `(type != T_INT)` and additional conditions if needed. ------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From duke at openjdk.java.net Mon Jun 6 20:39:23 2022 From: duke at openjdk.java.net (Cesar Soares) Date: Mon, 6 Jun 2022 20:39:23 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries [v3] In-Reply-To: References: Message-ID: On Wed, 1 Jun 2022 08:55:16 GMT, Yuta Sato wrote: >> @JohnTortugo >> Thank you for your advice !! >> I added the name of the file to the warning message. > > After I consider it, it might be better not to add the name of the file to this warning. > If I look up code again, `Disassembler::load_library` checks all patterns of hsdis library > like I commented here (https://github.com/openjdk/jdk/pull/8782#issuecomment-1132489576). 
> Because of this, this warning message should be for telling that
> "you failed to load all patterns of hsdis library".
> So I reverted my last commit.

NIT: "Failed to load hsdis library." or "Loading hsdis library failed."

-------------

PR: https://git.openjdk.java.net/jdk/pull/8782

From xliu at openjdk.java.net Mon Jun 6 20:42:22 2022
From: xliu at openjdk.java.net (Xin Liu)
Date: Mon, 6 Jun 2022 20:42:22 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v12]
In-Reply-To:
References:
Message-ID:

> I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps.
>
> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement.
>
> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137.
>
> Before:
>
> Benchmark               Mode  Cnt   Score   Error  Units
> MyBenchmark.testMethod  avgt   10  32.776 ± 0.075  us/op
>
> After:
>
> Benchmark               Mode  Cnt   Score   Error  Units
> MyBenchmark.testMethod  avgt   10   3.656 ± 0.006  us/op
>
> Testing
> I ran tier1 test with and without `-XX:+DeoptimizeALot`.
No regression has been found yet. Xin Liu has updated the pull request incrementally with one additional commit since the last revision: monior change for code style. ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8545/files - new: https://git.openjdk.java.net/jdk/pull/8545/files/9c917371..81a8ccf9 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=11 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=10-11 Stats: 15 lines in 3 files changed: 3 ins; 4 del; 8 mod Patch: https://git.openjdk.java.net/jdk/pull/8545.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8545/head:pull/8545 PR: https://git.openjdk.java.net/jdk/pull/8545 From kvn at openjdk.java.net Mon Jun 6 20:42:24 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 6 Jun 2022 20:42:24 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v11] In-Reply-To: <9LEaNeG2c7dOaFkKn63VjFWt9N_T0wD90hUNt7e3M2E=.734048e7-03b1-4c53-a75c-db8bdc947656@github.com> References: <9LEaNeG2c7dOaFkKn63VjFWt9N_T0wD90hUNt7e3M2E=.734048e7-03b1-4c53-a75c-db8bdc947656@github.com> Message-ID: On Sun, 5 Jun 2022 22:59:13 GMT, Xin Liu wrote: >> I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. >> >> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. 
>> I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement.
>>
>> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137.
>>
>> Before:
>>
>> Benchmark               Mode  Cnt   Score   Error  Units
>> MyBenchmark.testMethod  avgt   10  32.776 ± 0.075  us/op
>>
>> After:
>>
>> Benchmark               Mode  Cnt   Score   Error  Units
>> MyBenchmark.testMethod  avgt   10   3.656 ± 0.006  us/op
>>
>> Testing
>> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet.
>
> Xin Liu has updated the pull request incrementally with one additional commit since the last revision:
>
>   Bail out if fold-compares sees that a unstable_if trap has modified.
>
>   Also add a regression test

The update is good. I agree with avoiding folding if the unc trap info was modified.
I have a few comments.

src/hotspot/share/opto/compile.cpp line 1911:

> 1909:
> 1910: void Compile::preprocess_unstable_if_traps() {
> 1911: #ifndef PRODUCT

You can use `#ifndef PRODUCT` around all this code and use the `PRODUCT_RETURN` macro in the header file: `void preprocess_unstable_if_traps() PRODUCT_RETURN;` to get the same effect.

src/hotspot/share/opto/compile.cpp line 1952:

> 1950:     const MethodLivenessResult& live_locals = method->liveness_at_bci(next_bci);
> 1951:     assert(live_locals.is_valid(), "broken liveness info");
> 1952:     bool changed = false;

Rename `changed` -> `modified`

src/hotspot/share/opto/compile.cpp line 1961:

> 1959:         uint idx = jvms->locoff() + i;
> 1960: #ifndef PRODUCT
> 1961:         if (Verbose) {

`Verbose` is a debug (develop) flag. Use `#ifdef ASSERT` here.

src/hotspot/share/opto/compile.hpp line 811:

> 809:     _dead_node_count = 0;
> 810:   }
> 811:   void record_unstable_if(UnstableIfTrap* trap);

You missed renaming the method to record_unstable_if_trap().
The placement of the declaration is strange. Can you move it next to the other `unstable_if_trap` method declarations?

-------------

PR: https://git.openjdk.java.net/jdk/pull/8545

From xliu at openjdk.java.net Mon Jun 6 20:42:24 2022
From: xliu at openjdk.java.net (Xin Liu)
Date: Mon, 6 Jun 2022 20:42:24 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v10]
In-Reply-To: <9o8fXgQUo5J0LvKlWkLq-xmR16XInT_xWCV8ruauD30=.4a6ad1af-ad96-4d34-aca1-4bb68cc96782@github.com>
References: <9o8fXgQUo5J0LvKlWkLq-xmR16XInT_xWCV8ruauD30=.4a6ad1af-ad96-4d34-aca1-4bb68cc96782@github.com>
Message-ID:

On Sat, 4 Jun 2022 16:17:19 GMT, Vladimir Kozlov wrote:

>> 2 tests failed so far. I put information into RFE.
>
>> 2 tests failed so far. I put information into RFE.
>
> No other new failures in my tier1-7 testing. I think after you address found issue it will be ready to integrate (after second review by other Reviewer). But I would suggest to push it into JDK 20 after 19 is forked in one week to get more testing before release.

Hi @vnkozlov,

I filed a new issue, JDK-8287840 ([PR](https://github.com/openjdk/jdk/pull/9035)), for the new regression. It will allow some IfNodes to carry out fold-compares in the 1st IterGVN.

Back to this PR: for safety, I added a mechanism to guarantee that fold-compares gives up if its dominating trap has been modified. This is a guardrail and should not happen frequently. I also added a regression test to cover fold-compare cases. For `testEnumValues`, C2 would prefer fold-compares if we merged JDK-8287840.

> But I would suggest to push it into JDK 20 after 19 is forked in one week to get more testing before release.

Sure, no problem.
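
For readers unfamiliar with the fold-compares optimization discussed in this message, the underlying idea can be sketched at the source level: two signed comparisons that form a range check are merged into a single unsigned comparison. The Java below is only an illustration of that equivalence, assuming `lo <= hi`; the class and method names are hypothetical, and it does not model the uncommon-trap bookkeeping this PR is concerned with.

```java
// Source-level illustration of merging two signed compares into one unsigned
// compare. Not C2 code; names are made up for this sketch.
final class FoldComparesSketch {
    // The range check as a programmer would write it: two signed compares.
    static boolean inRangeTwoCompares(int x, int lo, int hi) {
        return lo <= x && x < hi;
    }

    // The merged form: one unsigned compare, equivalent whenever lo <= hi.
    static boolean inRangeOneCompare(int x, int lo, int hi) {
        return Integer.compareUnsigned(x - lo, hi - lo) < 0;
    }

    public static void main(String[] args) {
        int lo = 3;
        int hi = 10;
        for (int x : new int[] {Integer.MIN_VALUE, -1, 2, 3, 9, 10, Integer.MAX_VALUE}) {
            System.out.println(x + ": " + inRangeTwoCompares(x, lo, hi)
                + " / " + inRangeOneCompare(x, lo, hi));
        }
    }
}
```

Values below `lo` wrap around to large unsigned numbers after the subtraction, which is why a single comparison is enough.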
thanks, --lx ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From xliu at openjdk.java.net Mon Jun 6 20:42:25 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Mon, 6 Jun 2022 20:42:25 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v11] In-Reply-To: References: <9LEaNeG2c7dOaFkKn63VjFWt9N_T0wD90hUNt7e3M2E=.734048e7-03b1-4c53-a75c-db8bdc947656@github.com> Message-ID: On Mon, 6 Jun 2022 17:08:46 GMT, Vladimir Kozlov wrote: >> Xin Liu has updated the pull request incrementally with one additional commit since the last revision: >> >> Bail out if fold-compares sees that a unstable_if trap has modified. >> >> Also add a regression test > > src/hotspot/share/opto/compile.hpp line 811: > >> 809: _dead_node_count = 0; >> 810: } >> 811: void record_unstable_if(UnstableIfTrap* trap); > > You missed to rename method to record_unstable_if_trap(). > The placement of declaration is strange. Can you move to other `unstable_if_trap` declared methods? yes, my fault. I rename it and group them. ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From kvn at openjdk.java.net Mon Jun 6 20:43:49 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 6 Jun 2022 20:43:49 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v7] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: On Mon, 6 Jun 2022 13:29:30 GMT, Fei Gao wrote: >> After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. 
We can also support conversions between different data sizes like: >> int <-> double >> float <-> long >> int <-> long >> float <-> double >> >> A typical test case: >> >> int[] a; >> double[] b; >> for (int i = start; i < limit; i++) { >> b[i] = (double) a[i]; >> } >> >> Our expected OptoAssembly code for one iteration is like below: >> >> add R12, R2, R11, LShiftL #2 >> vector_load V16,[R12, #16] >> vectorcast_i2d V16, V16 # convert I to D vector >> add R11, R1, R11, LShiftL #3 # ptr >> add R13, R11, #16 # ptr >> vector_store [R13], V16 >> >> To enable the vectorization, the patch solves the following problems in the SLP. >> >> There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain >> and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with new type. >> >> After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. >> >> In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. 
Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 are just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's potential to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed a s a pair as well. >> >> Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use(). >> >> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes. >> >> Here is the test data (-XX:+UseSuperWord) on NEON: >> >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 216.431 ? 0.131 ns/op >> convertD2I 523 avgt 15 220.522 ? 0.311 ns/op >> convertF2D 523 avgt 15 217.034 ? 0.292 ns/op >> convertF2L 523 avgt 15 231.634 ? 1.881 ns/op >> convertI2D 523 avgt 15 229.538 ? 0.095 ns/op >> convertI2L 523 avgt 15 214.822 ? 0.131 ns/op >> convertL2F 523 avgt 15 230.188 ? 0.217 ns/op >> convertL2I 523 avgt 15 162.234 ? 0.235 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 124.352 ? 1.079 ns/op >> convertD2I 523 avgt 15 557.388 ? 8.166 ns/op >> convertF2D 523 avgt 15 118.082 ? 4.026 ns/op >> convertF2L 523 avgt 15 225.810 ? 
11.180 ns/op >> convertI2D 523 avgt 15 166.247 ? 0.120 ns/op >> convertI2L 523 avgt 15 119.699 ? 2.925 ns/op >> convertL2F 523 avgt 15 220.847 ? 0.053 ns/op >> convertL2I 523 avgt 15 122.339 ? 2.738 ns/op >> >> perf data on X86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 279.466 ? 0.069 ns/op >> convertD2I 523 avgt 15 551.009 ? 7.459 ns/op >> convertF2D 523 avgt 15 276.066 ? 0.117 ns/op >> convertF2L 523 avgt 15 545.108 ? 5.697 ns/op >> convertI2D 523 avgt 15 745.303 ? 0.185 ns/op >> convertI2L 523 avgt 15 260.878 ? 0.044 ns/op >> convertL2F 523 avgt 15 502.016 ? 0.172 ns/op >> convertL2I 523 avgt 15 261.654 ? 3.326 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 106.975 ? 0.045 ns/op >> convertD2I 523 avgt 15 546.866 ? 9.287 ns/op >> convertF2D 523 avgt 15 82.414 ? 0.340 ns/op >> convertF2L 523 avgt 15 542.235 ? 2.785 ns/op >> convertI2D 523 avgt 15 92.966 ? 1.400 ns/op >> convertI2L 523 avgt 15 79.960 ? 0.528 ns/op >> convertL2F 523 avgt 15 504.712 ? 4.794 ns/op >> convertL2I 523 avgt 15 129.753 ? 0.094 ns/op >> >> perf data on AVX512: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 282.984 ? 4.022 ns/op >> convertD2I 523 avgt 15 543.080 ? 3.873 ns/op >> convertF2D 523 avgt 15 273.950 ? 0.131 ns/op >> convertF2L 523 avgt 15 539.568 ? 2.747 ns/op >> convertI2D 523 avgt 15 745.238 ? 0.069 ns/op >> convertI2L 523 avgt 15 260.935 ? 0.169 ns/op >> convertL2F 523 avgt 15 501.870 ? 0.359 ns/op >> convertL2I 523 avgt 15 257.508 ? 0.174 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 76.687 ? 0.530 ns/op >> convertD2I 523 avgt 15 545.408 ? 4.657 ns/op >> convertF2D 523 avgt 15 273.935 ? 0.099 ns/op >> convertF2L 523 avgt 15 540.534 ? 3.032 ns/op >> convertI2D 523 avgt 15 745.234 ? 0.053 ns/op >> convertI2L 523 avgt 15 260.865 ? 
0.104 ns/op >> convertL2F 523 avgt 15 63.834 ? 4.777 ns/op >> convertL2I 523 avgt 15 48.183 ? 0.990 ns/op > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: > > - Add assertion line for opcode() and withdraw some common code as a function > > Change-Id: I7b5dbe60fec6979de454f347d074e6fc01126dfe > - Merge branch 'master' into fg8283091 > > Change-Id: I42bec08da55e86fb1f049bb691138f3fcf6dbed6 > - Implement an interface for auto-vectorization to consult supported match rules > > Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701 > - Merge branch 'master' into fg8283091 > > Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd > - Merge branch 'master' into fg8283091 > > Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f > - Merge branch 'master' into fg8283091 > > Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 > - Add micro-benchmark cases > > Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8 > - Merge branch 'master' into fg8283091 > > Change-Id: I674581135fd0844accc65520574fcef161eededa > - 8283091: Support type conversion between different data sizes in SLP > > After JDK-8275317, C2's SLP vectorizer has supported type conversion > between the same data size. We can also support conversions between > different data sizes like: > int <-> double > float <-> long > int <-> long > float <-> double > > A typical test case: > > int[] a; > double[] b; > for (int i = start; i < limit; i++) { > b[i] = (double) a[i]; > } > > Our expected OptoAssembly code for one iteration is like below: > > add R12, R2, R11, LShiftL #2 > vector_load V16,[R12, #16] > vectorcast_i2d V16, V16 # convert I to D vector > add R11, R1, R11, LShiftL #3 # ptr > add R13, R11, #16 # ptr > vector_store [R13], V16 > > To enable the vectorization, the patch solves the following problems > in the SLP. > > There are three main operations in the case above, LoadI, ConvI2D and > StoreD. 
Assuming that the vector length is 128 bits, how many scalar > nodes should be packed together to a vector? If we decide it > separately for each operation node, like what we did before the patch > in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI > or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes > in a vector node sequence, like loading 4 elements to a vector, then > typecasting 2 elements and lastly storing these 2 elements, they become > invalid. As a result, we should look through the whole def-use chain > and then pick up the minimum of these element sizes, like function > SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. > In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then > generate valid vector node sequence, like loading 2 elements, > converting the 2 elements to another type and storing the 2 elements > with new type. > > After this, LoadI nodes don't make full use of the whole vector and > only occupy part of it. So we adapt the code in > SuperWord::get_vw_bytes_special() to the situation. > > In SLP, we calculate a kind of alignment as position trace for each > scalar node in the whole vector. In this case, the alignments for 2 > LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. > Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which > mark that this node is the second node in the whole vector, while the > difference between 4 and 8 are just because of their own data sizes. In > this situation, we should try to remove the impact caused by different > data size in SLP. For example, in the stage of > SuperWord::extend_packlist(), while determining if it's potential to > pack a pair of def nodes in the function SuperWord::follow_use_defs(), > we remove the side effect of different data size by transforming the > target alignment from the use node. 
Because we believe that, assuming > that the vector length is 512 bits, if the ConvI2D use nodes have > alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, > these two LoadI nodes should be packed as a pair as well. > > Similarly, when determining if the vectorization is profitable, type > conversion between different data size takes a type of one size and > produces a type of another size, hence the special checks on alignment > and size should be applied, like what we do in SuperWord::is_vector_use. > > After solving these problems, we successfully implemented the > vectorization of type conversion between different data sizes. > > Here is the test data on NEON: > > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 216.431 ? 0.131 ns/op > VectorLoop.convertD2I 523 avgt 15 220.522 ? 0.311 ns/op > VectorLoop.convertF2D 523 avgt 15 217.034 ? 0.292 ns/op > VectorLoop.convertF2L 523 avgt 15 231.634 ? 1.881 ns/op > VectorLoop.convertI2D 523 avgt 15 229.538 ? 0.095 ns/op > VectorLoop.convertI2L 523 avgt 15 214.822 ? 0.131 ns/op > VectorLoop.convertL2F 523 avgt 15 230.188 ? 0.217 ns/op > VectorLoop.convertL2I 523 avgt 15 162.234 ? 0.235 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 124.352 ? 1.079 ns/op > VectorLoop.convertD2I 523 avgt 15 557.388 ? 8.166 ns/op > VectorLoop.convertF2D 523 avgt 15 118.082 ? 4.026 ns/op > VectorLoop.convertF2L 523 avgt 15 225.810 ? 11.180 ns/op > VectorLoop.convertI2D 523 avgt 15 166.247 ? 0.120 ns/op > VectorLoop.convertI2L 523 avgt 15 119.699 ? 2.925 ns/op > VectorLoop.convertL2F 523 avgt 15 220.847 ? 0.053 ns/op > VectorLoop.convertL2I 523 avgt 15 122.339 ? 2.738 ns/op > > perf data on X86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 279.466 ? 0.069 ns/op > VectorLoop.convertD2I 523 avgt 15 551.009 ? 
7.459 ns/op > VectorLoop.convertF2D 523 avgt 15 276.066 ? 0.117 ns/op > VectorLoop.convertF2L 523 avgt 15 545.108 ? 5.697 ns/op > VectorLoop.convertI2D 523 avgt 15 745.303 ? 0.185 ns/op > VectorLoop.convertI2L 523 avgt 15 260.878 ? 0.044 ns/op > VectorLoop.convertL2F 523 avgt 15 502.016 ? 0.172 ns/op > VectorLoop.convertL2I 523 avgt 15 261.654 ? 3.326 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 106.975 ? 0.045 ns/op > VectorLoop.convertD2I 523 avgt 15 546.866 ? 9.287 ns/op > VectorLoop.convertF2D 523 avgt 15 82.414 ? 0.340 ns/op > VectorLoop.convertF2L 523 avgt 15 542.235 ? 2.785 ns/op > VectorLoop.convertI2D 523 avgt 15 92.966 ? 1.400 ns/op > VectorLoop.convertI2L 523 avgt 15 79.960 ? 0.528 ns/op > VectorLoop.convertL2F 523 avgt 15 504.712 ? 4.794 ns/op > VectorLoop.convertL2I 523 avgt 15 129.753 ? 0.094 ns/op > > perf data on AVX512: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 282.984 ? 4.022 ns/op > VectorLoop.convertD2I 523 avgt 15 543.080 ? 3.873 ns/op > VectorLoop.convertF2D 523 avgt 15 273.950 ? 0.131 ns/op > VectorLoop.convertF2L 523 avgt 15 539.568 ? 2.747 ns/op > VectorLoop.convertI2D 523 avgt 15 745.238 ? 0.069 ns/op > VectorLoop.convertI2L 523 avgt 15 260.935 ? 0.169 ns/op > VectorLoop.convertL2F 523 avgt 15 501.870 ? 0.359 ns/op > VectorLoop.convertL2I 523 avgt 15 257.508 ? 0.174 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 76.687 ? 0.530 ns/op > VectorLoop.convertD2I 523 avgt 15 545.408 ? 4.657 ns/op > VectorLoop.convertF2D 523 avgt 15 273.935 ? 0.099 ns/op > VectorLoop.convertF2L 523 avgt 15 540.534 ? 3.032 ns/op > VectorLoop.convertI2D 523 avgt 15 745.234 ? 0.053 ns/op > VectorLoop.convertI2L 523 avgt 15 260.865 ? 0.104 ns/op > VectorLoop.convertL2F 523 avgt 15 63.834 ? 4.777 ns/op > VectorLoop.convertL2I 523 avgt 15 48.183 ? 
0.990 ns/op
>
> Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef

Latest changes look good. I have one comment about a method name.

These changes are related to #8877, which also limits vectors in the auto-vectorizer. #8877 has priority since it fixed that issue.
Also, the JDK 19 fork is coming in 3 days: https://openjdk.java.net/projects/jdk/19/
I would advise you to push these changes into JDK 20 after the fork.

I will start my testing and do approval based on results.

But only webrev was affected so it is fine.

src/hotspot/share/opto/matcher.hpp line 330:

> 328:   // Identify extra cases that we might want to vectorize automatically
> 329:   // And exclude cases which are not profitable to auto-vectorize.
> 330:   static const bool match_rule_supported_vectorization(int opcode, int vlen, BasicType bt);

Maybe we should rename it to `match_rule_supported_superword`. That is what we call the auto-vectorizer in C2.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7806

From kvn at openjdk.java.net Mon Jun 6 20:43:50 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 6 Jun 2022 20:43:50 GMT
Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v7]
In-Reply-To: <7J_YrNCpwrbWXqXRpjdlLjosOlh1DlL06FytAxdR-E8=.20049f03-f580-41b4-96ea-a50f8d4f23fd@github.com>
References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <2pr7XCDhYeN9HZbLbn2P99IcEkfh6T5nZdg3ho-jFxI=.3cb07cf2-b557-4f07-94bb-d7ca18044931@github.com> <7J_YrNCpwrbWXqXRpjdlLjosOlh1DlL06FytAxdR-E8=.20049f03-f580-41b4-96ea-a50f8d4f23fd@github.com>
Message-ID:

On Mon, 6 Jun 2022 13:59:28 GMT, Fei Gao wrote:

> > @fg1417 Please do not rebase or force-push to an active PR as it invalidates existing review comments. All changes will be squashed into a single commit automatically when integrating. See [OpenJDK Developers’ Guide](https://openjdk.java.net/guide/#working-with-pull-requests) for more information.
>
> May I ask if I did anything wrong?
I just rebased the master, resolved conflict and pushed a new commit as it guides... and did not do any force-push... Why I got the notification this time? Which `git` instruction you used? It is recommended to use `git merge master` in the PR's branch after `master` branch update (**8. Merge the latest changes**): Avoid rebasing changes, and prefer merging instead. ------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From kvn at openjdk.java.net Mon Jun 6 20:55:13 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 6 Jun 2022 20:55:13 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v4] In-Reply-To: References: Message-ID: On Thu, 2 Jun 2022 17:49:04 GMT, Sandhya Viswanathan wrote: >> We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2. >> The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%. >> The performance regression is due to auto-vectorization of small loops. >> We don?t have AVX3Threshold consideration in auto-vectorization. >> The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors. >> >> This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line. >> >> Please review. >> >> Best Regard, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Review comment resolution jbb2015 are only left in queue for performance testing. It may take time and I don't expect much variations in them. Testing also include `MaxVectorSize=32` to compare with current changes. It shows slightly (1-3%) better results in some `Crypto-AESBench_decrypt/encrypt` sub-benchmarks but it could be due to variations we observed in them. 
On the other hand, `SuperWordMaxVectorSize=32` shows better results in some Renaissance sub-benchmarks - actually it keeps scores similar to the current code, while `MaxVectorSize=32` gives a regression in them. Based on this I agree with the current changes vs setting `MaxVectorSize=32`. Both changes give a 4-5% improvement to `SPECjvm2008-MPEG`. But I also observed a 2.7% regression in `SPECjvm2008-SOR.small` with ParallelGC, for both types of changes. ------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From sviswanathan at openjdk.java.net Mon Jun 6 21:22:04 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Mon, 6 Jun 2022 21:22:04 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v2] In-Reply-To: References: Message-ID: > Currently the C2 JIT only supports float -> int and double -> long conversion for x86. > This PR adds the support for following conversions in the c2 JIT: > float -> long, short, byte > double -> int, short, byte > > The performance gain is as follows. > Before the patch: > Benchmark Mode Cnt Score Error Units > VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 32367.971 ± 6161.118 ops/ms > VectorFPtoIntCastOperations.microDouble2Int thrpt 3 25825.251 ± 5417.104 ops/ms > VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59641.958 ± 17307.177 ops/ms > VectorFPtoIntCastOperations.microDouble2Short thrpt 3 29641.505 ± 12023.015 ops/ms > VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 16271.224 ± 1523.083 ops/ms > VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59199.994 ± 14357.959 ops/ms > VectorFPtoIntCastOperations.microFloat2Long thrpt 3 17169.197 ± 1738.273 ops/ms > VectorFPtoIntCastOperations.microFloat2Short thrpt 3 14934.139 ± 2329.253 ops/ms > > After the patch: > Benchmark Mode Cnt Score Error Units > VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 115436.659 ± 21282.364 ops/ms > VectorFPtoIntCastOperations.microDouble2Int thrpt 3 87194.395 ±
9443.106 ops/ms > VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59652.356 ± 7240.721 ops/ms > VectorFPtoIntCastOperations.microDouble2Short thrpt 3 110570.719 ± 10401.620 ops/ms > VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ± 11113.137 ops/ms > VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ± 18272.495 ops/ms > VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ± 7249.268 ops/ms > VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ± 8253.232 ops/ms > > Please review. > > Best Regards, > Sandhya Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: Review comments resolution and cleanup ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/9032/files - new: https://git.openjdk.java.net/jdk/pull/9032/files/aa033e60..68130150 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=9032&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=9032&range=00-01 Stats: 85 lines in 3 files changed: 33 ins; 36 del; 16 mod Patch: https://git.openjdk.java.net/jdk/pull/9032.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9032/head:pull/9032 PR: https://git.openjdk.java.net/jdk/pull/9032 From sviswanathan at openjdk.java.net Mon Jun 6 21:22:07 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Mon, 6 Jun 2022 21:22:07 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v2] In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 19:04:56 GMT, Vladimir Kozlov wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> Review comments resolution and cleanup > > src/hotspot/cpu/x86/x86.ad line 1871: > >> 1869: break; >> 1870: case Op_VectorCastD2X: >> 1871: if (((UseAVX <= 2) || (!VM_Version::supports_avx512vl())) && > > Which asm instructions require avx512vl?
I don't see asserts in `assembler_x86.cpp` avx512vl support is needed only for vectors < 512 bit. I have corrected this in the predicate. ------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From sviswanathan at openjdk.java.net Mon Jun 6 21:24:52 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Mon, 6 Jun 2022 21:24:52 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v2] In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 18:54:33 GMT, Vladimir Kozlov wrote: >> src/hotspot/cpu/x86/x86.ad line 7298: >> >>> 7296: predicate(((VM_Version::supports_avx512vl() || >>> 7297: Matcher::vector_length_in_bytes(n) == 64)) && >>> 7298: is_integral_type(Matcher::vector_element_basic_type(n))); >> >> Do we need some of these conditions since you have them already in `match_rule_supported_vector()`? > > The predicate is not correct for all types this instruction is used now: it says that if size is 64 bytes you don't need avx512vl support for all types. Is it true? > > All this is very confusing. I suggest to keep original `castFtoI_reg_evex()` instruction as it was and use new `castFtoX_reg_evex()` only for T_LONG and sub_integer with new predicate `(type != T_INT)` and additional conditions if needed. Yes it was needed to select between the rules. On platforms that don't support avx512vl, we use AVX512 instructions only for 512-bit vectors and AVX instructions for < 64 byte vectors. ------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From sviswanathan at openjdk.java.net Mon Jun 6 21:32:46 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Mon, 6 Jun 2022 21:32:46 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v3] In-Reply-To: References: Message-ID: > Currently the C2 JIT only supports float -> int and double -> long conversion for x86. 
> This PR adds the support for following conversions in the c2 JIT: > float -> long, short, byte > double -> int, short, byte > > The performance gain is as follows. > Before the patch: > Benchmark Mode Cnt Score Error Units > VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 32367.971 ± 6161.118 ops/ms > VectorFPtoIntCastOperations.microDouble2Int thrpt 3 25825.251 ± 5417.104 ops/ms > VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59641.958 ± 17307.177 ops/ms > VectorFPtoIntCastOperations.microDouble2Short thrpt 3 29641.505 ± 12023.015 ops/ms > VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 16271.224 ± 1523.083 ops/ms > VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59199.994 ± 14357.959 ops/ms > VectorFPtoIntCastOperations.microFloat2Long thrpt 3 17169.197 ± 1738.273 ops/ms > VectorFPtoIntCastOperations.microFloat2Short thrpt 3 14934.139 ± 2329.253 ops/ms > > After the patch: > Benchmark Mode Cnt Score Error Units > VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 115436.659 ± 21282.364 ops/ms > VectorFPtoIntCastOperations.microDouble2Int thrpt 3 87194.395 ± 9443.106 ops/ms > VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59652.356 ± 7240.721 ops/ms > VectorFPtoIntCastOperations.microDouble2Short thrpt 3 110570.719 ± 10401.620 ops/ms > VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ± 11113.137 ops/ms > VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ± 18272.495 ops/ms > VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ± 7249.268 ops/ms > VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ± 8253.232 ops/ms > > Please review.
> > Best Regards, > Sandhya Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: cleanup predicate for f2x ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/9032/files - new: https://git.openjdk.java.net/jdk/pull/9032/files/68130150..ebf49d80 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=9032&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=9032&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/9032.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9032/head:pull/9032 PR: https://git.openjdk.java.net/jdk/pull/9032 From sviswanathan at openjdk.java.net Mon Jun 6 21:32:47 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Mon, 6 Jun 2022 21:32:47 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v3] In-Reply-To: References: Message-ID: On Sun, 5 Jun 2022 01:44:34 GMT, Vladimir Kozlov wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> cleanup predicate for f2x > > I assume it is support for "vector conversion". > > Please, add IR framework test. @vnkozlov I have implemented your review comments. The only item remaining is to add IR framework test. ------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From kvn at openjdk.java.net Mon Jun 6 22:19:11 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 6 Jun 2022 22:19:11 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v3] In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 21:32:46 GMT, Sandhya Viswanathan wrote: >> Currently the C2 JIT only supports float -> int and double -> long conversion for x86. 
>> This PR adds the support for following conversions in the c2 JIT: >> float -> long, short, byte >> double -> int, short, byte >> >> The performance gain is as follows. >> Before the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 32367.971 ± 6161.118 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 25825.251 ± 5417.104 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59641.958 ± 17307.177 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 29641.505 ± 12023.015 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 16271.224 ± 1523.083 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59199.994 ± 14357.959 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 17169.197 ± 1738.273 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 14934.139 ± 2329.253 ops/ms >> >> After the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 115436.659 ± 21282.364 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 87194.395 ± 9443.106 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59652.356 ± 7240.721 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 110570.719 ± 10401.620 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ± 11113.137 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ± 18272.495 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ± 7249.268 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ± 8253.232 ops/ms >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > cleanup predicate for f2x Looks good. Will wait IR test before testing and approval.
------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From sviswanathan at openjdk.java.net Mon Jun 6 23:24:15 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Mon, 6 Jun 2022 23:24:15 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v4] In-Reply-To: References: Message-ID: > Currently the C2 JIT only supports float -> int and double -> long conversion for x86. > This PR adds the support for following conversions in the c2 JIT: > float -> long, short, byte > double -> int, short, byte > > The performance gain is as follows. > Before the patch: > Benchmark Mode Cnt Score Error Units > VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 32367.971 ± 6161.118 ops/ms > VectorFPtoIntCastOperations.microDouble2Int thrpt 3 25825.251 ± 5417.104 ops/ms > VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59641.958 ± 17307.177 ops/ms > VectorFPtoIntCastOperations.microDouble2Short thrpt 3 29641.505 ± 12023.015 ops/ms > VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 16271.224 ± 1523.083 ops/ms > VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59199.994 ± 14357.959 ops/ms > VectorFPtoIntCastOperations.microFloat2Long thrpt 3 17169.197 ± 1738.273 ops/ms > VectorFPtoIntCastOperations.microFloat2Short thrpt 3 14934.139 ± 2329.253 ops/ms > > After the patch: > Benchmark Mode Cnt Score Error Units > VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 115436.659 ± 21282.364 ops/ms > VectorFPtoIntCastOperations.microDouble2Int thrpt 3 87194.395 ± 9443.106 ops/ms > VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59652.356 ± 7240.721 ops/ms > VectorFPtoIntCastOperations.microDouble2Short thrpt 3 110570.719 ± 10401.620 ops/ms > VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ± 11113.137 ops/ms > VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ± 18272.495 ops/ms > VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ±
7249.268 ops/ms > VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ± 8253.232 ops/ms > > Please review. > > Best Regards, > Sandhya Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: Add IR framework test case ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/9032/files - new: https://git.openjdk.java.net/jdk/pull/9032/files/ebf49d80..d44fca95 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=9032&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=9032&range=02-03 Stats: 223 lines in 1 file changed: 223 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/9032.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9032/head:pull/9032 PR: https://git.openjdk.java.net/jdk/pull/9032 From sviswanathan at openjdk.java.net Mon Jun 6 23:24:17 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Mon, 6 Jun 2022 23:24:17 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v3] In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 22:15:44 GMT, Vladimir Kozlov wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> cleanup predicate for f2x > > Looks good. Will wait IR test before testing and approval. @vnkozlov I have added the IR framework test case. Please take a look. ------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From sviswanathan at openjdk.java.net Mon Jun 6 23:27:23 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Mon, 6 Jun 2022 23:27:23 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v5] In-Reply-To: References: Message-ID: > Currently the C2 JIT only supports float -> int and double -> long conversion for x86.
> This PR adds the support for following conversions in the c2 JIT: > float -> long, short, byte > double -> int, short, byte > > The performance gain is as follows. > Before the patch: > Benchmark Mode Cnt Score Error Units > VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 32367.971 ± 6161.118 ops/ms > VectorFPtoIntCastOperations.microDouble2Int thrpt 3 25825.251 ± 5417.104 ops/ms > VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59641.958 ± 17307.177 ops/ms > VectorFPtoIntCastOperations.microDouble2Short thrpt 3 29641.505 ± 12023.015 ops/ms > VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 16271.224 ± 1523.083 ops/ms > VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59199.994 ± 14357.959 ops/ms > VectorFPtoIntCastOperations.microFloat2Long thrpt 3 17169.197 ± 1738.273 ops/ms > VectorFPtoIntCastOperations.microFloat2Short thrpt 3 14934.139 ± 2329.253 ops/ms > > After the patch: > Benchmark Mode Cnt Score Error Units > VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 115436.659 ± 21282.364 ops/ms > VectorFPtoIntCastOperations.microDouble2Int thrpt 3 87194.395 ± 9443.106 ops/ms > VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59652.356 ± 7240.721 ops/ms > VectorFPtoIntCastOperations.microDouble2Short thrpt 3 110570.719 ± 10401.620 ops/ms > VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ± 11113.137 ops/ms > VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ± 18272.495 ops/ms > VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ± 7249.268 ops/ms > VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ± 8253.232 ops/ms > > Please review.
> > Best Regards, > Sandhya Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: Fix extra space ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/9032/files - new: https://git.openjdk.java.net/jdk/pull/9032/files/d44fca95..996ee049 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=9032&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=9032&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/9032.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9032/head:pull/9032 PR: https://git.openjdk.java.net/jdk/pull/9032 From kvn at openjdk.java.net Mon Jun 6 23:49:13 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 6 Jun 2022 23:49:13 GMT Subject: RFR: 8287840: Dead copy region node blocks IfNode's fold-compares In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 01:15:35 GMT, Xin Liu wrote: > IfNode::fold_compares() requires ctrl has a single output. I found some fold-compares case postpone to IterGVN2. The reason is that a dead region prevents IfNode::fold_compares() from transforming code. The dead node is removed in IterGVN, but it's too late. > > This PR extends Node::has_special_unique_user() so `PhaseIterGVN::remove_globally_dead_node()` puts IfNode back to worklist. The following attempt will carry out fold-compares(). Testing results are good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/9035 From kvn at openjdk.java.net Mon Jun 6 23:55:06 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 6 Jun 2022 23:55:06 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v5] In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 23:27:23 GMT, Sandhya Viswanathan wrote: >> Currently the C2 JIT only supports float -> int and double -> long conversion for x86. 
>> This PR adds the support for following conversions in the c2 JIT: >> float -> long, short, byte >> double -> int, short, byte >> >> The performance gain is as follows. >> Before the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 32367.971 ± 6161.118 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 25825.251 ± 5417.104 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59641.958 ± 17307.177 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 29641.505 ± 12023.015 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 16271.224 ± 1523.083 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59199.994 ± 14357.959 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 17169.197 ± 1738.273 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 14934.139 ± 2329.253 ops/ms >> >> After the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 115436.659 ± 21282.364 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 87194.395 ± 9443.106 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59652.356 ± 7240.721 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 110570.719 ± 10401.620 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ± 11113.137 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ± 18272.495 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ± 7249.268 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ± 8253.232 ops/ms >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Fix extra space Good. I will start testing. ------------- Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/9032 From duke at openjdk.java.net Tue Jun 7 01:10:12 2022 From: duke at openjdk.java.net (Yuta Sato) Date: Tue, 7 Jun 2022 01:10:12 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries [v4] In-Reply-To: References: Message-ID: > When failing to load hsdis(Hot Spot Disassembler) library (because there is no library or hsdis.so is old and so on), > there is no warning message (only can see info level messages if put -Xlog:os=info). > This should show a warning message to tell the user that you failed to load libraries for hsdis. > So I put a warning message to notify this. > > e.g. > ` Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: change warning message ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8782/files - new: https://git.openjdk.java.net/jdk/pull/8782/files/0627c96c..58724253 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8782&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8782&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8782.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8782/head:pull/8782 PR: https://git.openjdk.java.net/jdk/pull/8782 From duke at openjdk.java.net Tue Jun 7 01:14:11 2022 From: duke at openjdk.java.net (Yuta Sato) Date: Tue, 7 Jun 2022 01:14:11 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries [v4] In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 20:08:44 GMT, Cesar Soares wrote: >> After I consider it, it might be better not to add the name of the file to this warning. >> If I look up code again, `Disassembler::load_library` checks all patterns of hsdis library >> like I commented here (https://github.com/openjdk/jdk/pull/8782#issuecomment-1132489576). 
>> Because of this, this warning message should be for telling that >> "you failed to load all patterns of hsdis library". >> So I reverted my last commit. > > NIT: "Failed to load hsdis library." or "Loading hsdis library failed." Thank you for your advice. I changed the warning message to "Loading hsdis library failed". ------------- PR: https://git.openjdk.java.net/jdk/pull/8782 From xgong at openjdk.java.net Tue Jun 7 01:24:05 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Tue, 7 Jun 2022 01:24:05 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 09:42:02 GMT, Xiaohong Gong wrote: > VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. > > For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. > > Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. 
> > Here is an example for vector load and add reduction inside a loop: > > ptrue p0.s, vl8 ; mask generation > ld1w {z16.s}, p0/z, [x14] ; load vector > > ptrue p0.s, vl8 ; mask generation > uaddv d17, p0, z16.s ; add reduction > smov x14, v17.s[0] > > As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. > > Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. > > Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: > > Benchmark size Gain > Byte256Vector.ADDLanes 1024 0.999 > Byte256Vector.ANDLanes 1024 1.065 > Byte256Vector.MAXLanes 1024 1.064 > Byte256Vector.MINLanes 1024 1.062 > Byte256Vector.ORLanes 1024 1.072 > Byte256Vector.XORLanes 1024 1.041 > Short256Vector.ADDLanes 1024 1.017 > Short256Vector.ANDLanes 1024 1.044 > Short256Vector.MAXLanes 1024 1.049 > Short256Vector.MINLanes 1024 1.049 > Short256Vector.ORLanes 1024 1.089 > Short256Vector.XORLanes 1024 1.047 > Int256Vector.ADDLanes 1024 1.045 > Int256Vector.ANDLanes 1024 1.078 > Int256Vector.MAXLanes 1024 1.123 > Int256Vector.MINLanes 1024 1.129 > Int256Vector.ORLanes 1024 1.078 > Int256Vector.XORLanes 1024 1.072 > Long256Vector.ADDLanes 1024 1.059 > Long256Vector.ANDLanes 1024 1.101 > Long256Vector.MAXLanes 1024 1.079 > Long256Vector.MINLanes 1024 1.099 > Long256Vector.ORLanes 1024 1.098 > Long256Vector.XORLanes 1024 1.110 > Float256Vector.ADDLanes 1024 1.033 > Float256Vector.MAXLanes 1024 1.156 > Float256Vector.MINLanes 1024 1.151 > Double256Vector.ADDLanes 1024 1.062 > Double256Vector.MAXLanes 1024 1.145 > Double256Vector.MINLanes 1024 1.140 > > This patch also adds 32-bit 
variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: > > sxtw x14, w14 > whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > Sure. It's fine for me to wait for JDK 20. Thanks a lot for the advice! ------------- PR: https://git.openjdk.java.net/jdk/pull/9037 From xgong at openjdk.java.net Tue Jun 7 02:26:13 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Tue, 7 Jun 2022 02:26:13 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5] In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 10:40:45 GMT, Jatin Bhateja wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: >> >> - Merge branch 'jdk:master' into JDK-8283667 >> - Use integer constant for offsetInRange all the way through >> - Rename "use_predicate" to "needs_predicate" >> - Rename the "usePred" to "offsetInRange" >> - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature > > test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java line 97: > >> 95: public void byteLoadArrayMaskIOOBE() { >> 96: for (int i = 0; i < inSize; i += bspecies.length()) { >> 97: VectorMask<Byte> mask = VectorMask.fromArray(bspecies, m, i); > > For other case "if (offset >= 0 && offset <= (a.length - species.length())) )" we are anyways intrinsifying, should we limit this micro to work only for newly optimized case. Yeah, thanks, and it's really a good suggestion to limit this benchmark to only the IOOBE cases. I locally modified the tests to make sure only the IOOBE case happens and the results look good as well. But do you think it's better to keep it as it is, since we can also see the performance of the common cases to make sure no regressions happen?
As the current benchmarks can also show the performance gain by this PR. ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From kvn at openjdk.java.net Tue Jun 7 03:07:07 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 7 Jun 2022 03:07:07 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v5] In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 23:27:23 GMT, Sandhya Viswanathan wrote: >> Currently the C2 JIT only supports float -> int and double -> long conversion for x86. >> This PR adds the support for following conversions in the c2 JIT: >> float -> long, short, byte >> double -> int, short, byte >> >> The performance gain is as follows. >> Before the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 32367.971 ± 6161.118 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 25825.251 ± 5417.104 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59641.958 ± 17307.177 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 29641.505 ± 12023.015 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 16271.224 ± 1523.083 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59199.994 ± 14357.959 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 17169.197 ± 1738.273 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 14934.139 ± 2329.253 ops/ms >> >> After the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 115436.659 ± 21282.364 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 87194.395 ± 9443.106 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59652.356 ± 7240.721 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 110570.719 ± 10401.620 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ±
11113.137 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ± 18272.495 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ± 7249.268 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ± 8253.232 ops/ms >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Fix extra space Results are good. You need a second review. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/9032 From xgong at openjdk.java.net Tue Jun 7 04:29:40 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Tue, 7 Jun 2022 04:29:40 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v6] In-Reply-To: References: Message-ID: > Currently the vector load with mask, when the given index falls outside the array boundary, is implemented with pure java scalar code to avoid the IOOBE (IndexOutOfBoundsException). This is necessary for architectures that do not support the predicate feature, because there the masked load is implemented with a full vector load and a vector blend applied on it, and a full vector load will definitely cause the IOOBE, which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise an exception. > > This patch adds the vectorization support for the masked load with IOOBE part.
Please see the original java implementation (FIXME: optimize): > > > @ForceInline > public static > ByteVector fromArray(VectorSpecies<Byte> species, > byte[] a, int offset, > VectorMask<Byte> m) { > ByteSpecies vsp = (ByteSpecies) species; > if (offset >= 0 && offset <= (a.length - species.length())) { > return vsp.dummyVector().fromArray0(a, offset, m); > } > > // FIXME: optimize > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > return vsp.vOp(m, i -> a[offset + i]); > } > > Since it can only be vectorized with the predicated load, hotspot must check whether the current backend supports it and fall back to the java scalar version if not. This is different from the normal masked vector load, where the compiler will generate a full vector load and a vector blend if the predicated load is not supported. So to let the compiler take the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part and "false" for the normal load. The compiler will fail to intrinsify if the flag is "true" and the predicated load is not supported by the backend, which means that the normal java path will be executed. > > Also adds the same vectorization support for masked: > - fromByteArray/fromByteBuffer > - fromBooleanArray > - fromCharArray > > The performance for the newly added benchmarks improves by about `1.88x ~ 30.26x` on the x86 AVX-512 system: > > Benchmark before After Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms > > Similar performance gain can also be observed on 512-bit SVE system.
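
[Editorial sketch, not part of the original message.] The semantics the description above relies on can be modelled in plain scalar Java: only lanes whose mask bit is set have to be in bounds, and unset lanes produce the default value zero. The class `MaskedLoadSketch` and method `maskedLoad` below are hypothetical names chosen for illustration; they are not Vector API or HotSpot code, and a `boolean[]` stands in for `VectorMask<Byte>`.

```java
import java.util.Arrays;

// Hypothetical scalar model of a masked load; illustrative names only.
final class MaskedLoadSketch {

    // Loads mask.length lanes starting at a[offset]. A lane is read only if
    // its mask bit is set, so an out-of-bounds index is an error only for
    // set lanes. Unset lanes keep the default value 0.
    static byte[] maskedLoad(byte[] a, int offset, boolean[] mask) {
        byte[] result = new byte[mask.length];
        for (int lane = 0; lane < mask.length; lane++) {
            if (!mask[lane]) {
                continue; // unset lane: no memory access, no bounds check
            }
            int idx = offset + lane;
            if (idx < 0 || idx >= a.length) {
                throw new IndexOutOfBoundsException(
                        "Index " + idx + " out of bounds for length " + a.length);
            }
            result[lane] = a[idx];
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4, 5};
        // The 4-lane window starting at offset 3 runs past the end of the
        // array, but the two lanes that would be out of bounds are unset.
        boolean[] mask = {true, true, false, false};
        System.out.println(Arrays.toString(maskedLoad(a, 3, mask)));
    }
}
```

A predicated hardware load gives the same per-lane behaviour in one instruction, whereas a full vector load followed by a blend would touch the out-of-bounds elements; that is the reason the description gives for requiring predicate support on this path.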
Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits:

 - Add constant OFFSET_IN_RANGE and OFFSET_OUT_OF_RANGE
 - Merge branch 'jdk:master' into JDK-8283667
 - Merge branch 'jdk:master' into JDK-8283667
 - Use integer constant for offsetInRange all the way through
 - Rename "use_predicate" to "needs_predicate"
 - Rename the "usePred" to "offsetInRange"
 - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

-------------

Changes: https://git.openjdk.java.net/jdk/pull/8035/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=05
  Stats: 453 lines in 44 files changed: 174 ins; 21 del; 258 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035

From xgong at openjdk.java.net Tue Jun 7 04:31:55 2022
From: xgong at openjdk.java.net (Xiaohong Gong)
Date: Tue, 7 Jun 2022 04:31:55 GMT
Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]
In-Reply-To:
References:
Message-ID:

On Mon, 6 Jun 2022 15:41:06 GMT, Paul Sandoz wrote:

> Looks good. As a follow-on PR I think it would be useful to add constants `OFFSET_IN_RANGE` and `OFFSET_OUT_OF_RANGE`, then it becomes much clearer in source and you can drop the `/* offsetInRange */` comment on the argument.

Hi @PaulSandoz, thanks for the advice! I've rebased the code and added these two constants in the latest patch. Thanks again for the review!
-------------

PR: https://git.openjdk.java.net/jdk/pull/8035

From duke at openjdk.java.net Tue Jun 7 04:39:08 2022
From: duke at openjdk.java.net (Yi-Fan Tsai)
Date: Tue, 7 Jun 2022 04:39:08 GMT
Subject: RFR: 8263377: Store method handle linkers in the 'non-nmethods' heap [v2]
In-Reply-To:
References:
Message-ID: <5PvYGuYNP9oMYb8F5RLWO7zRCihrxnIpZxLn7SantLc=.ad772b6f-03b7-4f17-9d30-50d06d575876@github.com>

> 8263377: Store method handle linkers in the 'non-nmethods' heap

Yi-Fan Tsai has updated the pull request incrementally with one additional commit since the last revision:

  Remove dead code

  remove unused argument of NativeJump::check_verified_entry_alignment
  remove unused argument of NativeJump::patch_verified_entry
  remove dead code in SharedRuntime::generate_method_handle_intrinsic_wrapper

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8760/files
  - new: https://git.openjdk.java.net/jdk/pull/8760/files/63771d64..00c99435

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8760&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8760&range=00-01

  Stats: 79 lines in 29 files changed: 3 ins; 33 del; 43 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8760.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8760/head:pull/8760

PR: https://git.openjdk.java.net/jdk/pull/8760

From xliu at openjdk.java.net Tue Jun 7 05:31:13 2022
From: xliu at openjdk.java.net (Xin Liu)
Date: Tue, 7 Jun 2022 05:31:13 GMT
Subject: RFR: 8287840: Dead copy region node blocks IfNode's fold-compares
In-Reply-To:
References:
Message-ID:

On Mon, 6 Jun 2022 23:45:41 GMT, Vladimir Kozlov wrote:

>> IfNode::fold_compares() requires ctrl has a single output. I found some fold-compares case postpone to IterGVN2. The reason is that a dead region prevents IfNode::fold_compares() from transforming code. The dead node is removed in IterGVN, but it's too late.
>> >> This PR extends Node::has_special_unique_user() so `PhaseIterGVN::remove_globally_dead_node()` puts IfNode back to worklist. The following attempt will carry out fold-compares(). > > Testing results are good. @vnkozlov , Thank you for reviewing this. I found this issue for this method: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/time/temporal/ChronoField.java#L687 Without this change, c2 processes the IfNode in IterGVN2 after CCP because `PhaseCCP::transform_once` put all IfNode to worklist. I think it is good idea to get one thing done in one pass of IterGVN. thanks, --lx ------------- PR: https://git.openjdk.java.net/jdk/pull/9035 From jbhateja at openjdk.java.net Tue Jun 7 06:46:15 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 7 Jun 2022 06:46:15 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5] In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 02:22:53 GMT, Xiaohong Gong wrote: >> test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java line 97: >> >>> 95: public void byteLoadArrayMaskIOOBE() { >>> 96: for (int i = 0; i < inSize; i += bspecies.length()) { >>> 97: VectorMask mask = VectorMask.fromArray(bspecies, m, i); >> >> For other case "if (offset >= 0 && offset <= (a.length - species.length())) )" we are anyways intrinsifying, should we limit this micro to work only for newly optimized case. > > Yeah, thanks and it's really a good suggestion to limit this benchmark only for the IOOBE cases. I locally modified the tests to make sure only the IOOBE case happens and the results show good as well. But do you think it's better to keep as it is since we can also see the performance of the common cases to make sure no regressions happen? As the current benchmarks can also show the performance gain by this PR. It was just to remove the noise from a targeted micro benchmark. But we can keep it as it is. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From xgong at openjdk.java.net Tue Jun 7 06:46:16 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Tue, 7 Jun 2022 06:46:16 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5] In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 06:41:36 GMT, Jatin Bhateja wrote: >> Yeah, thanks and it's really a good suggestion to limit this benchmark only for the IOOBE cases. I locally modified the tests to make sure only the IOOBE case happens and the results show good as well. But do you think it's better to keep as it is since we can also see the performance of the common cases to make sure no regressions happen? As the current benchmarks can also show the performance gain by this PR. > > It was just to remove the noise from a targeted micro benchmark. But we can keep it as it is. OK, thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From xgong at openjdk.java.net Tue Jun 7 07:42:22 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Tue, 7 Jun 2022 07:42:22 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: Message-ID: On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Rename "use_predicate" to "needs_predicate" > > Yes, the tests were run in debug mode. 
The reporting of the missing constant occurs for the compiled method that is called from the method where the constants are declared, e.g.:
>
>     719 240 b jdk.incubator.vector.Int256Vector::fromArray0 (15 bytes)
>     ** Rejected vector op (LoadVectorMasked,int,8) because architecture does not support it
>     ** missing constant: offsetInRange=Parm
>     @ 11 jdk.incubator.vector.IntVector::fromArray0Template (22 bytes) force inline by annotation
>
>
> So it appears to be working as expected. A similar pattern occurs at a lower level for the passing of the mask class. `Int256Vector::fromArray0` passes a constant class to `IntVector::fromArray0Template` (the compilation of which bails out before checking that the `offsetInRange` is constant).

Thanks for the review @PaulSandoz @sviswa7 @jatin-bhateja @merykitty!

-------------

PR: https://git.openjdk.java.net/jdk/pull/8035

From jbhateja at openjdk.java.net Tue Jun 7 07:42:20 2022
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Tue, 7 Jun 2022 07:42:20 GMT
Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v6]
In-Reply-To:
References:
Message-ID:

On Tue, 7 Jun 2022 04:29:40 GMT, Xiaohong Gong wrote:

>> Currently the vector load with mask when the given index happens out of the array boundary is implemented with pure java scalar code to avoid the IOOBE (IndexOutOfBoundsException). This is necessary for architectures that do not support the predicate feature. Because the masked load is implemented with a full vector load and a vector blend applied on it. And a full vector load will definitely cause the IOOBE which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise exception.
>> >> This patch adds the vectorization support for the masked load with IOOBE part. Please see the original java implementation (FIXME: optimize): >> >> >> @ForceInline >> public static >> ByteVector fromArray(VectorSpecies species, >> byte[] a, int offset, >> VectorMask m) { >> ByteSpecies vsp = (ByteSpecies) species; >> if (offset >= 0 && offset <= (a.length - species.length())) { >> return vsp.dummyVector().fromArray0(a, offset, m); >> } >> >> // FIXME: optimize >> checkMaskFromIndexSize(offset, vsp, m, 1, a.length); >> return vsp.vOp(m, i -> a[offset + i]); >> } >> >> Since it can only be vectorized with the predicate load, the hotspot must check whether the current backend supports it and falls back to the java scalar version if not. This is different from the normal masked vector load that the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler make the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part while "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that normal java path will be executed. 
>>
>> Also adds the same vectorization support for masked:
>> - fromByteArray/fromByteBuffer
>> - fromBooleanArray
>> - fromCharArray
>>
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` on the x86 AVX-512 system:
>>
>> Benchmark                                           before     After      Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE     737.542    1387.069   ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE   118.366    330.776    ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE    233.832    6125.026   ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE      233.816    7075.923   ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE     119.771    330.587    ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE    431.961    939.301    ops/ms
>>
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits:
>
>  - Add constant OFFSET_IN_RANGE and OFFSET_OUT_OF_RANGE
>  - Merge branch 'jdk:master' into JDK-8283667
>  - Merge branch 'jdk:master' into JDK-8283667
>  - Use integer constant for offsetInRange all the way through
>  - Rename "use_predicate" to "needs_predicate"
>  - Rename the "usePred" to "offsetInRange"
>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

LGTM.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8035

From xgong at openjdk.java.net Tue Jun 7 07:45:23 2022
From: xgong at openjdk.java.net (Xiaohong Gong)
Date: Tue, 7 Jun 2022 07:45:23 GMT
Subject: Integrated: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature
In-Reply-To:
References:
Message-ID:

On Wed, 30 Mar 2022 10:31:59 GMT, Xiaohong Gong wrote:

> Currently the vector load with mask when the given index happens out of the array boundary is implemented with pure java scalar code to avoid the IOOBE (IndexOutOfBoundsException).
This is necessary for architectures that do not support the predicate feature. Because the masked load is implemented with a full vector load and a vector blend applied on it. And a full vector load will definitely cause the IOOBE which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise exception. > > This patch adds the vectorization support for the masked load with IOOBE part. Please see the original java implementation (FIXME: optimize): > > > @ForceInline > public static > ByteVector fromArray(VectorSpecies species, > byte[] a, int offset, > VectorMask m) { > ByteSpecies vsp = (ByteSpecies) species; > if (offset >= 0 && offset <= (a.length - species.length())) { > return vsp.dummyVector().fromArray0(a, offset, m); > } > > // FIXME: optimize > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > return vsp.vOp(m, i -> a[offset + i]); > } > > Since it can only be vectorized with the predicate load, the hotspot must check whether the current backend supports it and falls back to the java scalar version if not. This is different from the normal masked vector load that the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler make the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part while "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that normal java path will be executed. 
> > Also adds the same vectorization support for masked: > - fromByteArray/fromByteBuffer > - fromBooleanArray > - fromCharArray > > The performance for the new added benchmarks improve about `1.88x ~ 30.26x` on the x86 AVX-512 system: > > Benchmark before After Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms > > Similar performance gain can also be observed on 512-bit SVE system. This pull request has now been integrated. Changeset: 39fa52b5 Author: Xiaohong Gong URL: https://git.openjdk.java.net/jdk/commit/39fa52b5f7504eca7399b863b0fb934bdce37f7e Stats: 453 lines in 44 files changed: 174 ins; 21 del; 258 mod 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature Reviewed-by: sviswanathan, psandoz ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From thartmann at openjdk.java.net Tue Jun 7 08:10:07 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 7 Jun 2022 08:10:07 GMT Subject: RFR: 8286967: Unproblemlist compiler/c2/irTests/TestSkeletonPredicates.java and add additional test for JDK-8286638 In-Reply-To: References: Message-ID: <_cdTL_xBYAx967IVXjqfpA9tJ5HrBjzduwTStmSu6N0=.c58fc767-7106-40ab-bc13-ecbb17f76ba2@github.com> On Fri, 20 May 2022 09:47:57 GMT, Christian Hagedorn wrote: > [JDK-8286361](https://bugs.openjdk.java.net/browse/JDK-8286361) could be traced back to the same underlying problem as in [JDK-8286638](https://bugs.openjdk.java.net/browse/JDK-8286638). Pulling in the change fixed the problem. 
>
> This patch unproblemlists the previously failing test and adds a new test for JDK-8286638 (extracted from compiler/c2/irTests/TestSkeletonPredicates.java) that I've used for analyzing JDK-8286361.
>
> Testing with latest JDK:
> - hs-tier1-4 flags for the new test
> - hs-tier7+8 flags for compiler/c2/irTests/TestSkeletonPredicates.java
>
> Thanks,
> Christian

Looks good.

-------------

Marked as reviewed by thartmann (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8806

From thartmann at openjdk.java.net Tue Jun 7 08:17:07 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Tue, 7 Jun 2022 08:17:07 GMT
Subject: RFR: 8284404: Too aggressive sweeping with Loom
In-Reply-To:
References:
Message-ID:

On Thu, 12 May 2022 07:30:39 GMT, Erik Österlund wrote:

> The normal sweeping heuristics trigger sweeping whenever 0.5% of the reserved code cache could have died. Normally that is fine, but with Loom such sweeping requires a full GC cycle, as stacks can now be in the Java heap as well. In that context, 0.5% does seem to be a bit too trigger happy. So this patch raises that default by 10x when using Loom.
> If you run something like Jython, which spins up a lot of code, it unsurprisingly triggers far fewer GCs due to code cache pressure.

Looks good to me. A comment explaining the justification wouldn't hurt.

-------------

Marked as reviewed by thartmann (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8673

From thartmann at openjdk.java.net Tue Jun 7 08:33:06 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Tue, 7 Jun 2022 08:33:06 GMT
Subject: RFR: 8286940: [IR Framework] Allow IR tests to build and use Whitebox without -DSkipWhiteBoxInstall=true
In-Reply-To:
References:
Message-ID:

On Wed, 25 May 2022 08:17:17 GMT, Christian Hagedorn wrote:

> Currently, the IR framework always tries to install the Whitebox by moving the Whitebox class file to the JTreg class path.
However, when a test already builds the Whitebox and uses it as part of the test, we cannot access it on certain platforms. On Windows, for example, we'll get the following exception: > > Caused by: java.nio.file.FileSystemException: sun\hotspot\WhiteBox.class: The process cannot access the file because it is being used by another process > > To mitigate this problem, one can specify `-DSkipWhiteBoxInstall=true` which was already done in [JDK-8283187](https://bugs.openjdk.java.net/browse/JDK-8283187). But this is not a good solution as the user should not need to worry about the inner workings of the IR framework. > > I propose to get rid of this flag by reworking the Whitebox installation process. > > Thanks, > Christian Looks good. test/hotspot/jtreg/compiler/lib/ir_framework/TestFramework.java line 345: > 343: > 344: /** > 345: * Try to load the Whitebox class with a user directory custom class loader. If the user has already built the Suggestion: * Try to load the Whitebox class from the user directory with a custom class loader. If the user has already built the ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8879 From chagedorn at openjdk.java.net Tue Jun 7 08:35:14 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 7 Jun 2022 08:35:14 GMT Subject: RFR: 8286967: Unproblemlist compiler/c2/irTests/TestSkeletonPredicates.java and add additional test for JDK-8286638 In-Reply-To: References: Message-ID: On Fri, 20 May 2022 09:47:57 GMT, Christian Hagedorn wrote: > [JDK-8286361](https://bugs.openjdk.java.net/browse/JDK-8286361) could be traced back to the same underlying problem as in [JDK-8286638](https://bugs.openjdk.java.net/browse/JDK-8286638). Pulling in the change fixed the problem. > > This patch unproblemlists the previously failing test and adds a new test for JDK-8286638 (extracted from compiler/c2/irTests/TestSkeletonPredicates.java) that I've used for analyzing JDK-8286361. 
> > Testing with latest JDK: > - hs-tier1-4 flags for the new test > - hs-tier7+8 flags for compiler/c2/irTests/TestSkeletonPredicates.java > > Thanks, > Christian Thanks Vladimir and Tobias for your reviews! ------------- PR: https://git.openjdk.java.net/jdk/pull/8806 From chagedorn at openjdk.java.net Tue Jun 7 08:37:31 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 7 Jun 2022 08:37:31 GMT Subject: Integrated: 8286967: Unproblemlist compiler/c2/irTests/TestSkeletonPredicates.java and add additional test for JDK-8286638 In-Reply-To: References: Message-ID: <8JR375j64DfdmhY7RcQx2K19SJXzHiYd03hR1ddqxqw=.3f9c4f23-c299-4083-b911-4956edd875c0@github.com> On Fri, 20 May 2022 09:47:57 GMT, Christian Hagedorn wrote: > [JDK-8286361](https://bugs.openjdk.java.net/browse/JDK-8286361) could be traced back to the same underlying problem as in [JDK-8286638](https://bugs.openjdk.java.net/browse/JDK-8286638). Pulling in the change fixed the problem. > > This patch unproblemlists the previously failing test and adds a new test for JDK-8286638 (extracted from compiler/c2/irTests/TestSkeletonPredicates.java) that I've used for analyzing JDK-8286361. > > Testing with latest JDK: > - hs-tier1-4 flags for the new test > - hs-tier7+8 flags for compiler/c2/irTests/TestSkeletonPredicates.java > > Thanks, > Christian This pull request has now been integrated. 
Changeset: dbf0905f Author: Christian Hagedorn URL: https://git.openjdk.java.net/jdk/commit/dbf0905ff4ad6c831095278fc47c3a6354fe3bc1 Stats: 79 lines in 2 files changed: 77 ins; 2 del; 0 mod 8286967: Unproblemlist compiler/c2/irTests/TestSkeletonPredicates.java and add additional test for JDK-8286638 Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8806 From chagedorn at openjdk.java.net Tue Jun 7 08:41:14 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 7 Jun 2022 08:41:14 GMT Subject: RFR: 8286940: [IR Framework] Allow IR tests to build and use Whitebox without -DSkipWhiteBoxInstall=true [v2] In-Reply-To: References: Message-ID: > Currently, the IR framework always tries to install the Whitebox by moving the Whitebox class file to the JTreg class path. However, when a test already builds the Whitebox and uses it as part of the test, we cannot access it on certain platforms. On Windows, for example, we'll get the following exception: > > Caused by: java.nio.file.FileSystemException: sun\hotspot\WhiteBox.class: The process cannot access the file because it is being used by another process > > To mitigate this problem, one can specify `-DSkipWhiteBoxInstall=true` which was already done in [JDK-8283187](https://bugs.openjdk.java.net/browse/JDK-8283187). But this is not a good solution as the user should not need to worry about the inner workings of the IR framework. > > I propose to get rid of this flag by reworking the Whitebox installation process. 
> > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/lib/ir_framework/TestFramework.java Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8879/files - new: https://git.openjdk.java.net/jdk/pull/8879/files/0b0b7d37..8f2a7c70 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8879&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8879&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8879.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8879/head:pull/8879 PR: https://git.openjdk.java.net/jdk/pull/8879 From chagedorn at openjdk.java.net Tue Jun 7 08:41:15 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 7 Jun 2022 08:41:15 GMT Subject: RFR: 8286940: [IR Framework] Allow IR tests to build and use Whitebox without -DSkipWhiteBoxInstall=true In-Reply-To: References: Message-ID: On Wed, 25 May 2022 08:17:17 GMT, Christian Hagedorn wrote: > Currently, the IR framework always tries to install the Whitebox by moving the Whitebox class file to the JTreg class path. However, when a test already builds the Whitebox and uses it as part of the test, we cannot access it on certain platforms. On Windows, for example, we'll get the following exception: > > Caused by: java.nio.file.FileSystemException: sun\hotspot\WhiteBox.class: The process cannot access the file because it is being used by another process > > To mitigate this problem, one can specify `-DSkipWhiteBoxInstall=true` which was already done in [JDK-8283187](https://bugs.openjdk.java.net/browse/JDK-8283187). But this is not a good solution as the user should not need to worry about the inner workings of the IR framework. > > I propose to get rid of this flag by reworking the Whitebox installation process. 
> > Thanks, > Christian Thanks Tobias for your review! ------------- PR: https://git.openjdk.java.net/jdk/pull/8879 From chagedorn at openjdk.java.net Tue Jun 7 08:41:16 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 7 Jun 2022 08:41:16 GMT Subject: Integrated: 8286940: [IR Framework] Allow IR tests to build and use Whitebox without -DSkipWhiteBoxInstall=true In-Reply-To: References: Message-ID: On Wed, 25 May 2022 08:17:17 GMT, Christian Hagedorn wrote: > Currently, the IR framework always tries to install the Whitebox by moving the Whitebox class file to the JTreg class path. However, when a test already builds the Whitebox and uses it as part of the test, we cannot access it on certain platforms. On Windows, for example, we'll get the following exception: > > Caused by: java.nio.file.FileSystemException: sun\hotspot\WhiteBox.class: The process cannot access the file because it is being used by another process > > To mitigate this problem, one can specify `-DSkipWhiteBoxInstall=true` which was already done in [JDK-8283187](https://bugs.openjdk.java.net/browse/JDK-8283187). But this is not a good solution as the user should not need to worry about the inner workings of the IR framework. > > I propose to get rid of this flag by reworking the Whitebox installation process. > > Thanks, > Christian This pull request has now been integrated. 
Changeset: b647a125
Author: Christian Hagedorn
URL: https://git.openjdk.java.net/jdk/commit/b647a1259b543aaf7d9943fc21971b4125640376
Stats: 36 lines in 5 files changed: 28 ins; 1 del; 7 mod

8286940: [IR Framework] Allow IR tests to build and use Whitebox without -DSkipWhiteBoxInstall=true

Reviewed-by: kvn, thartmann

-------------

PR: https://git.openjdk.java.net/jdk/pull/8879

From thartmann at openjdk.java.net Tue Jun 7 09:24:10 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Tue, 7 Jun 2022 09:24:10 GMT
Subject: RFR: 8286197: C2: Optimize MemorySegment shape in int loop [v2]
In-Reply-To:
References:
Message-ID:

On Fri, 6 May 2022 09:18:27 GMT, Roland Westrelin wrote:

>> This is another small enhancement for a code shape that showed up in a
>> MemorySegment micro benchmark. The shape to optimize is the one from test1:
>>
>>
>> for (int i = 0; i < size; i++) {
>>     long j = i * UNSAFE.ARRAY_INT_INDEX_SCALE;
>>
>>     j = Objects.checkIndex(j, size * 4);
>>
>>     if (((base + j) & 3) != 0) {
>>         throw new RuntimeException();
>>     }
>>
>>     v += UNSAFE.getInt(base + j);
>> }
>>
>>
>> In that code shape, the loop iv is first scaled, the result is then cast
>> to long, range checked and finally the address of the memory location is
>> computed.
>>
>> The alignment check is transformed so the loop body has no check. In
>> order to eliminate the range check, that loop is transformed into:
>>
>>
>> for (int i1 = ..) {
>>     for (int i2 = ..) {
>>         long j = (i1 + i2) * UNSAFE.ARRAY_INT_INDEX_SCALE;
>>
>>         j = Objects.checkIndex(j, size * 4);
>>
>>         v += UNSAFE.getInt(base + j);
>>     }
>> }
>>
>>
>> The address shape is (AddP base (CastLL (ConvI2L (LShiftI (AddI ...
>>
>> In this case, the type of the ConvI2L is [min_jint, max_jint] and type
>> of CastLL is [0, max_jint] (the CastLL has a narrower type).
>>
>> I propose transforming (CastLL (ConvI2L into (ConvI2L (CastII in that
>> case. The ConvI2L and CastII types can be set to [0, max_jint].
The
>> new address shape is then:
>>
>> (AddP base (ConvI2L (CastII (LShiftI (AddI ...
>>
>> which optimizes well.
>>
>> (LShiftI (AddI ...
>> is transformed into
>> (AddI (LShiftI ...
>> because one of the AddI inputs is loop invariant (i2) and we have:
>>
>> (AddP base (ConvI2L (CastII (AddI (LShiftI ...
>>
>> Then because the ConvI2L and CastII types are [0, max_jint], the AddI
>> is pushed through the ConvI2L and CastII:
>>
>> (AddP base (AddL (ConvI2L (CastII (LShiftI ...
>>
>> base and one of the inputs of the AddL are loop invariant so this is
>> transformed into:
>>
>> (AddP (AddP ...) (ConvI2L (CastII (LShiftI ...
>>
>> The (AddP ...) is loop invariant so computed before entry. The
>> (ConvI2L ...) only depends on the loop iv.
>>
>> The resulting address is a shift + an add. The address before
>> transformation requires 2 adds + a shift. Also after unrolling, the
>> address of the second access in the loop is cheaper to compute as it
>> can be derived from the address of the first access.
>>
>> For all of this to work:
>> 1) I added a CastLL::Ideal transformation:
>> (CastLL (ConvI2L into (ConvI2L (CastII
>>
>> 2) I also had to prevent split-if from transforming (LShiftI (Phi for the
>> iv Phi of a counted loop.
>>
>>
>> test2 and test3 test 1) and 2) separately.
>
> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision:
>
>   review

Looks good to me but I'm wondering if we should delay this to JDK 20 as we are late for the JDK 19 release and had many issues with Cast/ConvNodes before.
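
To make the re-association in the quoted description concrete, here is a minimal Java sketch. It is not part of the patch and models only the integer arithmetic, not C2's node types. The names `AddressShapeSketch`, `addressBefore`, `addressAfter`, `SHIFT`, and the sample values are hypothetical, and the sketch assumes the scaled index stays within [0, max_jint], which is the condition the narrowed ConvI2L/CastII types express.

```java
class AddressShapeSketch {
    // Hypothetical stand-in for scaling by UNSAFE.ARRAY_INT_INDEX_SCALE (4 bytes per int).
    static final int SHIFT = 2;

    // Shape before: (AddP base (ConvI2L (LShiftI (AddI invariant iv)))).
    // Two adds and a shift are needed on every iteration.
    static long addressBefore(long base, int invariant, int iv) {
        return base + (long) ((invariant + iv) << SHIFT);
    }

    // Shape after: (AddP (AddP base ...) (ConvI2L (LShiftI iv))).
    // The first sum depends only on loop-invariant values and can be hoisted,
    // leaving one shift and one add per iteration.
    static long addressAfter(long base, int invariant, int iv) {
        long hoisted = base + (long) (invariant << SHIFT);
        return hoisted + (long) (iv << SHIFT);
    }

    public static void main(String[] args) {
        long base = 64L;
        int invariant = 1000;
        for (int iv = 0; iv < 8; iv++) {
            // Valid only because (invariant + iv) << SHIFT does not overflow an int here.
            if (addressBefore(base, invariant, iv) != addressAfter(base, invariant, iv)) {
                throw new AssertionError("shapes differ at iv=" + iv);
            }
        }
        System.out.println("address shapes agree");
    }
}
```

If the int arithmetic could overflow or go negative, the two forms would no longer be interchangeable, which is why the description ties the transformation to the [0, max_jint] type on the casts.
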
src/hotspot/share/opto/castnode.cpp line 374: > 372: #endif > 373: > 374: Node *CastLLNode::Ideal(PhaseGVN *phase, bool can_reshape) { Suggestion: Node* CastLLNode::Ideal(PhaseGVN* phase, bool can_reshape) { src/hotspot/share/opto/castnode.hpp line 121: > 119: } > 120: > 121: virtual Node *Ideal(PhaseGVN *phase, bool can_reshape); Suggestion: virtual Node* Ideal(PhaseGVN* phase, bool can_reshape); test/hotspot/jtreg/compiler/c2/irTests/TestConvI2LCastLongLoop.java line 34: > 32: /* > 33: * @test > 34: * @bug 8286197 Suggestion: * @bug 8286197 * @key randomness ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8555 From thartmann at openjdk.java.net Tue Jun 7 09:26:06 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 7 Jun 2022 09:26:06 GMT Subject: RFR: 8286990: Add compiler name to warning messages in Compiler Directive [v3] In-Reply-To: References: Message-ID: On Tue, 31 May 2022 03:22:46 GMT, Yuta Sato wrote: >> When using Compiler Directive such as `java -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile= ` , >> it shows totally the same message for c1 and c2 compiler and the user would be confused about >> which compiler is affected by this message. >> This should show messages with their compiler name so that the user knows which compiler shows this message. >> >> My change result would be like the below. 
>> >> >> OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output >> OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output >> >> -> >> >> OpenJDK 64-Bit Server VM warning: c1: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output >> OpenJDK 64-Bit Server VM warning: c2: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > > Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: > > add const to method Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8591 From thartmann at openjdk.java.net Tue Jun 7 09:49:12 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 7 Jun 2022 09:49:12 GMT Subject: RFR: 8286625: C2 fails with assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect In-Reply-To: <-ZfbcgBcRabQqlggf35uK2HTi-1MSnCCZBV1qwRrT8E=.ae2012fd-6e54-4a16-9070-85f08b74beb6@github.com> References: <-ZfbcgBcRabQqlggf35uK2HTi-1MSnCCZBV1qwRrT8E=.ae2012fd-6e54-4a16-9070-85f08b74beb6@github.com> Message-ID: On Thu, 2 Jun 2022 15:43:09 GMT, Roland Westrelin wrote: > It's another case where because of overunrolling, the main loop is > never executed but not optimized out and the type of some > CastII/ConvI2L for a range check conflicts with the type of its input > resulting in a broken graph for the main loop. > > This is supposed to have been solved by skeleton predicates. There's > indeed a predicate that should catch that the loop is unreachable but > it doesn't constant fold. 
The shape of the predicate is: > > (CmpUL (SubL 15 (ConvI2L (AddI (CastII int:>=1) 15) minint..maxint)) 16) > > I propose adding a CastII, that is in this case: > > (CmpUL (SubL 15 (ConvI2L (CastII (AddI (CastII int:>=1) 15) 0..max-1) minint..maxint)) 16) > > The justification for the CastII is that the skeleton predicate is a > predicate for a specific iteration of the loop. That iteration of the > loop must be in the range of the iv Phi. > > With the extra CastII, the AddI can be pushed through the CastII and > ConvI2L and the check constant folds. Actually, with the extra CastII, > the predicate is not implemented with a CmpUL but a CmpU because the > code can tell there's no risk of overflow (I did force the use of > CmpUL as an experiment and the CmpUL does constant fold) Looks reasonable to me. I submitted testing and will report back once it passed. test/hotspot/jtreg/compiler/loopopts/TestOverUnrolling2.java line 26: > 24: /* > 25: * @test > 26: * @bug 8286625 Suggestion: * @bug 8286625 * @key stress ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8996 From thartmann at openjdk.java.net Tue Jun 7 10:08:04 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 7 Jun 2022 10:08:04 GMT Subject: RFR: 8286451: C2: assert(nb == 1) failed: only when the head is not shared In-Reply-To: References: Message-ID: On Mon, 30 May 2022 13:50:08 GMT, Roland Westrelin wrote: > nb counts the number of loops that share a single head. The assert > that fires is in code that handles the case of a self loop (a loop > composed of a single block). There can be a self loop and multiple > loops that share a head: the assert makes little sense and I propose > to simply remove it. 
> > I think there's another issue with this code: in the case of a self > loop and multiple loops that share a head, the self loop can be any of > the loop for which the head is cloned not only the one that's passed > as argument to ciTypeFlow::clone_loop_head(). As a consequence, I > moved the logic for self loops in the loop that's applied to all loops > that share the loop head. Looks reasonable to me. I submitted testing and will report back once it passed. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8947 From thartmann at openjdk.java.net Tue Jun 7 10:30:06 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 7 Jun 2022 10:30:06 GMT Subject: RFR: 8287840: Dead copy region node blocks IfNode's fold-compares In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 01:15:35 GMT, Xin Liu wrote: > IfNode::fold_compares() requires ctrl has a single output. I found some fold-compares case postpone to IterGVN2. The reason is that a dead region prevents IfNode::fold_compares() from transforming code. The dead node is removed in IterGVN, but it's too late. > > This PR extends Node::has_special_unique_user() so `PhaseIterGVN::remove_globally_dead_node()` puts IfNode back to worklist. The following attempt will carry out fold-compares(). Looks good. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/9035 From thartmann at openjdk.java.net Tue Jun 7 11:06:22 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 7 Jun 2022 11:06:22 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v28] In-Reply-To: <6fvXf0Lpbpbs0fQXNQLimROBnrAUIfJrUoHv3Fd7AkE=.0375bdec-8c24-4f30-9ccc-17095ed973ec@github.com> References: <6fvXf0Lpbpbs0fQXNQLimROBnrAUIfJrUoHv3Fd7AkE=.0375bdec-8c24-4f30-9ccc-17095ed973ec@github.com> Message-ID: On Thu, 2 Jun 2022 14:48:36 GMT, Emanuel Peter wrote: >> **What this gives you for the debugger** >> - BFS traversal (inputs / outputs) >> - node filtering by category >> - shortest path between nodes >> - all paths between nodes >> - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional) >> - and more >> >> **Some usecases** >> - more readable `dump` >> - follow only nodes of some categories (only control, only data, etc) >> - find which control nodes depend on data node (visit data nodes, include control in boundary) >> - how two nodes relate (shortest / all paths, following input/output nodes, or both) >> - find loops (control / memory / data: call all paths with node as start and target) >> >> **Description** >> I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`). To find all paths between two nodes, include the letter `A` in the options string. 
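As a rough illustration of how such an options string can be read, the sketch below splits it into a visit set, a boundary set, and a few flags. It is not the parsing code from `node.cpp`: the class `BfsOptions` and its fields are hypothetical, and only the characters mentioned in this description (`cdmxo`, `CDMXO`, `+`, `-`, `A`) are handled.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, for illustration only; not a HotSpot class.
final class BfsOptions {
    final Set<Character> visit = new TreeSet<>();     // categories that are traversed
    final Set<Character> boundary = new TreeSet<>();  // categories shown but not traversed
    boolean inputs;    // '+' : follow input edges
    boolean outputs;   // '-' : follow output edges
    boolean allPaths;  // 'A' : all-paths query

    static BfsOptions parse(String options) {
        BfsOptions o = new BfsOptions();
        for (char ch : options.toCharArray()) {
            if ("cdmxo".indexOf(ch) >= 0) {
                o.visit.add(ch);
            } else if ("CDMXO".indexOf(ch) >= 0) {
                // Store boundary categories in lower case so both sets use the same keys.
                o.boundary.add(Character.toLowerCase(ch));
            } else if (ch == '+') {
                o.inputs = true;
            } else if (ch == '-') {
                o.outputs = true;
            } else if (ch == 'A') {
                o.allPaths = true;
            }
            // Any other character is ignored in this sketch.
        }
        return o;
    }

    public static void main(String[] args) {
        BfsOptions o = parse("dCDMOX-");
        System.out.println("visit=" + o.visit + " boundary=" + o.boundary
                + " inputs=" + o.inputs + " outputs=" + o.outputs);
    }
}
```

With `"dCDMOX-"`, the string used in one of the examples in this description, only data nodes are traversed, every category may appear on the boundary, and only output edges are followed.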
>> >> `void Node::dump_bfs(const int max_distance, Node* target, char const* options)` >> >> To get familiar with the many options, run this to get help: >> `find_node(0)->dump_bfs(0,0,"h")` >> >> While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use cases here. >> >> Please let me know if you would find this helpful, or if you have any feedback to improve it. >> Thanks, Emanuel >> >> PS: I do plan to refactor the `dump` code in `node.cpp` to use my new infrastructure. I will also remove `Node::related` and `dump_related`, since they have not been properly extended and maintained. But that refactoring would risk messing with tools that depend on `dump`, which I would like to avoid for now and defer to a second step. >> >> **Better dump()** >> The function is similar to `dump()`: we can also follow input / output edges up to a certain distance. There are a few improvements/added features: >> >> 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. >> 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`. >> 3. Node filtering with types taken from `type->category()` helps limit traversal to control, data or memory flow. The categories are the same as in IGV. >> 4. Separate visit / boundary filters by node type: traverse graph visiting only some node types (e.g. data). On the boundary, also display but do not traverse nodes allowed by boundary filter (e.g. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for visit filter, and `DCMXO` for boundary filter. >> 5. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals.
But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors. >> 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. Use `@` in options string. >> 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string. >> 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string! >> >> Example (BFS inputs): >> >> (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") >> d dump >> --------------------------------------------- >> 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] >> 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> >> >> Example (BFS control inputs): >> >> (rr) p find_node(163)->dump_bfs(5,0,"c+") >> d dump >> --------------------------------------------- >> 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] >> 4 166 
CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 1 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) >> 0 163 OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 >> >> We see the control flow of a strip mined loop. >> >> >> Experiment (BFS only data, but display all nodes on boundary) >> >> (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") >> d dump >> --------------------------------------------- >> 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) >> 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) >> 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) >> 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) >> 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] >> >> We see the dependent output nodes of the data-phi 102, we see that a SafePoint and the Return depend on it. Here colors are really helpful, as it makes it easy to separate the data-nodes (blue) from the boundary-nodes (other colors). 
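The visit / boundary distinction in the example above can be modelled outside HotSpot with a small toy graph. The sketch below is not the `Node::dump_bfs` implementation: `FilteredBfs`, `ToyNode` and `Category` are hypothetical names, only output edges are followed, and the non-data categories assigned to the `SafePoint`, `Return` and `OuterStripMinedLoopEnd` stand-ins are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: these types are hypothetical and are not HotSpot classes.
final class FilteredBfs {
    enum Category { CONTROL, DATA, MEMORY, MIXED, OTHER }

    static final class ToyNode {
        final int idx;
        final Category category;
        final List<ToyNode> outputs = new ArrayList<>();

        ToyNode(int idx, Category category) {
            this.idx = idx;
            this.category = category;
        }

        void to(ToyNode... outs) {
            outputs.addAll(Arrays.asList(outs));
        }
    }

    // Returns node idx -> BFS distance for every node that would be displayed.
    static Map<Integer, Integer> bfsOutputs(ToyNode start, int maxDistance,
                                            Set<Category> visit, Set<Category> boundary) {
        Map<Integer, Integer> distance = new LinkedHashMap<>();
        Deque<ToyNode> worklist = new ArrayDeque<>();
        distance.put(start.idx, 0);
        worklist.add(start);
        while (!worklist.isEmpty()) {
            ToyNode n = worklist.poll();
            int d = distance.get(n.idx);
            if (d == maxDistance) {
                continue; // do not expand past the distance limit
            }
            for (ToyNode out : n.outputs) {
                if (distance.containsKey(out.idx)) {
                    continue; // already recorded at a shorter or equal distance
                }
                if (visit.contains(out.category)) {
                    distance.put(out.idx, d + 1);
                    worklist.add(out); // visited: displayed and traversed further
                } else if (boundary.contains(out.category)) {
                    distance.put(out.idx, d + 1); // boundary: displayed, not traversed
                }
            }
        }
        return distance;
    }

    // Rebuilds the output edges of a few nodes from the dump above.
    static Map<Integer, Integer> demo() {
        ToyNode phi102 = new ToyNode(102, Category.DATA);
        ToyNode sub133 = new ToyNode(133, Category.DATA);
        ToyNode shl132 = new ToyNode(132, Category.DATA);
        ToyNode add136 = new ToyNode(136, Category.DATA);
        ToyNode phi153 = new ToyNode(153, Category.DATA);
        ToyNode safePoint167 = new ToyNode(167, Category.CONTROL); // category assumed
        ToyNode return154 = new ToyNode(154, Category.MIXED);      // category assumed
        ToyNode loopEnd163 = new ToyNode(163, Category.CONTROL);   // category assumed
        phi102.to(sub133, shl132);
        sub133.to(add136);
        shl132.to(sub133);
        add136.to(phi153, safePoint167, phi102);
        phi153.to(return154);
        safePoint167.to(loopEnd163);
        return bfsOutputs(phi102, 10, EnumSet.of(Category.DATA), EnumSet.allOf(Category.class));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // {102=0, 133=1, 132=1, 136=2, 153=3, 167=3, 154=4}
    }
}
```

Node 163 does not appear in the result: 167 is only a boundary node, so its outputs are never explored, which mirrors the dump, where traversal stops at the `SafePoint` and the `Return`.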
>> >> Example with Mach nodes: >> >> (rr) p find_node(112)->dump_bfs(2,0,"cdmxo+#@B") >> d [head idom d] old dump >> --------------------------------------------- >> 2 534 505 6 o1871 109 addI_rReg_imm === _ 44 [[ 110 102 113 230 327 ]] #-3/0xfffffffd >> 2 536 537 15 o186 139 addI_rReg_imm === _ 137 [[ 140 137 113 144 ]] #4/0x00000004 !jvms: StringLatin1::replace @ bci:13 (line 303) >> 2 537 538 14 o179 114 IfTrue === 115 [[ 536 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 536 537 15 o739 113 compI_rReg === _ 139 109 [[ 112 ]] >> 1 536 537 15 _ 536 Region === 536 114 [[ 536 112 ]] >> 0 536 537 15 o741 112 jmpLoopEnd === 536 113 [[ 134 111 ]] P=0.993611, C=7200.000000 !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> And the query on the old nodes: >> >> (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#") >> d dump >> --------------------------------------------- >> 2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]] >> 2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]] >> 2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000 >> 1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]] >> 1 o740 Bool === _ o739 [[ o741 ]] [lt] >> 1 o179 IfTrue === o178 [[ o741 ]] #1 >> 0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000 >> >> >> **Exploring loop body** >> When I find myself in a loop, I try to locate the loop head and end, and map out at least one path between them. >> `loop_end->dump_bfs(20, loop_head, "c+")` >> This provides us with a shortest control path, provided that path has a length of at most 20.
>> >> Example (shortest path over control nodes): >> >> (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+") >> d dump >> --------------------------------------------- >> 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> >> Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lie on a path between the start and target node whose length is at most the specified maximum.
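One way to picture an all-paths query is as two breadth-first searches: one from the start node along input edges, and one from the target node along the reversed edges. A node lies on a qualifying path when the two distances add up to at most the limit. The sketch below works on a plain adjacency map; it is an assumption about how such a query can be computed, not a description of the code in `node.cpp`, and the class `AllPathsDistance` and the trimmed toy graph (several inputs from the real dumps are left out) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only; not a HotSpot class.
final class AllPathsDistance {
    // Plain BFS: node -> shortest distance from 'from' along the given edges.
    static Map<Integer, Integer> bfs(Map<Integer, List<Integer>> edges, int from) {
        Map<Integer, Integer> dist = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        dist.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            int n = queue.poll();
            int d = dist.get(n);
            for (int next : edges.getOrDefault(n, Collections.<Integer>emptyList())) {
                if (dist.putIfAbsent(next, d + 1) == null) {
                    queue.add(next); // first time we reach 'next'
                }
            }
        }
        return dist;
    }

    // Turns "node -> its inputs" into "node -> its outputs".
    static Map<Integer, List<Integer>> reverse(Map<Integer, List<Integer>> edges) {
        Map<Integer, List<Integer>> rev = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> e : edges.entrySet()) {
            for (int input : e.getValue()) {
                rev.computeIfAbsent(input, k -> new ArrayList<Integer>()).add(e.getKey());
            }
        }
        return rev;
    }

    // node -> d_start + d_target, for nodes on some path of length <= maxLength.
    static Map<Integer, Integer> apd(Map<Integer, List<Integer>> inputs,
                                     int start, int target, int maxLength) {
        Map<Integer, Integer> dStart = bfs(inputs, start);
        Map<Integer, Integer> dTarget = bfs(reverse(inputs), target);
        Map<Integer, Integer> result = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : dStart.entrySet()) {
            Integer t = dTarget.get(e.getKey());
            if (t != null && e.getValue() + t <= maxLength) {
                result.put(e.getKey(), e.getValue() + t);
            }
        }
        return result;
    }

    // Trimmed-down input edges loosely following the StringLatin1::replace dumps.
    static Map<Integer, List<Integer>> toyInputs() {
        Map<Integer, List<Integer>> in = new HashMap<>();
        in.put(741, Arrays.asList(179, 740));
        in.put(179, Arrays.asList(178));
        in.put(740, Arrays.asList(739));
        in.put(178, Arrays.asList(149, 177));
        in.put(739, Arrays.asList(186));
        in.put(149, Arrays.asList(148));
        in.put(177, Arrays.asList(176));
        in.put(186, Arrays.asList(141));
        in.put(148, Arrays.asList(746, 147));
        in.put(176, Arrays.asList(166));
        in.put(141, Arrays.asList(746, 186));
        in.put(166, Arrays.asList(149));
        in.put(147, Arrays.asList(146));
        in.put(146, Arrays.asList(141));
        return in;
    }

    public static void main(String[] args) {
        System.out.println(apd(toyInputs(), 741, 746, 8));
    }
}
```

On this toy graph, the nodes of the control chain and of the loop-back condition get the value 5, while nodes such as 147 and 177 get 8, because their shortest route to 746 takes a detour.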
>> >> Example (all paths between two nodes): >> >> (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") >> d apd dump >> --------------------------------------------- >> 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) >> 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) >> 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) >> 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) >> 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) >> 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all paths distance `apd` is the sum of the shortest path from the current node to the start plus the shortest path to the target node. Thus, we can easily compute the distance to the target node with `apd - d`. >> >> An alternative way to detect loops quickly is to run an all paths query from a node to itself: >> >> Example (loop detection with all paths): >> >> (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A") >> d apd dump >> --------------------------------------------- >> 6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303) >> 5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> We get the loop control, plus the loop-back `190 IfTrue`. >> >> Example (loop detection with all paths for phi): >> >> (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A") >> d apd dump >> --------------------------------------------- >> 1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],...
!jvms: StringLatin1::replace @ bci:13 (line 303) >> 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) >> >> >> **Color examples** >> Colors are especially useful to see chains between nodes (options character `#`). >> The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. >> Tip: it can be worth it to configure the colors of your terminal to be more appealing. >> >> Example (find control dependency of data node): >> ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) >> We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. >> >> Example (find memory dependency of data node): >> ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) >> >> Example (loop detection): >> ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) >> We find the control and some data loop paths. > > Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: > > - missing style thing from last commit > - another one of Christian's reviews Looks great! 
src/hotspot/share/opto/node.cpp line 1980: > 1978: tty->print("Usage: node->dump_bfs(int max_distance, Node* target, char* options)\n"); > 1979: tty->print("\n"); > 1980: tty->print("Usecases:\n"); Suggestion: tty->print("Use cases:\n"); src/hotspot/share/opto/node.cpp line 2019: > 2017: tty->print("output columns:\n"); > 2018: tty->print(" d: BFS distance to this/start\n"); > 2019: tty->print(" adp: all paths distance (d_start + d_target)\n"); Suggestion: tty->print(" apd: all paths distance (d_start + d_target)\n"); src/hotspot/share/opto/node.cpp line 2020: > 2018: tty->print(" d: BFS distance to this/start\n"); > 2019: tty->print(" adp: all paths distance (d_start + d_target)\n"); > 2020: tty->print(" block: block block in which the node has been scheduled [head(), _idom->head(), _dom_depth]\n"); Suggestion: tty->print(" block: block in which the node has been scheduled [head(), _idom->head(), _dom_depth]\n"); src/hotspot/share/opto/node.cpp line 2042: > 2040: tty->print(" display old nodes and blocks, if they exist\n"); > 2041: tty->print(" useful call to start with\n"); > 2042: tty->print(" find_node(102)->dump_bfs(10,0,\"dCDMOX-\")\n"); Suggestion: tty->print(" find_node(102)->dump_bfs(10, 0, "dCDMOX-")\n"); src/hotspot/share/opto/node.cpp line 2044: > 2042: tty->print(" find_node(102)->dump_bfs(10,0,\"dCDMOX-\")\n"); > 2043: tty->print(" find non-data dependencies of a data node\n"); > 2044: tty->print(" follow data node outputs until find another category\n"); Suggestion: tty->print(" follow data node outputs until we find another category\n"); src/hotspot/share/opto/node.cpp line 2050: > 2048: tty->print(" will not find a path if it is longer than 10\n"); > 2049: tty->print(" useful to find how x and y are related\n"); > 2050: tty->print(" find_node(741)->dump_bfs(20,find_node(746),\"c+\")\n"); Suggestion: tty->print(" find_node(741)->dump_bfs(20, find_node(746), "c+")\n"); src/hotspot/share/opto/node.cpp line 2051: > 2049: tty->print(" useful to find how 
x and y are related\n"); > 2050: tty->print(" find_node(741)->dump_bfs(20,find_node(746),\"c+\")\n"); > 2051: tty->print(" find shortest control path between two nodes\n"); Suggestion: tty->print(" find shortest control path between two nodes\n"); src/hotspot/share/opto/node.cpp line 2052: > 2050: tty->print(" find_node(741)->dump_bfs(20,find_node(746),\"c+\")\n"); > 2051: tty->print(" find shortest control path between two nodes\n"); > 2052: tty->print(" find_node(741)->dump_bfs(8,find_node(746),\"cdmxo+A\")\n"); Suggestion: tty->print(" find_node(741)->dump_bfs(8, find_node(746), "cdmxo+A")\n"); src/hotspot/share/opto/node.cpp line 2054: > 2052: tty->print(" find_node(741)->dump_bfs(8,find_node(746),\"cdmxo+A\")\n"); > 2053: tty->print(" find all paths (A) between two nodes of length at most 8\n"); > 2054: tty->print(" find_node(741)->dump_bfs(7,find_node(741),\"c+A\")\n"); Suggestion: tty->print(" find_node(741)->dump_bfs(7, find_node(741), "c+A")\n"); ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8468 From epeter at openjdk.java.net Tue Jun 7 11:17:22 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Tue, 7 Jun 2022 11:17:22 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v29] In-Reply-To: References: Message-ID: > **What this gives you for the debugger** > - BFS traversal (inputs / outputs) > - node filtering by category > - shortest path between nodes > - all paths between nodes > - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional) > - and more > > **Some usecases** > - more readable `dump` > - follow only nodes of some categories (only control, only data, etc) > - find which control nodes depend on data node (visit data nodes, include control in boundary) > - how two nodes relate (shortest / all paths, following input/output nodes, or both) > - find loops (control / memory / data: call all paths with node as start and target) > > **Description** > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`). To find all paths between two nodes, include the letter `A` in the options string. > > `void Node::dump_bfs(const int max_distance, Node* target, char const* options)` > > To get familiar with the many options, run this to get help: > `find_node(0)->dump_bfs(0,0,"h")` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. 
> Thanks, Emanuel > > PS: I do plan to refactor the `dump` code in `node.cpp` to use my new infrastructure. I will also remove `Node::related` and `dump_related`, since they have not been properly extended and maintained. But that refactoring would risk messing with tools that depend on `dump`, which I would like to avoid for now and defer to a second step. > > **Better dump()** > The function is similar to `dump()`: we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. > 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`. > 3. Node filtering with types taken from `type->category()` helps limit traversal to control, data or memory flow. The categories are the same as in IGV. > 4. Separate visit / boundary filters by node type: traverse graph visiting only some node types (e.g. data). On the boundary, also display but do not traverse nodes allowed by boundary filter (e.g. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for visit filter, and `DCMXO` for boundary filter. > 5. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors. > 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph.
It is thus very helpful to have the old node idx displayed. Use `@` in options string. > 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string. > 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string! > > Example (BFS inputs): > > (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") > d dump > --------------------------------------------- > 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > > > Example (BFS control inputs): > > (rr) p find_node(163)->dump_bfs(5,0,"c+") > d dump > --------------------------------------------- > 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 4 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 0 163 
OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 > > We see the control flow of a strip mined loop. > > > Experiment (BFS only data, but display all nodes on boundary) > > (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") > d dump > --------------------------------------------- > 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) > 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] > > We see the dependent output nodes of the data-phi 102, we see that a SafePoint and the Return depend on it. Here colors are really helpful, as it makes it easy to separate the data-nodes (blue) from the boundary-nodes (other colors). 
> > Example with Mach nodes: > > (rr) p find_node(112)->dump_bfs(2,0,"cdmxo+#@B") > d [head idom d] old dump > --------------------------------------------- > 2 534 505 6 o1871 109 addI_rReg_imm === _ 44 [[ 110 102 113 230 327 ]] #-3/0xfffffffd > 2 536 537 15 o186 139 addI_rReg_imm === _ 137 [[ 140 137 113 144 ]] #4/0x00000004 !jvms: StringLatin1::replace @ bci:13 (line 303) > 2 537 538 14 o179 114 IfTrue === 115 [[ 536 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 536 537 15 o739 113 compI_rReg === _ 139 109 [[ 112 ]] > 1 536 537 15 _ 536 Region === 536 114 [[ 536 112 ]] > 0 536 537 15 o741 112 jmpLoopEnd === 536 113 [[ 134 111 ]] P=0.993611, C=7200.000000 !jvms: StringLatin1::replace @ bci:19 (line 303) > > And the query on the old nodes: > > (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#") > d dump > --------------------------------------------- > 2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]] > 2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]] > 2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000 > 1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]] > 1 o740 Bool === _ o739 [[ o741 ]] [lt] > 1 o179 IfTrue === o178 [[ o741 ]] #1 > 0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000 > > > **Exploring loop body** > When I find myself in a loop, I try to locate the loop head and end, and map out at least one path between them. > `loop_end->dump_bfs(20, loop_head, "c+")` > This provides us with a shortest control path, provided that path has a length of at most 20.
> > Example (shortest path over control nodes): > > (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+") > d dump > --------------------------------------------- > 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > > Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lie on a path between the start and target node whose length is at most the specified maximum.
> > Example (all paths between two nodes): > > (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") > d apd dump > --------------------------------------------- > 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) > 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) > 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) > 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) > 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) > 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all paths distance `apd` is the sum of the shortest path from the current node to the start plus the shortest path to the target node. Thus, we can easily compute the distance to the target node with `apd - d`.
>
> An alternative way to detect loops quickly is to run an all paths query from a node to itself:
>
> Example (loop detection with all paths):
>
>   (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A")
>   d apd dump
>   ---------------------------------------------
>   6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303)
>   5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304)
>   4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304)
>   3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304)
>   2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304)
>   1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304)
>   0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303)
>
> We get the loop control, plus the loop-back `190 IfTrue`.
>
> Example (loop detection with all paths for phi):
>
>   (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A")
>   d apd dump
>   ---------------------------------------------
>   1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],...
!jvms: StringLatin1::replace @ bci:13 (line 303) > 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > > > **Color examples** > Colors are especially useful to see chains between nodes (options character `#`). > The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. > Tip: it can be worth it to configure the colors of your terminal to be more appealing. > > Example (find control dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) > We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. > > Example (find memory dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) > > Example (loop detection): > ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) > We find the control and some data loop paths. 
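As a rough illustration of how the `d` and `apd` columns in the all paths examples relate (this is not the C++ code from the patch), the sketch below runs one BFS from the start node along input edges and one BFS from the target node along the reversed edges, then adds the two distances. The class name `AllPathsSketch`, its helper methods, and the hand-transcribed edge list are assumptions made for this example; only edges between nodes printed in the "all paths between two nodes" dump are included. For that edge list, the computed values agree with the `d` and `apd` columns of the dump.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AllPathsSketch {
    // Input edges (node idx -> input idx), transcribed by hand from the dump.
    // Inputs that are not printed there (e.g. 79, 51, 190) are left out.
    static final Map<Integer, List<Integer>> INPUTS = Map.ofEntries(
        Map.entry(741, List.of(179, 740)),
        Map.entry(179, List.of(178)),
        Map.entry(178, List.of(149, 177)),
        Map.entry(177, List.of(176)),
        Map.entry(176, List.of(166)),
        Map.entry(166, List.of(149)),
        Map.entry(149, List.of(148)),
        Map.entry(148, List.of(746, 147)),
        Map.entry(147, List.of(146)),
        Map.entry(146, List.of(141)),
        Map.entry(141, List.of(746, 186)),
        Map.entry(740, List.of(739)),
        Map.entry(739, List.of(186)),
        Map.entry(186, List.of(141)));

    // Plain BFS; nodes that cannot be reached are absent from the result.
    static Map<Integer, Integer> bfs(Map<Integer, List<Integer>> edges, int from) {
        Map<Integer, Integer> dist = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        dist.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            int n = queue.remove();
            for (int m : edges.getOrDefault(n, List.of())) {
                if (!dist.containsKey(m)) {
                    dist.put(m, dist.get(n) + 1);
                    queue.add(m);
                }
            }
        }
        return dist;
    }

    // Flip every edge so that a BFS from the target walks "backwards".
    static Map<Integer, List<Integer>> reverse(Map<Integer, List<Integer>> edges) {
        Map<Integer, List<Integer>> rev = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> e : edges.entrySet()) {
            for (int to : e.getValue()) {
                rev.computeIfAbsent(to, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return rev;
    }

    // Returns node idx -> {d, apd} for all nodes with apd <= maxLen.
    static Map<Integer, int[]> allPaths(int start, int target, int maxLen) {
        Map<Integer, Integer> d = bfs(INPUTS, start);
        Map<Integer, Integer> toTarget = bfs(reverse(INPUTS), target);
        Map<Integer, int[]> result = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : d.entrySet()) {
            Integer t = toTarget.get(e.getKey());
            if (t != null && e.getValue() + t <= maxLen) {
                result.put(e.getKey(), new int[] {e.getValue(), e.getValue() + t});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("d apd idx");
        for (Map.Entry<Integer, int[]> e : allPaths(741, 746, 8).entrySet()) {
            System.out.println(e.getValue()[0] + " " + e.getValue()[1] + " " + e.getKey());
        }
    }
}
```

Calling `allPaths(741, 746, 5)` instead keeps only the nodes on the two shortest paths (the control path and the loop-back data path).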
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply suggestions from code review from @TobiHartmann Thank you @TobiHartmann Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/63e25056..abd53b02 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=28 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=27-28 Stats: 9 lines in 1 file changed: 0 ins; 0 del; 9 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From chagedorn at openjdk.java.net Tue Jun 7 12:24:45 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 7 Jun 2022 12:24:45 GMT Subject: RFR: 8287432: C2: assert(tn->in(0) != __null) failed: must have live top node Message-ID: When intrisifying `java.lang.Thread::currentThread()`, we are creating an `AddP` node that has the `top` node as base to indicate that we do not have an oop (using `NULL` instead leads to crashes as it does not seem to be expected to have a `NULL` base): https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/library_call.cpp#L904 This node is used on a chain of data nodes into two `MemBarAcquire` nodes as precedence edge in the test case: ![Screenshot from 2022-06-07 11-12-38](https://user-images.githubusercontent.com/17833009/172344751-5338b72f-baa5-4e9e-a44c-6d970798d9f2.png) Later, in `final_graph_reshaping_impl()`, we are removing the precedence edge of both `MemBarAcquire` nodes and clean up all now dead nodes as a result of the removal: https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/compile.cpp#L3655-L3679 We iteratively call `disconnect_inputs()` for all nodes that have no output 
anymore (i.e. dead nodes). This code, however, also treats the `top` node as dead since `outcnt()` of `top` is always zero:
https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/node.hpp#L495-L500

And we end up disconnecting `top`, which results in the assertion failure.

The code misses a check for `top()`. I suggest adding this check before processing a node for which `outcnt()` is zero. This is a pattern which can also be found in other places in the code. I've checked all other usages of `outcnt() == 0` and could not find a case where this additional `top()` check is missing. Maybe we should refactor these two checks into a single method at some point so that we no longer need to worry about `top` when checking whether a node is dead based on its outputs.

Thanks,
Christian

-------------

Commit messages:
 - 8287432: C2: assert(tn->in(0) != __null) failed: must have live top node

Changes: https://git.openjdk.java.net/jdk/pull/9060/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9060&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8287432
  Stats: 56 lines in 2 files changed: 55 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/9060.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/9060/head:pull/9060

PR: https://git.openjdk.java.net/jdk/pull/9060

From chagedorn at openjdk.java.net Tue Jun 7 12:44:07 2022
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Tue, 7 Jun 2022 12:44:07 GMT
Subject: RFR: 8286451: C2: assert(nb == 1) failed: only when the head is not shared
In-Reply-To:
References:
Message-ID:

On Mon, 30 May 2022 13:50:08 GMT, Roland Westrelin wrote:

> nb counts the number of loops that share a single head. The assert
> that fires is in code that handles the case of a self loop (a loop
> composed of a single block). There can be a self loop and multiple
> loops that share a head: the assert makes little sense and I propose
> to simply remove it.
> > I think there's another issue with this code: in the case of a self > loop and multiple loops that share a head, the self loop can be any of > the loop for which the head is cloned not only the one that's passed > as argument to ciTypeFlow::clone_loop_head(). As a consequence, I > moved the logic for self loops in the loop that's applied to all loops > that share the loop head. Looks good! > nb counts the number of loops that share a single head Maybe you also want to rename the variable to make it more clear what it counts. test/hotspot/jtreg/compiler/ciTypeFlow/TestSharedLoopHead.java line 28: > 26: * @bug 8286451 > 27: * @summary C2: assert(nb == 1) failed: only when the head is not shared > 28: * @run main/othervm TestSharedLoopHead Could be converted to `@run driver TestSharedLoopHead` instead. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8947 From redestad at openjdk.java.net Tue Jun 7 12:44:42 2022 From: redestad at openjdk.java.net (Claes Redestad) Date: Tue, 7 Jun 2022 12:44:42 GMT Subject: RFR: 8287903: Reduce runtime of java.math microbenchmarks Message-ID: - Reduce runtime by running fewer forks, fewer iterations, less warmup. All micros tested in this group appear to stabilize very quickly. - Refactor BigIntegers to avoid re-running some (most) micros over and over with parameter values that don't affect them. Expected runtime down from 14 hours to 15 minutes. 
------------- Commit messages: - Cleanup SmallShifts, reduce param space - Annotations not cascading to static nested class - Warmup/Measurement not cascading to static nested class - Reduce runtime of java.math micros - Reduce runtime of java.math micros Changes: https://git.openjdk.java.net/jdk/pull/9062/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9062&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287903 Stats: 104 lines in 4 files changed: 66 ins; 33 del; 5 mod Patch: https://git.openjdk.java.net/jdk/pull/9062.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9062/head:pull/9062 PR: https://git.openjdk.java.net/jdk/pull/9062 From chagedorn at openjdk.java.net Tue Jun 7 12:52:57 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 7 Jun 2022 12:52:57 GMT Subject: RFR: 8286625: C2 fails with assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect In-Reply-To: <-ZfbcgBcRabQqlggf35uK2HTi-1MSnCCZBV1qwRrT8E=.ae2012fd-6e54-4a16-9070-85f08b74beb6@github.com> References: <-ZfbcgBcRabQqlggf35uK2HTi-1MSnCCZBV1qwRrT8E=.ae2012fd-6e54-4a16-9070-85f08b74beb6@github.com> Message-ID: <6-C7IgEnGJdIIri7w55t0kNTAoUpM5Cj4CnmTLjcfWE=.aa885714-b974-426a-a93c-cc16b782d8e5@github.com> On Thu, 2 Jun 2022 15:43:09 GMT, Roland Westrelin wrote: > It's another case where because of overunrolling, the main loop is > never executed but not optimized out and the type of some > CastII/ConvI2L for a range check conflicts with the type of its input > resulting in a broken graph for the main loop. > > This is supposed to have been solved by skeleton predicates. There's > indeed a predicate that should catch that the loop is unreachable but > it doesn't constant fold. 
The shape of the predicate is: > > (CmpUL (SubL 15 (ConvI2L (AddI (CastII int:>=1) 15) minint..maxint)) 16) > > I propose adding a CastII, that is in this case: > > (CmpUL (SubL 15 (ConvI2L (CastII (AddI (CastII int:>=1) 15) 0..max-1) minint..maxint)) 16) > > The justification for the CastII is that the skeleton predicate is a > predicate for a specific iteration of the loop. That iteration of the > loop must be in the range of the iv Phi. > > With the extra CastII, the AddI can be pushed through the CastII and > ConvI2L and the check constant folds. Actually, with the extra CastII, > the predicate is not implemented with a CmpUL but a CmpU because the > code can tell there's no risk of overflow (I did force the use of > CmpUL as an experiment and the CmpUL does constant fold) Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8996 From sviswanathan at openjdk.java.net Tue Jun 7 13:21:58 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Tue, 7 Jun 2022 13:21:58 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v5] In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 03:03:47 GMT, Vladimir Kozlov wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix extra space > > Results are good. > You need second review. @vnkozlov Thanks a lot for the review and test. @jatin-bhateja Could you please review this PR. It is an extension of your earlier work. 
------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From ecaspole at openjdk.java.net Tue Jun 7 13:52:05 2022 From: ecaspole at openjdk.java.net (Eric Caspole) Date: Tue, 7 Jun 2022 13:52:05 GMT Subject: RFR: 8287903: Reduce runtime of java.math microbenchmarks In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 12:34:25 GMT, Claes Redestad wrote: > - Reduce runtime by running fewer forks, fewer iterations, less warmup. All micros tested in this group appear to stabilize very quickly. > - Refactor BigIntegers to avoid re-running some (most) micros over and over with parameter values that don't affect them. > > Expected runtime down from 14 hours to 15 minutes. Looks good. Will make these so much more practical to use. ------------- Marked as reviewed by ecaspole (Committer). PR: https://git.openjdk.java.net/jdk/pull/9062 From roland at openjdk.java.net Tue Jun 7 14:42:39 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 7 Jun 2022 14:42:39 GMT Subject: RFR: 8286197: C2: Optimize MemorySegment shape in int loop [v3] In-Reply-To: References: Message-ID: > This is another small enhancement for a code shape that showed up in a > MemorySegment micro benchmark. The shape to optimize is the one from test1: > > > for (int i = 0; i < size; i++) { > long j = i * UNSAFE.ARRAY_INT_INDEX_SCALE; > > j = Objects.checkIndex(j, size * 4); > > if (((base + j) & 3) != 0) { > throw new RuntimeException(); > } > > v += UNSAFE.getInt(base + j); > } > > > In that code shape, the loop iv is first scaled, result is then casted > to long, range checked and finally address of memory location is > computed. > > The alignment check is transformed so the loop body has no check In > order to eliminate the range check, that loop is transformed into: > > > for (int i1 = ..) { > for (int i2 = ..) 
{
>             long j = (i1 + i2) * UNSAFE.ARRAY_INT_INDEX_SCALE;
>
>             j = Objects.checkIndex(j, size * 4);
>
>             v += UNSAFE.getInt(base + j);
>         }
>     }
>
>
> The address shape is (AddP base (CastLL (ConvI2L (LShiftI (AddI ...
>
> In this case, the type of the ConvI2L is [min_jint, max_jint] and the type
> of the CastLL is [0, max_jint] (the CastLL has a narrower type).
>
> I propose transforming (CastLL (ConvI2L into (ConvI2L (CastII in that
> case. The ConvI2L and CastII types can be set to [0, max_jint]. The
> new address shape is then:
>
> (AddP base (ConvI2L (CastII (LShiftI (AddI ...
>
> which optimizes well.
>
> (LShiftI (AddI ...
> is transformed into
> (AddI (LShiftI ...
> because one of the AddI inputs is loop invariant (i2) and we have:
>
> (AddP base (ConvI2L (CastII (AddI (LShiftI ...
>
> Then because the ConvI2L and CastII types are [0, max_jint], the AddI
> is pushed through the ConvI2L and CastII:
>
> (AddP base (AddL (ConvI2L (CastII (LShiftI ...
>
> base and one of the inputs of the AddL are loop invariant so this is
> transformed into:
>
> (AddP (AddP ...) (ConvI2L (CastII (LShiftI ...
>
> The (AddP ...) is loop invariant so computed before entry. The
> (ConvI2L ...) only depends on the loop iv.
>
> The resulting address is a shift + an add. The address before
> transformation requires 2 adds + a shift. Also after unrolling, the
> address of the second access in the loop is cheaper to compute as it
> can be derived from the address of the first access.
>
> For all of this to work:
> 1) I added a CastLL::Ideal transformation:
> (CastLL (ConvI2L into (ConvI2L (CastII
>
> 2) I also had to prevent split-if from transforming (LShiftI (Phi for the
> iv Phi of a counted loop.
>
>
> test2 and test3 test 1) and 2) separately.
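To make the role of the [0, max_jint] type concrete, here is a small Java sketch (not the C2 code) that evaluates the address shape before and after the rewrites described above. The class `AddressShapeSketch` and its method names are made up for this illustration, and the scale is hard-coded as a shift by 2. The two shapes agree when the scaled index is known to stay in [0, max_jint]; without that knowledge, pushing the add through the int-to-long conversion can change the result, which is presumably why the narrowed ConvI2L/CastII types are needed.

```java
public class AddressShapeSketch {
    // Shape before: base + ConvI2L(LShiftI(AddI(i1, i2), 2))
    static long before(long base, int i1, int i2) {
        return base + (long) ((i1 + i2) << 2);
    }

    // Shape after: (base + ConvI2L(i2 << 2)) + ConvI2L(i1 << 2)
    // The first part can be hoisted when i2 is loop invariant.
    static long after(long base, int i1, int i2) {
        long invariantPart = base + (long) (i2 << 2);
        return invariantPart + (long) (i1 << 2);
    }

    // True if both indices are non-negative and the scaled sum fits in
    // [0, max_jint], i.e. neither the int add nor the int shift wraps.
    static boolean inRange(int i1, int i2) {
        long exact = ((long) i1 + (long) i2) * 4L;
        return i1 >= 0 && i2 >= 0 && exact <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        // In range: both shapes compute the same address.
        System.out.println(before(1000, 3, 5) == after(1000, 3, 5));
        // Out of range: the int shift wraps in the "before" shape only.
        int big = 0x10000000;
        System.out.println(inRange(big, big));
        System.out.println(before(0, big, big) + " vs " + after(0, big, big));
    }
}
```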
Roland Westrelin has updated the pull request incrementally with three additional commits since the last revision: - Update test/hotspot/jtreg/compiler/c2/irTests/TestConvI2LCastLongLoop.java Co-authored-by: Tobias Hartmann - Update src/hotspot/share/opto/castnode.hpp Co-authored-by: Tobias Hartmann - Update src/hotspot/share/opto/castnode.cpp Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8555/files - new: https://git.openjdk.java.net/jdk/pull/8555/files/a122f0cf..82610973 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8555&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8555&range=01-02 Stats: 3 lines in 3 files changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8555.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8555/head:pull/8555 PR: https://git.openjdk.java.net/jdk/pull/8555 From roland at openjdk.java.net Tue Jun 7 14:42:39 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 7 Jun 2022 14:42:39 GMT Subject: RFR: 8286197: C2: Optimize MemorySegment shape in int loop [v2] In-Reply-To: References: Message-ID: <7gmOn_KbLyYJO4nIxWbRJ_dIurZG3iBmAWrS2hjhYUg=.8ab54061-ae01-47a9-858a-3603131fa14d@github.com> On Tue, 7 Jun 2022 09:20:47 GMT, Tobias Hartmann wrote: > Looks good to me but I'm wondering if we should delay this to JDK 20 as we are late for the JDK 19 release and had many issues with Cast/ConvNodes before. Thanks for the review. I have no objection to delaying until JDK 20. 
-------------

PR: https://git.openjdk.java.net/jdk/pull/8555

From roland at openjdk.java.net Tue Jun 7 14:43:26 2022
From: roland at openjdk.java.net (Roland Westrelin)
Date: Tue, 7 Jun 2022 14:43:26 GMT
Subject: RFR: 8287700: C2 Crash running eclipse benchmark from Dacapo
Message-ID:

With 8275201 (C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses), I made the following change to escape.cpp:

@@ -2628,7 +2632,7 @@ bool ConnectionGraph::split_AddP(Node *addp, Node *base) {
   // this code branch will go away.
   //
   if (!t->is_known_instance() &&
-      !base_t->klass()->is_subtype_of(t->klass())) {
+      !t->maybe_java_subtype_of(base_t)) {
      return false; // bail out
   }
   const TypeOopPtr *tinst = base_t->add_offset(t->offset())->is_oopptr();
@@ -3312,7 +3316,7 @@ void ConnectionGraph::split_unique_types(GrowableArray &alloc_worklist,
         } else {
           tn_t = tn_type->isa_oopptr();
         }
-        if (tn_t != NULL && tinst->klass()->is_subtype_of(tn_t->klass())) {
+        if (tn_t != NULL && tn_t->maybe_java_subtype_of(tinst)) {
           if (tn_type->isa_narrowoop()) {
             tn_type = tinst->make_narrowoop();
           } else {
@@ -3325,7 +3329,7 @@ void ConnectionGraph::split_unique_types(GrowableArray &alloc_worklist,
           record_for_optimizer(n);
         } else {
           assert(tn_type == TypePtr::NULL_PTR ||
-                 tn_t != NULL && !tinst->klass()->is_subtype_of(tn_t->klass()),
+                 tn_t != NULL && !tinst->is_java_subtype_of(tn_t),
                  "unexpected type");
           continue; // Skip dead path with different type
         }

There, I inverted the subtype and supertype in a subtype check (that is, `tn_t->maybe_java_subtype_of(tinst)` when it was `tinst->klass()->is_subtype_of(tn_t->klass())`) in 2 places, for no good reason as far as I can tell now. The assert also used to test the same condition as the if above, but I changed that by mistake. This fix addresses both issues.
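As a loose analogy for why swapping receiver and argument in a subtype check matters, the sketch below uses `java.lang.Class` as a stand-in. It says nothing about C2's `TypeOopPtr` API; the class `SubtypeDirectionSketch` and its helper `isSubtypeOf` are hypothetical names chosen for this example.

```java
public class SubtypeDirectionSketch {
    // "sub is a subtype of sup", expressed with Class.isAssignableFrom.
    static boolean isSubtypeOf(Class<?> sub, Class<?> sup) {
        return sup.isAssignableFrom(sub);
    }

    public static void main(String[] args) {
        // Integer extends Number, so only one direction holds.
        System.out.println(isSubtypeOf(Integer.class, Number.class)); // true
        System.out.println(isSubtypeOf(Number.class, Integer.class)); // false
    }
}
```

A check that is written with the two types exchanged silently answers a different question, which is easy to miss in review because both forms type-check.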
------------- Commit messages: - comment - fix & test Changes: https://git.openjdk.java.net/jdk/pull/9054/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9054&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287700 Stats: 76 lines in 2 files changed: 73 ins; 0 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/9054.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9054/head:pull/9054 PR: https://git.openjdk.java.net/jdk/pull/9054 From chagedorn at openjdk.java.net Tue Jun 7 15:06:05 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 7 Jun 2022 15:06:05 GMT Subject: RFR: 8287700: C2 Crash running eclipse benchmark from Dacapo In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 07:48:42 GMT, Roland Westrelin wrote: > With 8275201 (C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses), I made the following change to escape.cpp: > > > @@ -2628,7 +2632,7 @@ bool ConnectionGraph::split_AddP(Node *addp, Node *base) { > // this code branch will go away. 
> // > if (!t->is_known_instance() && > - !base_t->klass()->is_subtype_of(t->klass())) { > + !t->maybe_java_subtype_of(base_t)) { > return false; // bail out > } > const TypeOopPtr *tinst = base_t->add_offset(t->offset())->is_oopptr(); > @@ -3312,7 +3316,7 @@ void ConnectionGraph::split_unique_types(GrowableArray &alloc_worklist, > } else { > tn_t = tn_type->isa_oopptr(); > } > - if (tn_t != NULL && tinst->klasgs()->is_subtype_of(tn_t->klass())) { > + if (tn_t != NULL && tn_t->maybe_java_subtype_of(tinst)) { > if (tn_type->isa_narrowoop()) { > tn_type = tinst->make_narrowoop(); > } else { > @@ -3325,7 +3329,7 @@ void ConnectionGraph::split_unique_types(GrowableArray &alloc_worklist, > record_for_optimizer(n); > } else { > assert(tn_type == TypePtr::NULL_PTR || > - tn_t != NULL && !tinst->klass()->is_subtype_of(tn_t->klass()), > + tn_t != NULL && !tinst->is_java_subtype_of(tn_t), > "unexpected type"); > continue; // Skip dead path with different type > } > > > Where I inverted the subtype and supertype in a subtype check (that is `tn_t->maybe_java_subtype_of(tinst)` when it was `tinst->klasgs()->is_subtype_of(tn_t->klass())`) in 2 places for no good reason AFAICT now. The assert used to also test the same condition as the if above but I changed that by mistake. This fixes addresses both issues. That looks good to me. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/9054 From roland at openjdk.java.net Tue Jun 7 15:08:29 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 7 Jun 2022 15:08:29 GMT Subject: RFR: 8286625: C2 fails with assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect [v2] In-Reply-To: <-ZfbcgBcRabQqlggf35uK2HTi-1MSnCCZBV1qwRrT8E=.ae2012fd-6e54-4a16-9070-85f08b74beb6@github.com> References: <-ZfbcgBcRabQqlggf35uK2HTi-1MSnCCZBV1qwRrT8E=.ae2012fd-6e54-4a16-9070-85f08b74beb6@github.com> Message-ID: > It's another case where because of overunrolling, the main loop is > never executed but not optimized out and the type of some > CastII/ConvI2L for a range check conflicts with the type of its input > resulting in a broken graph for the main loop. > > This is supposed to have been solved by skeleton predicates. There's > indeed a predicate that should catch that the loop is unreachable but > it doesn't constant fold. The shape of the predicate is: > > (CmpUL (SubL 15 (ConvI2L (AddI (CastII int:>=1) 15) minint..maxint)) 16) > > I propose adding a CastII, that is in this case: > > (CmpUL (SubL 15 (ConvI2L (CastII (AddI (CastII int:>=1) 15) 0..max-1) minint..maxint)) 16) > > The justification for the CastII is that the skeleton predicate is a > predicate for a specific iteration of the loop. That iteration of the > loop must be in the range of the iv Phi. > > With the extra CastII, the AddI can be pushed through the CastII and > ConvI2L and the check constant folds. 
Actually, with the extra CastII, > the predicate is not implemented with a CmpUL but a CmpU because the > code can tell there's no risk of overflow (I did force the use of > CmpUL as an experiment and the CmpUL does constant fold) Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/loopopts/TestOverUnrolling2.java Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8996/files - new: https://git.openjdk.java.net/jdk/pull/8996/files/8f4a999a..9c9120ef Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8996&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8996&range=00-01 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8996.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8996/head:pull/8996 PR: https://git.openjdk.java.net/jdk/pull/8996 From roland at openjdk.java.net Tue Jun 7 15:08:43 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 7 Jun 2022 15:08:43 GMT Subject: RFR: 8286451: C2: assert(nb == 1) failed: only when the head is not shared [v2] In-Reply-To: References: Message-ID: > nb counts the number of loops that share a single head. The assert > that fires is in code that handles the case of a self loop (a loop > composed of a single block). There can be a self loop and multiple > loops that share a head: the assert makes little sense and I propose > to simply remove it. > > I think there's another issue with this code: in the case of a self > loop and multiple loops that share a head, the self loop can be any of > the loop for which the head is cloned not only the one that's passed > as argument to ciTypeFlow::clone_loop_head(). As a consequence, I > moved the logic for self loops in the loop that's applied to all loops > that share the loop head. 
Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - review - Merge branch 'master' into JDK-8286451 - fix & test ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8947/files - new: https://git.openjdk.java.net/jdk/pull/8947/files/7ece2dc2..f3943fc7 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8947&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8947&range=00-01 Stats: 53849 lines in 735 files changed: 28430 ins; 18849 del; 6570 mod Patch: https://git.openjdk.java.net/jdk/pull/8947.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8947/head:pull/8947 PR: https://git.openjdk.java.net/jdk/pull/8947 From chagedorn at openjdk.java.net Tue Jun 7 15:08:45 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 7 Jun 2022 15:08:45 GMT Subject: RFR: 8286451: C2: assert(nb == 1) failed: only when the head is not shared [v2] In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 14:54:13 GMT, Roland Westrelin wrote: >> nb counts the number of loops that share a single head. The assert >> that fires is in code that handles the case of a self loop (a loop >> composed of a single block). There can be a self loop and multiple >> loops that share a head: the assert makes little sense and I propose >> to simply remove it. >> >> I think there's another issue with this code: in the case of a self >> loop and multiple loops that share a head, the self loop can be any of >> the loop for which the head is cloned not only the one that's passed >> as argument to ciTypeFlow::clone_loop_head(). As a consequence, I >> moved the logic for self loops in the loop that's applied to all loops >> that share the loop head. 
> > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - review > - Merge branch 'master' into JDK-8286451 > - fix & test That looks good, thanks for doing the updates! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8947 From roland at openjdk.java.net Tue Jun 7 15:08:47 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 7 Jun 2022 15:08:47 GMT Subject: RFR: 8286451: C2: assert(nb == 1) failed: only when the head is not shared [v2] In-Reply-To: References: Message-ID: <2GycWLABQJ3chAz_BsSi-m8uU2ed1dzbYXSykJlCvHc=.ea2c3132-08bd-4121-be9c-dcfd327f52c7@github.com> On Tue, 7 Jun 2022 12:40:17 GMT, Christian Hagedorn wrote: > Looks good! Thanks for the review. > > nb counts the number of loops that share a single head > > Maybe you also want to rename the variable to make it more clear what it counts. Done. > test/hotspot/jtreg/compiler/ciTypeFlow/TestSharedLoopHead.java line 28: > >> 26: * @bug 8286451 >> 27: * @summary C2: assert(nb == 1) failed: only when the head is not shared >> 28: * @run main/othervm TestSharedLoopHead > > Could be converted to `@run driver TestSharedLoopHead` instead. 
Right but the test needs -XX:-BackgroundCompilation ------------- PR: https://git.openjdk.java.net/jdk/pull/8947 From kvn at openjdk.java.net Tue Jun 7 16:01:27 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 7 Jun 2022 16:01:27 GMT Subject: RFR: 8287432: C2: assert(tn->in(0) != __null) failed: must have live top node In-Reply-To: References: Message-ID: <2FpWwr6cZausUARllThLxGmfUFBP9bRo_FMse3JZx-8=.e38d055f-05bd-4d40-9c9e-4dcef0a01b64@github.com> On Tue, 7 Jun 2022 12:16:00 GMT, Christian Hagedorn wrote: > When intrisifying `java.lang.Thread::currentThread()`, we are creating an `AddP` node that has the `top` node as base to indicate that we do not have an oop (using `NULL` instead leads to crashes as it does not seem to be expected to have a `NULL` base): > https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/library_call.cpp#L904 > > This node is used on a chain of data nodes into two `MemBarAcquire` nodes as precedence edge in the test case: > ![Screenshot from 2022-06-07 11-12-38](https://user-images.githubusercontent.com/17833009/172344751-5338b72f-baa5-4e9e-a44c-6d970798d9f2.png) > > Later, in `final_graph_reshaping_impl()`, we are removing the precedence edge of both `MemBarAcquire` nodes and clean up all now dead nodes as a result of the removal: > https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/compile.cpp#L3655-L3679 > > We iteratively call `disconnect_inputs()` for all nodes that have no output anymore (i.e. dead nodes). This code, however, also treats the `top` node as dead since `outcnt()` of `top` is always zero: > https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/node.hpp#L495-L500 > > And we end up disconnecting `top` which results in the assertion failure. > > The code misses a check for `top()`. I suggest to add this check before processing a node for which `outcnt()` is zero. 
This is a pattern which can also be found in other places in the code. I've checked all other usages of `oucnt() == 0` and could not find a case where this additional `top()` check is missing. Maybe we should refactor these two checks into a single method at some point to not need to worry about `top` anymore in the future when checking if a node is dead based on the outputs. > > Thanks, > Christian Your analysis and fix is correct. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/9060 From thartmann at openjdk.java.net Tue Jun 7 16:14:26 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 7 Jun 2022 16:14:26 GMT Subject: RFR: 8287432: C2: assert(tn->in(0) != __null) failed: must have live top node In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 12:16:00 GMT, Christian Hagedorn wrote: > When intrisifying `java.lang.Thread::currentThread()`, we are creating an `AddP` node that has the `top` node as base to indicate that we do not have an oop (using `NULL` instead leads to crashes as it does not seem to be expected to have a `NULL` base): > https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/library_call.cpp#L904 > > This node is used on a chain of data nodes into two `MemBarAcquire` nodes as precedence edge in the test case: > ![Screenshot from 2022-06-07 11-12-38](https://user-images.githubusercontent.com/17833009/172344751-5338b72f-baa5-4e9e-a44c-6d970798d9f2.png) > > Later, in `final_graph_reshaping_impl()`, we are removing the precedence edge of both `MemBarAcquire` nodes and clean up all now dead nodes as a result of the removal: > https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/compile.cpp#L3655-L3679 > > We iteratively call `disconnect_inputs()` for all nodes that have no output anymore (i.e. dead nodes). 
This code, however, also treats the `top` node as dead since `outcnt()` of `top` is always zero: > https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/node.hpp#L495-L500 > > And we end up disconnecting `top` which results in the assertion failure. > > The code misses a check for `top()`. I suggest to add this check before processing a node for which `outcnt()` is zero. This is a pattern which can also be found in other places in the code. I've checked all other usages of `oucnt() == 0` and could not find a case where this additional `top()` check is missing. Maybe we should refactor these two checks into a single method at some point to not need to worry about `top` anymore in the future when checking if a node is dead based on the outputs. > > Thanks, > Christian Nice analysis and test. Looks good. test/hotspot/jtreg/compiler/c2/TestRemoveMemBarPrecEdge.java line 45: > 43: > 44: public static void test() { > 45: // currentThread() is intrisified and C2 emits a special AddP node with a base that is top. Suggestion: // currentThread() is intrinsified and C2 emits a special AddP node with a base that is top. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/9060 From chagedorn at openjdk.java.net Tue Jun 7 16:56:15 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 7 Jun 2022 16:56:15 GMT Subject: RFR: 8287432: C2: assert(tn->in(0) != __null) failed: must have live top node [v2] In-Reply-To: References: Message-ID: > When intrinsifying `java.lang.Thread::currentThread()`, we are creating an `AddP` node that has the `top` node as base to indicate that we do not have an oop (using `NULL` instead leads to crashes as it does not seem to be expected to have a `NULL` base): > https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/library_call.cpp#L904 > > This node is used on a chain of data nodes into two `MemBarAcquire` nodes as precedence edge in the test case: > ![Screenshot from 2022-06-07 11-12-38](https://user-images.githubusercontent.com/17833009/172344751-5338b72f-baa5-4e9e-a44c-6d970798d9f2.png) > > Later, in `final_graph_reshaping_impl()`, we are removing the precedence edge of both `MemBarAcquire` nodes and clean up all now dead nodes as a result of the removal: > https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/compile.cpp#L3655-L3679 > > We iteratively call `disconnect_inputs()` for all nodes that have no output anymore (i.e. dead nodes). This code, however, also treats the `top` node as dead since `outcnt()` of `top` is always zero: > https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/node.hpp#L495-L500 > > And we end up disconnecting `top` which results in the assertion failure. > > The code misses a check for `top()`. I suggest to add this check before processing a node for which `outcnt()` is zero. This is a pattern which can also be found in other places in the code. I've checked all other usages of `outcnt() == 0` and could not find a case where this additional `top()` check is missing.
Maybe we should refactor these two checks into a single method at some point to not need to worry about `top` anymore in the future when checking if a node is dead based on the outputs. > > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/c2/TestRemoveMemBarPrecEdge.java Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/9060/files - new: https://git.openjdk.java.net/jdk/pull/9060/files/3f351d40..46a1ef11 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=9060&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=9060&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/9060.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9060/head:pull/9060 PR: https://git.openjdk.java.net/jdk/pull/9060 From chagedorn at openjdk.java.net Tue Jun 7 16:56:16 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 7 Jun 2022 16:56:16 GMT Subject: RFR: 8287432: C2: assert(tn->in(0) != __null) failed: must have live top node In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 12:16:00 GMT, Christian Hagedorn wrote: > When intrinsifying `java.lang.Thread::currentThread()`, we are creating an `AddP` node that has the `top` node as base to indicate that we do not have an oop (using `NULL` instead leads to crashes as it does not seem to be expected to have a `NULL` base): > https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/library_call.cpp#L904 > > This node is used on a chain of data nodes into two `MemBarAcquire` nodes as precedence edge in the test case: > ![Screenshot from 2022-06-07 11-12-38](https://user-images.githubusercontent.com/17833009/172344751-5338b72f-baa5-4e9e-a44c-6d970798d9f2.png) > > Later, in `final_graph_reshaping_impl()`, we are removing the
precedence edge of both `MemBarAcquire` nodes and clean up all now dead nodes as a result of the removal: > https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/compile.cpp#L3655-L3679 > > We iteratively call `disconnect_inputs()` for all nodes that have no output anymore (i.e. dead nodes). This code, however, also treats the `top` node as dead since `outcnt()` of `top` is always zero: > https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/node.hpp#L495-L500 > > And we end up disconnecting `top` which results in the assertion failure. > > The code misses a check for `top()`. I suggest to add this check before processing a node for which `outcnt()` is zero. This is a pattern which can also be found in other places in the code. I've checked all other usages of `outcnt() == 0` and could not find a case where this additional `top()` check is missing. Maybe we should refactor these two checks into a single method at some point to not need to worry about `top` anymore in the future when checking if a node is dead based on the outputs. > > Thanks, > Christian Thanks Vladimir and Tobias for your reviews!
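For readers outside HotSpot, the failure mode discussed in this thread can be illustrated with a minimal standalone C++ sketch. This is not the C2 code: `MiniNode`, `add_input`, `is_dead` and `remove_dead` are hypothetical names, and the model assumes only what the description states, namely that uses of the `top` sentinel are not recorded on it, so its output count is always zero and a dead-node check has to exclude it explicitly.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical miniature of an IR node; not HotSpot's Node class.
struct MiniNode {
    std::vector<MiniNode*> inputs;   // nodes this node uses
    std::vector<MiniNode*> outputs;  // nodes that use this node
    bool is_top = false;             // sentinel: its users are never recorded
    bool disconnected = false;       // set once the node has been cleaned up
};

// Make `use` depend on `def`. Uses of the sentinel are not recorded on the
// sentinel itself, so its output count stays zero by construction.
void add_input(MiniNode* use, MiniNode* def) {
    assert(use != nullptr && def != nullptr);
    use->inputs.push_back(def);
    if (!def->is_top) {
        def->outputs.push_back(use);
    }
}

// The combined check: zero outputs alone is not enough, because the
// sentinel always has zero outputs but must never be treated as dead.
bool is_dead(const MiniNode* n) {
    return n->outputs.empty() && !n->is_top;
}

// Iteratively disconnect the inputs of dead nodes, starting at `start`.
// Returns the number of nodes that were disconnected.
int remove_dead(MiniNode* start) {
    int removed = 0;
    std::vector<MiniNode*> worklist{start};
    while (!worklist.empty()) {
        MiniNode* n = worklist.back();
        worklist.pop_back();
        if (n->disconnected || !is_dead(n)) {
            continue;  // live node, sentinel, or already handled
        }
        for (MiniNode* in : n->inputs) {
            auto& outs = in->outputs;
            outs.erase(std::remove(outs.begin(), outs.end(), n), outs.end());
            worklist.push_back(in);  // the input may have become dead
        }
        n->inputs.clear();
        n->disconnected = true;
        ++removed;
    }
    return removed;
}
```

Dropping the `!n->is_top` test from `is_dead` would let the loop "clean up" the sentinel as well, which is the analogue of the assertion failure reported here.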
------------- PR: https://git.openjdk.java.net/jdk/pull/9060 From rcastanedalo at openjdk.java.net Tue Jun 7 17:14:23 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 7 Jun 2022 17:14:23 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v29] In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 11:17:22 GMT, Emanuel Peter wrote: >> **What this gives you for the debugger** >> - BFS traversal (inputs / outputs) >> - node filtering by category >> - shortest path between nodes >> - all paths between nodes >> - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional) >> - and more >> >> **Some usecases** >> - more readable `dump` >> - follow only nodes of some categories (only control, only data, etc) >> - find which control nodes depend on data node (visit data nodes, include control in boundary) >> - how two nodes relate (shortest / all paths, following input/output nodes, or both) >> - find loops (control / memory / data: call all paths with node as start and target) >> >> **Description** >> I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`). To find all paths between two nodes, include the letter `A` in the options string. >> >> `void Node::dump_bfs(const int max_distance, Node* target, char const* options)` >> >> To get familiar with the many options, run this to get help: >> `find_node(0)->dump_bfs(0,0,"h")` >> >> While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. 
>> >> Please let me know if you would find this helpful, or if you have any feedback to improve it. >> Thanks, Emanuel >> >> PS: I do plan to refactor the `dump` code in `node.cpp` to use my new infrastructure. I will also remove `Node::related` and `dump_related,` since it has not been properly extended and maintained. But that refactoring would risk messing with tools that depend on `dump`, which I would like to avoid for now, and do that in a second step. >> >> **Better dump()** >> The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: >> >> 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. >> 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`. >> 3. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. >> 4. Separate visit / boundary filters by node type: traverse graph visiting only some node types (eg. data). On the boundary, also display but do not traverse nodes allowed by boundary filter (eg. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for visit filter, and `DCMXO` for boundary filter. >> 5. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors. >> 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. 
Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. Use `@` in options string. >> 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string. >> 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string! >> >> Example (BFS inputs): >> >> (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") >> d dump >> --------------------------------------------- >> 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] >> 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> >> >> Example (BFS control inputs): >> >> (rr) p find_node(163)->dump_bfs(5,0,"c+") >> d dump >> --------------------------------------------- >> 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] >> 4 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 1 167 SafePoint === 162 
1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) >> 0 163 OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 >> >> We see the control flow of a strip mined loop. >> >> >> Experiment (BFS only data, but display all nodes on boundary) >> >> (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") >> d dump >> --------------------------------------------- >> 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) >> 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) >> 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) >> 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) >> 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] >> >> We see the dependent output nodes of the data-phi 102, we see that a SafePoint and the Return depend on it. Here colors are really helpful, as it makes it easy to separate the data-nodes (blue) from the boundary-nodes (other colors). 
>> >> Example with Mach nodes: >> >> (rr) p find_node(112)->dump_bfs(2,0,"cdmxo+#@B") >> d [head idom d] old dump >> --------------------------------------------- >> 2 534 505 6 o1871 109 addI_rReg_imm === _ 44 [[ 110 102 113 230 327 ]] #-3/0xfffffffd >> 2 536 537 15 o186 139 addI_rReg_imm === _ 137 [[ 140 137 113 144 ]] #4/0x00000004 !jvms: StringLatin1::replace @ bci:13 (line 303) >> 2 537 538 14 o179 114 IfTrue === 115 [[ 536 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 536 537 15 o739 113 compI_rReg === _ 139 109 [[ 112 ]] >> 1 536 537 15 _ 536 Region === 536 114 [[ 536 112 ]] >> 0 536 537 15 o741 112 jmpLoopEnd === 536 113 [[ 134 111 ]] P=0.993611, C=7200.000000 !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> And the query on the old nodes: >> >> (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#") >> d dump >> --------------------------------------------- >> 2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]] >> 2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]] >> 2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000 >> 1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]] >> 1 o740 Bool === _ o739 [[ o741 ]] [lt] >> 1 o179 IfTrue === o178 [[ o741 ]] #1 >> 0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000 >> >> >> **Exploring loop body** >> When I find myself in a loop, I try to locate the loop head and end, and map out at least one path between them. >> `loop_end->dump_bfs(20, loop_head, "c+")` >> This provides us with a shortest control path, provided that path has a distance of at most 20.
>> >> Example (shortest path over control nodes): >> >> (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+") >> d dump >> --------------------------------------------- >> 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> >> Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lay on a path between the start and target node, with at most the specified path length. 
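One way to picture an all paths query is the following standalone C++ sketch. It is not the code from this patch and makes its own assumptions: the graph is an adjacency map of input edges keyed by node idx, and a node is reported when its shortest distance from the start plus its shortest distance to the target stays within `max_distance`. `Graph`, `bfs_distances`, `reversed` and `all_paths` are illustrative names only.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <vector>

// Node idx -> idxs of the nodes reached by following one edge.
using Graph = std::map<int, std::vector<int>>;

// Plain BFS: shortest edge count from `src` to every reachable node.
std::map<int, int> bfs_distances(const Graph& edges, int src) {
    std::map<int, int> dist;
    dist[src] = 0;
    std::queue<int> pending;
    pending.push(src);
    while (!pending.empty()) {
        int n = pending.front();
        pending.pop();
        auto it = edges.find(n);
        if (it == edges.end()) continue;  // node has no edges of this kind
        int next = dist.at(n) + 1;
        for (int m : it->second) {
            // emplace only succeeds the first time a node is seen
            if (dist.emplace(m, next).second) pending.push(m);
        }
    }
    return dist;
}

// Flip every edge, so that a BFS from the target measures "distance to target".
Graph reversed(const Graph& g) {
    Graph r;
    for (const auto& entry : g) {
        for (int to : entry.second) r[to].push_back(entry.first);
    }
    return r;
}

// Returns idx -> combined distance for every node that lies on some
// start-to-target path of length at most `max_distance`.
std::map<int, int> all_paths(const Graph& inputs, int start, int target,
                             int max_distance) {
    assert(max_distance >= 0);
    const std::map<int, int> from_start = bfs_distances(inputs, start);
    const std::map<int, int> to_target = bfs_distances(reversed(inputs), target);
    std::map<int, int> combined;
    for (const auto& entry : from_start) {
        auto it = to_target.find(entry.first);
        if (it == to_target.end()) continue;  // cannot reach the target
        int total = entry.second + it->second;
        if (total <= max_distance) combined[entry.first] = total;
    }
    return combined;
}
```

Two breadth-first passes are enough in this model; no path enumeration is needed, which keeps the cost linear in the size of the visited subgraph.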
>> >> Example (all paths between two nodes): >> >> (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") >> d apd dump >> --------------------------------------------- >> 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) >> 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) >> 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) >> 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) >> 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) >> 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all paths distance `apd` is the sum of the shortest path from the current node to the start plus the shortest path to the target node. Thus, we can easily compute the distance to the target node with `apd - d`. >> >> An alternative way to detect loops quickly is to run an all paths query from a node to itself: >> >> Example (loop detection with all paths): >> >> (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A") >> d apd dump >> --------------------------------------------- >> 6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303) >> 5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> We get the loop control, plus the loop-back `190 IfTrue`. >> >> Example (loop detection with all paths for phi): >> >> (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A") >> d apd dump >> --------------------------------------------- >> 1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],...
!jvms: StringLatin1::replace @ bci:13 (line 303) >> 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) >> >> >> **Color examples** >> Colors are especially useful to see chains between nodes (options character `#`). >> The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. >> Tip: it can be worth it to configure the colors of your terminal to be more appealing. >> >> Example (find control dependency of data node): >> ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) >> We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. >> >> Example (find memory dependency of data node): >> ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) >> >> Example (loop detection): >> ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) >> We find the control and some data loop paths. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Apply suggestions from code review from @TobiHartmann > > Thank you @TobiHartmann > > Co-authored-by: Tobias Hartmann Thanks for adding this functionality, Emanuel! I tried it out and works fine for my regular use cases. The code itself looks good, too. I have a few minor comments, please take them as suggestions only. src/hotspot/share/opto/node.cpp line 1997: > 1995: tty->print(" H: display this help info, with examples\n"); > 1996: tty->print(" +: traverse in-edges (on if neither + nor -)\n"); > 1997: tty->print(" -: traverse out-edges\n"); I would find it more natural if out-edges were also traversed by default, but it is a matter of taste. 
src/hotspot/share/opto/node.cpp line 2240: > 2238: tty->print(" _"); > 2239: } else { > 2240: print_node_idx(b->head()); I think it would also be useful to print the block identifier, i.e. `print("B%d", b->_pre_order)`. src/hotspot/share/opto/node.cpp line 2278: > 2276: } > 2277: if (_print_blocks) { > 2278: tty->print(" [head idom d]"); // block Perhaps rename the third block column `d` to something more descriptive (`depth` or similar) for clarity, that would also break the ambiguity with `d` which is already used for "distance". src/hotspot/share/opto/node.cpp line 2310: > 2308: // To find all options, run: > 2309: // find_node(0)->dump_bfs(0,0,"H") > 2310: void Node::dump_bfs(const int max_distance, Node* target, char const* options) { Would be great to have a short version `Node::dump_bfs(n)` equivalent to `Node::dump_bfs(n, 0, 0)`, for convenience.
src/hotspot/share/opto/node.hpp line 1193: > 1191: Node* find(int idx, bool only_ctrl = false); // Search the graph for the given idx. > 1192: Node* find_ctrl(int idx); // Search control ancestors for the given idx. > 1193: void dump_bfs(const int max_distance, Node* target, char const* options); // Print BFS traversal Suggestion: `char const* options` -> `const char* options` (same for the other occurrences in the changeset). ------------- Marked as reviewed by rcastanedalo (Committer). PR: https://git.openjdk.java.net/jdk/pull/8468 From xliu at openjdk.java.net Tue Jun 7 17:15:59 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Tue, 7 Jun 2022 17:15:59 GMT Subject: Integrated: 8287840: Dead copy region node blocks IfNode's fold-compares In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 01:15:35 GMT, Xin Liu wrote: > IfNode::fold_compares() requires ctrl has a single output. I found some fold-compares case postpone to IterGVN2. The reason is that a dead region prevents IfNode::fold_compares() from transforming code. The dead node is removed in IterGVN, but it's too late. > > This PR extends Node::has_special_unique_user() so `PhaseIterGVN::remove_globally_dead_node()` puts IfNode back to worklist. The following attempt will carry out fold-compares(). This pull request has now been integrated.
Changeset: 3da7e393 Author: Xin Liu URL: https://git.openjdk.java.net/jdk/commit/3da7e393ee4b45c40b8bb132dd09f5a6ba306116 Stats: 4 lines in 1 file changed: 3 ins; 0 del; 1 mod 8287840: Dead copy region node blocks IfNode's fold-compares Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/9035 From dcubed at openjdk.java.net Tue Jun 7 17:29:11 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Tue, 7 Jun 2022 17:29:11 GMT Subject: Integrated: 8287919: ProblemList java/lang/CompressExpandTest.java Message-ID: A trivial fix to ProblemList java/lang/CompressExpandTest.java. ------------- Commit messages: - 8287919: ProblemList java/lang/CompressExpandTest.java Changes: https://git.openjdk.java.net/jdk/pull/9069/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9069&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287919 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/9069.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9069/head:pull/9069 PR: https://git.openjdk.java.net/jdk/pull/9069 From azvegint at openjdk.java.net Tue Jun 7 17:29:12 2022 From: azvegint at openjdk.java.net (Alexander Zvegintsev) Date: Tue, 7 Jun 2022 17:29:12 GMT Subject: Integrated: 8287919: ProblemList java/lang/CompressExpandTest.java In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 17:17:23 GMT, Daniel D. Daugherty wrote: > A trivial fix to ProblemList java/lang/CompressExpandTest.java. Marked as reviewed by azvegint (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/9069 From dcubed at openjdk.java.net Tue Jun 7 17:29:14 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Tue, 7 Jun 2022 17:29:14 GMT Subject: Integrated: 8287919: ProblemList java/lang/CompressExpandTest.java In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 17:18:33 GMT, Alexander Zvegintsev wrote: >> A trivial fix to ProblemList java/lang/CompressExpandTest.java. 
> > Marked as reviewed by azvegint (Reviewer). @azvegint - Thanks for the fast review! ------------- PR: https://git.openjdk.java.net/jdk/pull/9069 From dcubed at openjdk.java.net Tue Jun 7 17:29:15 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Tue, 7 Jun 2022 17:29:15 GMT Subject: Integrated: 8287919: ProblemList java/lang/CompressExpandTest.java In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 17:17:23 GMT, Daniel D. Daugherty wrote: > A trivial fix to ProblemList java/lang/CompressExpandTest.java. This pull request has now been integrated. Changeset: 91e6bf67 Author: Daniel D. Daugherty URL: https://git.openjdk.java.net/jdk/commit/91e6bf6791b7fc26db6f4288830091d812232dd8 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8287919: ProblemList java/lang/CompressExpandTest.java Reviewed-by: azvegint ------------- PR: https://git.openjdk.java.net/jdk/pull/9069 From kvn at openjdk.java.net Tue Jun 7 17:35:08 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 7 Jun 2022 17:35:08 GMT Subject: RFR: 8285965: TestScenarios.java does not check for "" correctly In-Reply-To: References: Message-ID: On Wed, 11 May 2022 06:13:12 GMT, Christian Hagedorn wrote: > This is another rare occurrence of `` that is not handled correctly by `TestScenarios.java`. > > We wrongly search this safepoint message in the test VM output with `getTestVMOutput()`: > > https://github.com/openjdk/jdk/blob/9c2548414c71b4caaad6ad9e1b122f474e705300/test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/Utils.java#L44-L53 > > But this does not help since the IR matcher is parsing the `hotspot_pid` file for IR matching and not the test VM output. We could therefore find this safepoint message in the `hotspot_pid` file and bail out of IR matching while the test VM output does not contain it. This lets `TestScenarios.java` fail.
> > The fix we did for other IR framework tests is to redirect the output of the JTreg test VM itself to a stream in order to search it for ``. We are dumping this message as part of a warning when the IR matcher bails out: > > https://github.com/openjdk/jdk/blob/9c2548414c71b4caaad6ad9e1b122f474e705300/test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/IRMatcher.java#L86-L96 > > Output for the reported failure: > > Scenario #3 - [-XX:TLABRefillWasteFraction=53]: > [...] > Found , bail out of IR matching > > > I suggest to use the same fix for `TestScenarios`. > > Thanks, > Christian Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8647 From kvn at openjdk.java.net Tue Jun 7 18:14:19 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 7 Jun 2022 18:14:19 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long In-Reply-To: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Tue, 7 Jun 2022 17:14:18 GMT, Quan Anh Mai wrote: > Hi, > > This patch implements intrinsics for `Integer/Long::compareUnsigned` using the same approach as the JVM does for long and floating-point comparisons. This allows efficient and reliable usage of unsigned comparison in Java, which is a basic operation and is important for range checks such as discussed in #8620. > > Thank you very much. Please add a microbenchmark and show its results. src/hotspot/share/opto/subnode.hpp line 217: > 215: //------------------------------CmpU3Node-------------------------------------- > 216: // Compare 2 unsigned values, returning integer value (-1, 0 or 1). > 217: class CmpU3Node : public CmpUNode { Place it after the `CmpUNode` class.
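As background for the `CmpU3Node` comment quoted above ("Compare 2 unsigned values, returning integer value (-1, 0 or 1)"), here is a small standalone C++ sketch of that three-way result and of the range-check use mentioned in the PR description. It only illustrates the semantics and says nothing about how the intrinsic is implemented; `compare_unsigned` and `in_range` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Three-way compare of two 32-bit values interpreted as unsigned.
// Returns -1 if x < y, 0 if x == y, 1 if x > y (all in unsigned order).
int compare_unsigned(std::int32_t x, std::int32_t y) {
    const std::uint32_t ux = static_cast<std::uint32_t>(x);
    const std::uint32_t uy = static_cast<std::uint32_t>(y);
    return (ux > uy) - (ux < uy);
}

// Range check with a single unsigned comparison: for a non-negative
// length, "0 <= index && index < length" holds exactly when index is
// below length in unsigned order, because a negative index turns into
// a very large unsigned value.
bool in_range(std::int32_t index, std::int32_t length) {
    assert(length >= 0);
    return compare_unsigned(index, length) < 0;
}
```

For example, `in_range(-1, 10)` is false because `-1` reinterpreted as unsigned is larger than any non-negative `length`.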
------------- PR: https://git.openjdk.java.net/jdk/pull/9068 From duke at openjdk.java.net Tue Jun 7 21:02:02 2022 From: duke at openjdk.java.net (Devin Smith) Date: Tue, 7 Jun 2022 21:02:02 GMT Subject: RFR: 8287432: C2: assert(tn->in(0) != __null) failed: must have live top node [v2] In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 16:56:15 GMT, Christian Hagedorn wrote: >> When intrinsifying `java.lang.Thread::currentThread()`, we are creating an `AddP` node that has the `top` node as base to indicate that we do not have an oop (using `NULL` instead leads to crashes as it does not seem to be expected to have a `NULL` base): >> https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/library_call.cpp#L904 >> >> This node is used on a chain of data nodes into two `MemBarAcquire` nodes as precedence edge in the test case: >> ![Screenshot from 2022-06-07 11-12-38](https://user-images.githubusercontent.com/17833009/172344751-5338b72f-baa5-4e9e-a44c-6d970798d9f2.png) >> >> Later, in `final_graph_reshaping_impl()`, we are removing the precedence edge of both `MemBarAcquire` nodes and clean up all now dead nodes as a result of the removal: >> https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/compile.cpp#L3655-L3679 >> >> We iteratively call `disconnect_inputs()` for all nodes that have no output anymore (i.e. dead nodes). This code, however, also treats the `top` node as dead since `outcnt()` of `top` is always zero: >> https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/node.hpp#L495-L500 >> >> And we end up disconnecting `top` which results in the assertion failure. >> >> The code misses a check for `top()`. I suggest to add this check before processing a node for which `outcnt()` is zero. This is a pattern which can also be found in other places in the code.
I've checked all other usages of `outcnt() == 0` and could not find a case where this additional `top()` check is missing. Maybe we should refactor these two checks into a single method at some point to not need to worry about `top` anymore in the future when checking if a node is dead based on the outputs. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/compiler/c2/TestRemoveMemBarPrecEdge.java > > Co-authored-by: Tobias Hartmann I'm not qualified to review this, but I can confirm the test is triggering the `PhaseAggressiveCoalesce::coalesce` SIGSEGV against the 11/17 LTS versions I'm running. Thanks for your investigation and fix! ------------- PR: https://git.openjdk.java.net/jdk/pull/9060 From duke at openjdk.java.net Tue Jun 7 21:22:39 2022 From: duke at openjdk.java.net (Yi-Fan Tsai) Date: Tue, 7 Jun 2022 21:22:39 GMT Subject: RFR: 8263377: Store method handle linkers in the 'non-nmethods' heap In-Reply-To: References: Message-ID: On Sat, 4 Jun 2022 01:43:08 GMT, Dean Long wrote: > if it wouldn't be better to use a subclass of CompiledMethod An earlier [commit](https://github.com/openjdk/jdk/compare/994f2e92...yftsai:127609e3) tried the idea. It implemented many functions that are irrelevant to MH intrinsics. ------------- PR: https://git.openjdk.java.net/jdk/pull/8760 From duke at openjdk.java.net Tue Jun 7 23:25:34 2022 From: duke at openjdk.java.net (Yi-Fan Tsai) Date: Tue, 7 Jun 2022 23:25:34 GMT Subject: RFR: 8263377: Store method handle linkers in the 'non-nmethods' heap In-Reply-To: References: Message-ID: On Sat, 4 Jun 2022 01:23:58 GMT, Dean Long wrote: > Can't we use the existing AdapterBlob or MethodHandlesAdapterBlob? An MH intrinsic is handled differently from them in SharedRuntime::continuation_for_implicit_exception and compiledIC.
The extra field _method is used in places like trace_exception unless this information is not important. ------------- PR: https://git.openjdk.java.net/jdk/pull/8760 From kvn at openjdk.java.net Tue Jun 7 23:26:35 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 7 Jun 2022 23:26:35 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v2] In-Reply-To: References: <54Cx68cjFE-RfvwVJB92DhENPyRIwzhi3jfyG5ZGPSg=.563519e8-7880-4754-933c-78d66affabef@github.com> <2ZEEJJQuJDrG1UuL6IOMr5nvCm1DCs2PLPp4y0Dpqag=.0d3588d9-6f49-43c6-bfbf-62cdb239450f@github.com> Message-ID: On Thu, 2 Jun 2022 17:44:54 GMT, Sandhya Viswanathan wrote: >>> Hi @sviswa7 , #7806 implemented an interface for auto-vectorization to disable some unprofitable cases on aarch64. Can it also be applied to your case? >> >> Maybe. But it would require more careful changes. And that changeset is not integrated yet. >> Current changes are clean and serve their purpose good. >> >> And, as Jatin and Sandhya said, we may do proper fix after JDK 19 fork. Then we can look on your proposal. > > @vnkozlov @jatin-bhateja Your review comments are implemented. Please take a look. @sviswa7, please, file separate RFE for SPECjvm2008-SOR.small issue (different unrolling factor). I got all performance data from our and yours data and I think these change are ready for integration. Thanks! 
------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From kvn at openjdk.java.net Tue Jun 7 23:38:32 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 7 Jun 2022 23:38:32 GMT Subject: RFR: 8287700: C2 Crash running eclipse benchmark from Dacapo In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 07:48:42 GMT, Roland Westrelin wrote: > With 8275201 (C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses), I made the following change to escape.cpp: > > > @@ -2628,7 +2632,7 @@ bool ConnectionGraph::split_AddP(Node *addp, Node *base) { > // this code branch will go away. > // > if (!t->is_known_instance() && > - !base_t->klass()->is_subtype_of(t->klass())) { > + !t->maybe_java_subtype_of(base_t)) { > return false; // bail out > } > const TypeOopPtr *tinst = base_t->add_offset(t->offset())->is_oopptr(); > @@ -3312,7 +3316,7 @@ void ConnectionGraph::split_unique_types(GrowableArray &alloc_worklist, > } else { > tn_t = tn_type->isa_oopptr(); > } > - if (tn_t != NULL && tinst->klass()->is_subtype_of(tn_t->klass())) { > + if (tn_t != NULL && tn_t->maybe_java_subtype_of(tinst)) { > if (tn_type->isa_narrowoop()) { > tn_type = tinst->make_narrowoop(); > } else { > @@ -3325,7 +3329,7 @@ void ConnectionGraph::split_unique_types(GrowableArray &alloc_worklist, > record_for_optimizer(n); > } else { > assert(tn_type == TypePtr::NULL_PTR || > - tn_t != NULL && !tinst->klass()->is_subtype_of(tn_t->klass()), > + tn_t != NULL && !tinst->is_java_subtype_of(tn_t), > "unexpected type"); > continue; // Skip dead path with different type > } > > > Where I inverted the subtype and supertype in a subtype check (that is `tn_t->maybe_java_subtype_of(tinst)` when it was `tinst->klass()->is_subtype_of(tn_t->klass())`) in 2 places for no good reason AFAICT now. The assert used to also test the same condition as the if above but I changed that by mistake. This fix addresses both issues. Good. Testing submitted by Tobias passed.
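As an aside for readers, the bug described above comes down to the fact that a subtype check is not symmetric: swapping receiver and argument asks a different question. The sketch below illustrates this with plain Java reflection. It is only an analogy and makes no claim about C2's type lattice; in particular the weaker "maybe" semantics of `maybe_java_subtype_of` are not modeled, and the class `SubtypeDirectionSketch` and the helper `isSubtypeOf` are hypothetical names.

```java
public class SubtypeDirectionSketch {
    // Hypothetical helper answering "is sub a subtype of sup?".
    // Class.isAssignableFrom is called on the supertype with the subtype as argument.
    static boolean isSubtypeOf(Class<?> sub, Class<?> sup) {
        return sup.isAssignableFrom(sub);
    }

    public static void main(String[] args) {
        // Integer extends Number, so the check holds in one direction only.
        System.out.println(isSubtypeOf(Integer.class, Number.class)); // true
        System.out.println(isSubtypeOf(Number.class, Integer.class)); // false
    }
}
```

With the arguments inverted, the check silently returns the opposite answer for any pair of distinct related types, which matches the kind of mistake the description reports.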
------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/9054 From duke at openjdk.java.net Tue Jun 7 23:48:36 2022 From: duke at openjdk.java.net (Yi-Fan Tsai) Date: Tue, 7 Jun 2022 23:48:36 GMT Subject: RFR: 8263377: Store method handle linkers in the 'non-nmethods' heap [v2] In-Reply-To: References: Message-ID: On Fri, 3 Jun 2022 17:42:00 GMT, Jorn Vernee wrote: >> Yi-Fan Tsai has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove dead codes >> >> remove unused argument of NativeJump::check_verified_entry_alignment >> remove unused argument of NativeJump::patch_verified_entry >> remove dead codes in SharedRuntime::generate_method_handle_intrinsic_wrapper > > src/hotspot/share/ci/ciMethod.cpp line 1131: > >> 1129: CompiledMethod* cm = (code == nullptr) ? nullptr : code->as_compiled_method_or_null(); >> 1130: if (cm != NULL && (cm->comp_level() == CompLevel_full_optimization)) { >> 1131: _instructions_size = cm->insts_end() - cm->verified_entry_point(); > > So, the old code only sets `_instructions_size` if `comp_level` is `CompLevel_full_optimization`. Since the old shared runtime code used `new_native_nmethod`, which ends up setting comp_level to `CompLevel_none`, this branch is also not being taken in the current code? > > In that case, this looks good. Correct, comp_level of MH intrinsics was set to `CompLevel_none`. > src/hotspot/share/code/codeBlob.cpp line 354: > >> 352: if (mhi != NULL) { >> 353: debug_only(mhi->verify();) // might block >> 354: } > > This is debug only. Looking at CodeCache::allocate, it can only return `NULL` if the allocation size is `<= 0`, in which case an earlier assert will already fire. So, this null check doesn't seem needed? > Suggestion: > > debug_only(mhi->verify();) // might block This seems needed. CodeCache::allocate may return `NULL` if the code cache is full.
> src/hotspot/share/code/codeBlob.cpp line 360: > >> 358: void MethodHandleIntrinsicBlob::verify() { >> 359: // Make sure all the entry points are correctly aligned for patching. >> 360: NativeJump::check_verified_entry_alignment(code_begin(), code_begin()); > > This method only seems implemented on x86, which ignores the first argument. Maybe it's a good opportunity to clean up the first argument? Sure. Same for NativeJump::patch_verified_entry. > src/hotspot/share/compiler/compileBroker.hpp line 307: > >> 305: TRAPS); >> 306: >> 307: static CodeBlob* compile_method(const methodHandle& method, > > Not so sure about these changes. It seems to me that if a method is requested to be compiled, it should always result in an nmethod. > > Alternatively, would it be possible to keep these functions returning an `nmethod` but add an assert at the start to check that the passed `method` is not a method handle intrinsic? The implementation assumed that MH intrinsics are possible input. Once the assertion is added, two conditions could be simplified. > src/hotspot/share/oops/method.cpp line 1304: > >> 1302: assert(blob->is_mh_intrinsic(), "must be MH intrinsic"); >> 1303: MethodHandleIntrinsicBlob* mhi = blob->as_mh_intrinsic(); >> 1304: return (mhi->method() == nullptr) || (mhi->method() == this); > > The assert looks redundant, since the cast on the next line already checks it. > > Also, can `mhi->method()` really be `nullptr` here? No, it shouldn't be. Removed. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8760 From duke at openjdk.java.net Tue Jun 7 23:48:39 2022 From: duke at openjdk.java.net (Yi-Fan Tsai) Date: Tue, 7 Jun 2022 23:48:39 GMT Subject: RFR: 8263377: Store method handle linkers in the 'non-nmethods' heap [v2] In-Reply-To: <1id0AtAL5Ux5ZhRL2HkgiOJ0amJycNFhUAnR9fsVbTA=.af3bca67-32e6-4e5f-b9ea-1a1825fec65d@github.com> References: <1id0AtAL5Ux5ZhRL2HkgiOJ0amJycNFhUAnR9fsVbTA=.af3bca67-32e6-4e5f-b9ea-1a1825fec65d@github.com> Message-ID: On Sat, 4 Jun 2022 01:11:10 GMT, Dean Long wrote: >> src/hotspot/share/code/codeBlob.cpp line 342: >> >>> 340: MethodHandleIntrinsicBlob* MethodHandleIntrinsicBlob::create(const methodHandle& method, >>> 341: CodeBuffer *code_buffer) { >>> 342: code_buffer->finalize_oop_references(method); >> >> Can you please explain why this is needed? (I'm a bit surprised since the constructor asserts that `total_oop_size` is 0) >> >> Thanks. > > I don't think finalize_oop_references() makes sense except for nmethods. If the MH intrinsic could contain oops, then GC would need to be able to find and relocate them. It is not needed. Removed. ------------- PR: https://git.openjdk.java.net/jdk/pull/8760 From xliu at openjdk.java.net Wed Jun 8 00:34:31 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Wed, 8 Jun 2022 00:34:31 GMT Subject: RFR: 8287700: C2 Crash running eclipse benchmark from Dacapo In-Reply-To: References: Message-ID: <7fxq_JmjtpF9RtxodZIKhG71fEBMjE4htKBvbst28pA=.dde14aef-bde5-4c13-8e97-5612d3b8eb8f@github.com> On Tue, 7 Jun 2022 07:48:42 GMT, Roland Westrelin wrote: > With 8275201 (C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses), I made the following change to escape.cpp: > > > @@ -2628,7 +2632,7 @@ bool ConnectionGraph::split_AddP(Node *addp, Node *base) { > // this code branch will go away.
> // > if (!t->is_known_instance() && > - !base_t->klass()->is_subtype_of(t->klass())) { > + !t->maybe_java_subtype_of(base_t)) { > return false; // bail out > } > const TypeOopPtr *tinst = base_t->add_offset(t->offset())->is_oopptr(); > @@ -3312,7 +3316,7 @@ void ConnectionGraph::split_unique_types(GrowableArray &alloc_worklist, > } else { > tn_t = tn_type->isa_oopptr(); > } > - if (tn_t != NULL && tinst->klass()->is_subtype_of(tn_t->klass())) { > + if (tn_t != NULL && tn_t->maybe_java_subtype_of(tinst)) { > if (tn_type->isa_narrowoop()) { > tn_type = tinst->make_narrowoop(); > } else { > @@ -3325,7 +3329,7 @@ void ConnectionGraph::split_unique_types(GrowableArray &alloc_worklist, > record_for_optimizer(n); > } else { > assert(tn_type == TypePtr::NULL_PTR || > - tn_t != NULL && !tinst->klass()->is_subtype_of(tn_t->klass()), > + tn_t != NULL && !tinst->is_java_subtype_of(tn_t), > "unexpected type"); > continue; // Skip dead path with different type > } > > > Where I inverted the subtype and supertype in a subtype check (that is `tn_t->maybe_java_subtype_of(tinst)` when it was `tinst->klass()->is_subtype_of(tn_t->klass())`) in 2 places for no good reason AFAICT now. The assert used to also test the same condition as the if above but I changed that by mistake. This fix addresses both issues. LGTM. ------------- Marked as reviewed by xliu (Committer).
PR: https://git.openjdk.java.net/jdk/pull/9054 From sviswanathan at openjdk.java.net Wed Jun 8 01:06:40 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 8 Jun 2022 01:06:40 GMT Subject: RFR: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake [v2] In-Reply-To: References: <54Cx68cjFE-RfvwVJB92DhENPyRIwzhi3jfyG5ZGPSg=.563519e8-7880-4754-933c-78d66affabef@github.com> <2ZEEJJQuJDrG1UuL6IOMr5nvCm1DCs2PLPp4y0Dpqag=.0d3588d9-6f49-43c6-bfbf-62cdb239450f@github.com> Message-ID: On Tue, 7 Jun 2022 23:22:42 GMT, Vladimir Kozlov wrote: >> @vnkozlov @jatin-bhateja Your review comments are implemented. Please take a look. > > @sviswa7, please, file separate RFE for SPECjvm2008-SOR.small issue (different unrolling factor). > > I got all performance data from our and yours data and I think these change are ready for integration. Thanks! @vnkozlov Thanks a lot. I have filed the RFE: https://bugs.openjdk.org/browse/JDK-8287966. ------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From sviswanathan at openjdk.java.net Wed Jun 8 01:07:53 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 8 Jun 2022 01:07:53 GMT Subject: Integrated: 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake In-Reply-To: References: Message-ID: On Wed, 25 May 2022 01:48:16 GMT, Sandhya Viswanathan wrote: > We observe ~20% regression in SPECjvm2008 mpegaudio sub benchmark on Cascade Lake with Default vs -XX:UseAVX=2. > The performance of all the other non-startup sub benchmarks of SPECjvm2008 is within +/- 5%. > The performance regression is due to auto-vectorization of small loops. > We don't have AVX3Threshold consideration in auto-vectorization. > The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors. > > This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake.
Users can override this by either setting -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on JVM command line. > > Please review. > > Best Regard, > Sandhya This pull request has now been integrated. Changeset: 45f1b72a Author: Sandhya Viswanathan URL: https://git.openjdk.java.net/jdk/commit/45f1b72a6ee5b86923c3217f101a90851c30401f Stats: 48 lines in 6 files changed: 42 ins; 0 del; 6 mod 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake Reviewed-by: kvn, jbhateja ------------- PR: https://git.openjdk.java.net/jdk/pull/8877 From kvn at openjdk.java.net Wed Jun 8 02:29:43 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 8 Jun 2022 02:29:43 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 09:42:02 GMT, Xiaohong Gong wrote: > VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. > > For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. > > Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. 
> > Here is an example for vector load and add reduction inside a loop: > > ptrue p0.s, vl8 ; mask generation > ld1w {z16.s}, p0/z, [x14] ; load vector > > ptrue p0.s, vl8 ; mask generation > uaddv d17, p0, z16.s ; add reduction > smov x14, v17.s[0] > > As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. > > Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. > > Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: > > Benchmark size Gain > Byte256Vector.ADDLanes 1024 0.999 > Byte256Vector.ANDLanes 1024 1.065 > Byte256Vector.MAXLanes 1024 1.064 > Byte256Vector.MINLanes 1024 1.062 > Byte256Vector.ORLanes 1024 1.072 > Byte256Vector.XORLanes 1024 1.041 > Short256Vector.ADDLanes 1024 1.017 > Short256Vector.ANDLanes 1024 1.044 > Short256Vector.MAXLanes 1024 1.049 > Short256Vector.MINLanes 1024 1.049 > Short256Vector.ORLanes 1024 1.089 > Short256Vector.XORLanes 1024 1.047 > Int256Vector.ADDLanes 1024 1.045 > Int256Vector.ANDLanes 1024 1.078 > Int256Vector.MAXLanes 1024 1.123 > Int256Vector.MINLanes 1024 1.129 > Int256Vector.ORLanes 1024 1.078 > Int256Vector.XORLanes 1024 1.072 > Long256Vector.ADDLanes 1024 1.059 > Long256Vector.ANDLanes 1024 1.101 > Long256Vector.MAXLanes 1024 1.079 > Long256Vector.MINLanes 1024 1.099 > Long256Vector.ORLanes 1024 1.098 > Long256Vector.XORLanes 1024 1.110 > Float256Vector.ADDLanes 1024 1.033 > Float256Vector.MAXLanes 1024 1.156 > Float256Vector.MINLanes 1024 1.151 > Double256Vector.ADDLanes 1024 1.062 > Double256Vector.MAXLanes 1024 1.145 > Double256Vector.MINLanes 1024 1.140 > > This patch also adds 32-bit 
variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: > > sxtw x14, w14 > whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 src/hotspot/share/opto/matcher.cpp line 2255: > 2253: case Op_FmaVF: > 2254: case Op_MacroLogicV: > 2255: case Op_LoadVectorMasked: Why is it removed? src/hotspot/share/opto/vectornode.cpp line 868: > 866: default: > 867: node->add_req(mask); > 868: node->add_flag(Node::Flag_is_predicated_vector); Add an assert that only VectorMaskOpNode and ReductionNode are expected here. src/hotspot/share/opto/vectornode.cpp line 951: > 949: > 950: Node* LoadVectorNode::Ideal(PhaseGVN* phase, bool can_reshape) { > 951: const TypeVect* vt = as_LoadVector()->vect_type(); Why do you need `as_LoadVector()` for `this`? Same in `StoreVectorNode::Ideal()`. src/hotspot/share/opto/vectornode.cpp line 988: > 986: } > 987: } > 988: return LoadNode::Ideal(phase, can_reshape); Should this call `LoadVectorNode::Ideal`? I understand you did optimization because `vector_needs_partial_operations` is false for `LoadVectorMaskedNode` in aarch64 case. But what if it is different on some other (not current) platform? src/hotspot/share/opto/vectornode.cpp line 1008: > 1006: } > 1007: } > 1008: return StoreNode::Ideal(phase, can_reshape); Should this call `StoreVectorNode::Ideal`? src/hotspot/share/opto/vectornode.cpp line 1821: > 1819: // Transform (MaskAll m1 (VectorMaskGen len)) ==> (VectorMaskGen len) > 1820: // if the vector length in bytes is lower than the MaxVectorSize. > 1821: if (is_con_M1(in(1)) && length_in_bytes() < MaxVectorSize) { Due to #8877 such a length check may not be correct here. And I don't see `in(2)->Opcode() == Op_VectorMaskGen` check.
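For readers less familiar with the idea of a partial vector operation discussed in this review, the following scalar Java sketch models it under stated assumptions: a 512-bit register holding 16 `int` lanes (`MAX_LANES`) and a 256-bit species using only the low 8 lanes (`SPECIES_LANES`). It does not use the Vector API or SVE; the class `PartialVectorMaskSketch` and its methods are hypothetical and only imitate what a `ptrue p0.s, vl8` style mask does for a masked add reduction, including generating the loop-invariant mask once outside the loop.

```java
public class PartialVectorMaskSketch {
    static final int MAX_LANES = 16;     // assumption: 512-bit register, 32-bit lanes
    static final int SPECIES_LANES = 8;  // assumption: 256-bit species

    // Scalar stand-in for mask generation: lane i is active iff i < activeLanes.
    static boolean[] maskGen(int activeLanes) {
        boolean[] mask = new boolean[MAX_LANES];
        for (int i = 0; i < MAX_LANES; i++) {
            mask[i] = i < activeLanes;
        }
        return mask;
    }

    // Masked add reduction over one register's worth of lanes starting at offset.
    // Inactive lanes are never read, so they cannot affect the result.
    static int maskedAddReduce(int[] data, int offset, boolean[] mask) {
        int sum = 0;
        for (int lane = 0; lane < MAX_LANES; lane++) {
            if (mask[lane]) {
                sum += data[offset + lane];
            }
        }
        return sum;
    }

    // Sums full chunks of SPECIES_LANES elements; a tail shorter than one chunk is ignored.
    static int sum(int[] data) {
        boolean[] mask = maskGen(SPECIES_LANES); // loop invariant: generated once
        int total = 0;
        for (int i = 0; i + SPECIES_LANES <= data.length; i += SPECIES_LANES) {
            total += maskedAddReduce(data, i, mask);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = new int[32];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(sum(data)); // 528, the sum of 1..32
    }
}
```

Computing the mask once in `sum` rather than inside `maskedAddReduce` mirrors, very loosely, the duplicate `ptrue` removal and hoisting that the PR description attributes to GVN and loop-invariant code motion.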
------------- PR: https://git.openjdk.java.net/jdk/pull/9037 From xgong at openjdk.java.net Wed Jun 8 02:37:35 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Wed, 8 Jun 2022 02:37:35 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 01:55:12 GMT, Vladimir Kozlov wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. >> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. 
>> >> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. >> >> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: >> >> Benchmark size Gain >> Byte256Vector.ADDLanes 1024 0.999 >> Byte256Vector.ANDLanes 1024 1.065 >> Byte256Vector.MAXLanes 1024 1.064 >> Byte256Vector.MINLanes 1024 1.062 >> Byte256Vector.ORLanes 1024 1.072 >> Byte256Vector.XORLanes 1024 1.041 >> Short256Vector.ADDLanes 1024 1.017 >> Short256Vector.ANDLanes 1024 1.044 >> Short256Vector.MAXLanes 1024 1.049 >> Short256Vector.MINLanes 1024 1.049 >> Short256Vector.ORLanes 1024 1.089 >> Short256Vector.XORLanes 1024 1.047 >> Int256Vector.ADDLanes 1024 1.045 >> Int256Vector.ANDLanes 1024 1.078 >> Int256Vector.MAXLanes 1024 1.123 >> Int256Vector.MINLanes 1024 1.129 >> Int256Vector.ORLanes 1024 1.078 >> Int256Vector.XORLanes 1024 1.072 >> Long256Vector.ADDLanes 1024 1.059 >> Long256Vector.ANDLanes 1024 1.101 >> Long256Vector.MAXLanes 1024 1.079 >> Long256Vector.MINLanes 1024 1.099 >> Long256Vector.ORLanes 1024 1.098 >> Long256Vector.XORLanes 1024 1.110 >> Float256Vector.ADDLanes 1024 1.033 >> Float256Vector.MAXLanes 1024 1.156 >> Float256Vector.MINLanes 1024 1.151 >> Double256Vector.ADDLanes 1024 1.062 >> Double256Vector.MAXLanes 1024 1.145 >> Double256Vector.MINLanes 1024 1.140 >> >> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: >> >> sxtw x14, w14 >> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > > src/hotspot/share/opto/matcher.cpp line 2255: > >> 2253: case Op_FmaVF: >> 2254: case Op_MacroLogicV: >> 2255: case Op_LoadVectorMasked: > > Why it is removed? Thanks for looking at this PR! 
This is removed since we added two combined rules in the aarch64_sve.ad (line-2383) like: match(Set dst (VectorLoadMask (LoadVectorMasked mem pg))); Setting `Op_LoadVectorMasked` as a root in matcher will make such rules not be matched. I'm also curious why this op is added here. Does it have any influence if I remove it? Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/9037 From xgong at openjdk.java.net Wed Jun 8 03:00:37 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Wed, 8 Jun 2022 03:00:37 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 02:25:48 GMT, Vladimir Kozlov wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. 
>> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. >> >> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. >> >> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: >> >> Benchmark size Gain >> Byte256Vector.ADDLanes 1024 0.999 >> Byte256Vector.ANDLanes 1024 1.065 >> Byte256Vector.MAXLanes 1024 1.064 >> Byte256Vector.MINLanes 1024 1.062 >> Byte256Vector.ORLanes 1024 1.072 >> Byte256Vector.XORLanes 1024 1.041 >> Short256Vector.ADDLanes 1024 1.017 >> Short256Vector.ANDLanes 1024 1.044 >> Short256Vector.MAXLanes 1024 1.049 >> Short256Vector.MINLanes 1024 1.049 >> Short256Vector.ORLanes 1024 1.089 >> Short256Vector.XORLanes 1024 1.047 >> Int256Vector.ADDLanes 1024 1.045 >> Int256Vector.ANDLanes 1024 1.078 >> Int256Vector.MAXLanes 1024 1.123 >> Int256Vector.MINLanes 1024 1.129 >> Int256Vector.ORLanes 1024 1.078 >> Int256Vector.XORLanes 1024 1.072 >> Long256Vector.ADDLanes 1024 1.059 >> Long256Vector.ANDLanes 1024 1.101 >> Long256Vector.MAXLanes 1024 1.079 >> Long256Vector.MINLanes 1024 1.099 >> Long256Vector.ORLanes 1024 1.098 >> Long256Vector.XORLanes 1024 1.110 >> Float256Vector.ADDLanes 1024 1.033 >> Float256Vector.MAXLanes 1024 1.156 >> Float256Vector.MINLanes 1024 1.151 >> Double256Vector.ADDLanes 1024 1.062 >> Double256Vector.MAXLanes 1024 1.145 >> 
Double256Vector.MINLanes 1024 1.140 >> >> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: >> >> sxtw x14, w14 >> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > > src/hotspot/share/opto/vectornode.cpp line 868: > >> 866: default: >> 867: node->add_req(mask); >> 868: node->add_flag(Node::Flag_is_predicated_vector); > > Add an assert that only VectorMaskOpNode and ReductionNode are expected here. Other vector nodes such as `VectorMaskCmp`, `MaskAll` and `VectorLoadMask` also need to append the mask here. Actually most masked vector nodes accept the mask input except for the load/store/gather/scatter. In the future, we may extend this to other normal vector nodes whose vector length is full-size rather than partial, since SVE always needs a predicate for most instructions. So the default path will be used for most vector nodes. > src/hotspot/share/opto/vectornode.cpp line 951: > >> 949: >> 950: Node* LoadVectorNode::Ideal(PhaseGVN* phase, bool can_reshape) { >> 951: const TypeVect* vt = as_LoadVector()->vect_type(); > > Why do you need `as_LoadVector()` for `this`? Same in `StoreVectorNode::Ideal()`. Good catch and thanks! We could directly use "vect_type()" here. I will change this later. > src/hotspot/share/opto/vectornode.cpp line 988: > >> 986: } >> 987: } >> 988: return LoadNode::Ideal(phase, can_reshape); > > Should this call `LoadVectorNode::Ideal`? > I understand you did optimization because `vector_needs_partial_operations` is false for `LoadVectorMaskedNode` in aarch64 case. But what if it is different on some other (not current) platform? Right, calling `LoadVectorNode::Ideal()` is better. I will change this later. Thanks.
> src/hotspot/share/opto/vectornode.cpp line 1008: > >> 1006: } >> 1007: } >> 1008: return StoreNode::Ideal(phase, can_reshape); > > Should this call `StoreVectorNode::Ideal`? ditto > src/hotspot/share/opto/vectornode.cpp line 1821: > >> 1819: // Transform (MaskAll m1 (VectorMaskGen len)) ==> (VectorMaskGen len) >> 1820: // if the vector length in bytes is lower than the MaxVectorSize. >> 1821: if (is_con_M1(in(1)) && length_in_bytes() < MaxVectorSize) { > > Due to #8877 such length check may not correct here. > And I don't see `in(2)->Opcode() == Op_VectorMaskGen` check. I think changes in #8877 influences the max vector length in superword? And since `MaskAll` is used for VectorAPI, the `MaxVectorSize` is still the right reference? @jatin-bhateja, could you please help to check whether this has any influence on x86 avx-512 system? Thanks so much! > And I don't see in(2)->Opcode() == Op_VectorMaskGen check. Yes, the `Op_VectorMaskGen` is not generated for `MaskAll` when its input is a constant. We directly transform the `MaskAll` to `VectorMaskGen` here, since they two have the same meanings. Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/9037 From xgong at openjdk.java.net Wed Jun 8 03:06:36 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Wed, 8 Jun 2022 03:06:36 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 02:34:27 GMT, Xiaohong Gong wrote: >> src/hotspot/share/opto/matcher.cpp line 2255: >> >>> 2253: case Op_FmaVF: >>> 2254: case Op_MacroLogicV: >>> 2255: case Op_LoadVectorMasked: >> >> Why it is removed? > > Thanks for looking at this PR! This is removed since we added two combined rules in the aarch64_sve.ad (line-2383) like: > > match(Set dst (VectorLoadMask (LoadVectorMasked mem pg))); > > Setting `Op_LoadVectorMasked` as a root in matcher will make such rules not be matched. I'm also curious why this op is added here. 
Does it have any influence if I remove it? Thanks! @jatin-bhateja, could you please help check the impact of removing this? I would appreciate your feedback. Thanks so much! ------------- PR: https://git.openjdk.java.net/jdk/pull/9037 From thartmann at openjdk.java.net Wed Jun 8 05:10:25 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 8 Jun 2022 05:10:25 GMT Subject: RFR: 8287700: C2 Crash running eclipse benchmark from Dacapo In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 07:48:42 GMT, Roland Westrelin wrote: > With 8275201 (C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses), I made the following change to escape.cpp: > > > @@ -2628,7 +2632,7 @@ bool ConnectionGraph::split_AddP(Node *addp, Node *base) { > // this code branch will go away. > // > if (!t->is_known_instance() && > - !base_t->klass()->is_subtype_of(t->klass())) { > + !t->maybe_java_subtype_of(base_t)) { > return false; // bail out > } > const TypeOopPtr *tinst = base_t->add_offset(t->offset())->is_oopptr(); > @@ -3312,7 +3316,7 @@ void ConnectionGraph::split_unique_types(GrowableArray &alloc_worklist, > } else { > tn_t = tn_type->isa_oopptr(); > } > - if (tn_t != NULL && tinst->klass()->is_subtype_of(tn_t->klass())) { > + if (tn_t != NULL && tn_t->maybe_java_subtype_of(tinst)) { > if (tn_type->isa_narrowoop()) { > tn_type = tinst->make_narrowoop(); > } else { > @@ -3325,7 +3329,7 @@ void ConnectionGraph::split_unique_types(GrowableArray &alloc_worklist, > record_for_optimizer(n); > } else { > assert(tn_type == TypePtr::NULL_PTR || > - tn_t != NULL && !tinst->klass()->is_subtype_of(tn_t->klass()), > + tn_t != NULL && !tinst->is_java_subtype_of(tn_t), > "unexpected type"); > continue; // Skip dead path with different type > } > > > Where I inverted the subtype and supertype in a subtype check (that is `tn_t->maybe_java_subtype_of(tinst)` when it was `tinst->klass()->is_subtype_of(tn_t->klass())`) in 2 places for no good reason AFAICT now.
The assert used to also test the same condition as the if above but I changed that by mistake. This fix addresses both issues. Looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/9054 From thartmann at openjdk.java.net Wed Jun 8 05:15:32 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 8 Jun 2022 05:15:32 GMT Subject: RFR: 8286625: C2 fails with assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect [v2] In-Reply-To: References: <-ZfbcgBcRabQqlggf35uK2HTi-1MSnCCZBV1qwRrT8E=.ae2012fd-6e54-4a16-9070-85f08b74beb6@github.com> Message-ID: On Tue, 7 Jun 2022 15:08:29 GMT, Roland Westrelin wrote: >> It's another case where because of overunrolling, the main loop is >> never executed but not optimized out and the type of some >> CastII/ConvI2L for a range check conflicts with the type of its input >> resulting in a broken graph for the main loop. >> >> This is supposed to have been solved by skeleton predicates. There's >> indeed a predicate that should catch that the loop is unreachable but >> it doesn't constant fold. The shape of the predicate is: >> >> (CmpUL (SubL 15 (ConvI2L (AddI (CastII int:>=1) 15) minint..maxint)) 16) >> >> I propose adding a CastII, that is in this case: >> >> (CmpUL (SubL 15 (ConvI2L (CastII (AddI (CastII int:>=1) 15) 0..max-1) minint..maxint)) 16) >> >> The justification for the CastII is that the skeleton predicate is a >> predicate for a specific iteration of the loop. That iteration of the >> loop must be in the range of the iv Phi. >> >> With the extra CastII, the AddI can be pushed through the CastII and >> ConvI2L and the check constant folds.
Actually, with the extra CastII, >> the predicate is not implemented with a CmpUL but a CmpU because the >> code can tell there's no risk of overflow (I did force the use of >> CmpUL as an experiment and the CmpUL does constant fold) > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/compiler/loopopts/TestOverUnrolling2.java > > Co-authored-by: Tobias Hartmann All tests passed. ------------- PR: https://git.openjdk.java.net/jdk/pull/8996 From thartmann at openjdk.java.net Wed Jun 8 05:16:39 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 8 Jun 2022 05:16:39 GMT Subject: RFR: 8286451: C2: assert(nb == 1) failed: only when the head is not shared [v2] In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 15:08:43 GMT, Roland Westrelin wrote: >> nb counts the number of loops that share a single head. The assert >> that fires is in code that handles the case of a self loop (a loop >> composed of a single block). There can be a self loop and multiple >> loops that share a head: the assert makes little sense and I propose >> to simply remove it. >> >> I think there's another issue with this code: in the case of a self >> loop and multiple loops that share a head, the self loop can be any of >> the loop for which the head is cloned not only the one that's passed >> as argument to ciTypeFlow::clone_loop_head(). As a consequence, I >> moved the logic for self loops in the loop that's applied to all loops >> that share the loop head. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - review > - Merge branch 'master' into JDK-8286451 > - fix & test All tests passed. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8947 From chagedorn at openjdk.java.net Wed Jun 8 05:35:31 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Wed, 8 Jun 2022 05:35:31 GMT Subject: RFR: 8285965: TestScenarios.java does not check for "" correctly In-Reply-To: References: Message-ID: On Wed, 11 May 2022 06:13:12 GMT, Christian Hagedorn wrote: > This is another rare occurrence of `` that is not handled correctly by `TestScenarios.java`. > > We wrongly search for this safepoint message in the test VM output with `getTestVMOutput()`: > > https://github.com/openjdk/jdk/blob/9c2548414c71b4caaad6ad9e1b122f474e705300/test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/Utils.java#L44-L53 > > But this does not help since the IR matcher is parsing the `hotspot_pid` file for IR matching and not the test VM output. We could therefore find this safepoint message in the `hotspot_pid` file and bail out of IR matching while the test VM output does not contain it. This makes `TestScenarios.java` fail. > > The fix we did for other IR framework tests is to redirect the output of the JTreg test VM itself to a stream in order to search it for ``. We are dumping this message as part of a warning when the IR matcher bails out: > > https://github.com/openjdk/jdk/blob/9c2548414c71b4caaad6ad9e1b122f474e705300/test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/IRMatcher.java#L86-L96 > > Output for the reported failure: > > Scenario #3 - [-XX:TLABRefillWasteFraction=53]: > [...] > Found , bail out of IR matching > > > I suggest using the same fix for `TestScenarios`. > > Thanks, > Christian Thanks Vladimir for your review!
------------- PR: https://git.openjdk.java.net/jdk/pull/8647 From fjiang at openjdk.java.net Wed Jun 8 06:12:58 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Wed, 8 Jun 2022 06:12:58 GMT Subject: RFR: 8287970: riscv: jdk/incubator/vector/*VectorTests failing Message-ID: The following tests are failing with "assert(n_type->isa_vect() == __null || lrg._is_vector || ireg == Op_RegD || ireg == Op_RegL || ireg == Op_RegVectMask) failed: vector must be in vector registers" because C2 instruct "vpopcountI" stores the result into a general-purpose register (GPR) instead of a vector register: jdk/incubator/vector/Byte256VectorTests.java jdk/incubator/vector/ByteMaxVectorTests.java jdk/incubator/vector/Int256VectorTests.java jdk/incubator/vector/IntMaxVectorTests.java jdk/incubator/vector/Short256VectorTests.java jdk/incubator/vector/ShortMaxVectorTests.java [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added a new vector operation VectorOperations.BIT_COUNT, which needs the support of PopCountV*. Currently, riscv vector extension `vpopc.m` instruction counts the number of mask elements of the active elements of the vector source mask register that has the value 1 and writes the result to a scalar x register. [1] So we decide to remove the vpopcountI instruct for now.
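To make the mismatch described above concrete, here is a minimal scalar sketch in plain Java. It is only an illustration: the class and method names (`PopCountSketch`, `laneBitCount`, `maskTrueCount`) are hypothetical and are not part of HotSpot or the Vector API. A per-lane bit count produces one result per lane, so its result is itself a vector, whereas counting the set elements of a mask produces a single scalar, which is the kind of value that lands in an x register.

```java
import java.util.Arrays;

public class PopCountSketch {
    // Per-lane population count: one result per input lane,
    // so the result has the same shape as the input (a "vector").
    static int[] laneBitCount(int[] lanes) {
        int[] out = new int[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = Integer.bitCount(lanes[i]);
        }
        return out;
    }

    // Mask population count: the number of true mask elements,
    // collapsed into a single scalar value.
    static int maskTrueCount(boolean[] mask) {
        int count = 0;
        for (boolean set : mask) {
            if (set) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] lanes = {0b1011, 0, -1, 8};
        boolean[] mask = {true, false, true, true};
        System.out.println(Arrays.toString(laneBitCount(lanes))); // [3, 0, 32, 1]
        System.out.println(maskTrueCount(mask));                  // 3
    }
}
```

The two helpers have different result shapes (`int[]` versus `int`), which mirrors why a match rule that yields a scalar cannot stand in for a node whose type is a vector.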
[1]: https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf Additional Tests: - [x] jdk/incubator/vector (release with UseRVV on QEMU) - [ ] hotspot:tier1 (release with UseRVV on QEMU) ------------- Commit messages: - remove C2 instruction vpopcountI Changes: https://git.openjdk.java.net/jdk/pull/9079/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9079&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287970 Stats: 12 lines in 1 file changed: 0 ins; 12 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/9079.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9079/head:pull/9079 PR: https://git.openjdk.java.net/jdk/pull/9079 From roland at openjdk.java.net Wed Jun 8 06:38:37 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Wed, 8 Jun 2022 06:38:37 GMT Subject: RFR: 8286625: C2 fails with assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect [v2] In-Reply-To: References: <-ZfbcgBcRabQqlggf35uK2HTi-1MSnCCZBV1qwRrT8E=.ae2012fd-6e54-4a16-9070-85f08b74beb6@github.com> Message-ID: On Wed, 8 Jun 2022 05:12:24 GMT, Tobias Hartmann wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/hotspot/jtreg/compiler/loopopts/TestOverUnrolling2.java >> >> Co-authored-by: Tobias Hartmann > > All tests passed. 
@TobiHartmann @chhagedorn thanks for the reviews ------------- PR: https://git.openjdk.java.net/jdk/pull/8996 From roland at openjdk.java.net Wed Jun 8 06:38:41 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Wed, 8 Jun 2022 06:38:41 GMT Subject: Integrated: 8286625: C2 fails with assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect In-Reply-To: <-ZfbcgBcRabQqlggf35uK2HTi-1MSnCCZBV1qwRrT8E=.ae2012fd-6e54-4a16-9070-85f08b74beb6@github.com> References: <-ZfbcgBcRabQqlggf35uK2HTi-1MSnCCZBV1qwRrT8E=.ae2012fd-6e54-4a16-9070-85f08b74beb6@github.com> Message-ID: On Thu, 2 Jun 2022 15:43:09 GMT, Roland Westrelin wrote: > It's another case where because of overunrolling, the main loop is > never executed but not optimized out and the type of some > CastII/ConvI2L for a range check conflicts with the type of its input > resulting in a broken graph for the main loop. > > This is supposed to have been solved by skeleton predicates. There's > indeed a predicate that should catch that the loop is unreachable but > it doesn't constant fold. The shape of the predicate is: > > (CmpUL (SubL 15 (ConvI2L (AddI (CastII int:>=1) 15) minint..maxint)) 16) > > I propose adding a CastII, that is in this case: > > (CmpUL (SubL 15 (ConvI2L (CastII (AddI (CastII int:>=1) 15) 0..max-1) minint..maxint)) 16) > > The justification for the CastII is that the skeleton predicate is a > predicate for a specific iteration of the loop. That iteration of the > loop must be in the range of the iv Phi. > > With the extra CastII, the AddI can be pushed through the CastII and > ConvI2L and the check constant folds. Actually, with the extra CastII, > the predicate is not implemented with a CmpUL but a CmpU because the > code can tell there's no risk of overflow (I did force the use of > CmpUL as an experiment and the CmpUL does constant fold) This pull request has now been integrated. 
Changeset: 590337e2 Author: Roland Westrelin URL: https://git.openjdk.java.net/jdk/commit/590337e2f229445e353e7c32e0dcff8d93e412d2 Stats: 64 lines in 3 files changed: 63 ins; 0 del; 1 mod 8286625: C2 fails with assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.java.net/jdk/pull/8996 From roland at openjdk.java.net Wed Jun 8 06:39:37 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Wed, 8 Jun 2022 06:39:37 GMT Subject: RFR: 8287700: C2 Crash running eclipse benchmark from Dacapo In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 14:57:07 GMT, Christian Hagedorn wrote: >> With 8275201 (C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses), I made the following change to escape.cpp: >> >> >> @@ -2628,7 +2632,7 @@ bool ConnectionGraph::split_AddP(Node *addp, Node *base) { >> // this code branch will go away. >> // >> if (!t->is_known_instance() && >> - !base_t->klass()->is_subtype_of(t->klass())) { >> + !t->maybe_java_subtype_of(base_t)) { >> return false; // bail out >> } >> const TypeOopPtr *tinst = base_t->add_offset(t->offset())->is_oopptr(); >> @@ -3312,7 +3316,7 @@ void ConnectionGraph::split_unique_types(GrowableArray &alloc_worklist, >> } else { >> tn_t = tn_type->isa_oopptr(); >> } >> - if (tn_t != NULL && tinst->klass()->is_subtype_of(tn_t->klass())) { >> + if (tn_t != NULL && tn_t->maybe_java_subtype_of(tinst)) { >> if (tn_type->isa_narrowoop()) { >> tn_type = tinst->make_narrowoop(); >> } else { >> @@ -3325,7 +3329,7 @@ void ConnectionGraph::split_unique_types(GrowableArray &alloc_worklist, >> record_for_optimizer(n); >> } else { >> assert(tn_type == TypePtr::NULL_PTR || >> - tn_t != NULL && !tinst->klass()->is_subtype_of(tn_t->klass()), >> + tn_t != NULL && !tinst->is_java_subtype_of(tn_t), >> "unexpected type"); >> continue; // Skip dead path with different type >> } >> >> >> Where I inverted the subtype and supertype
in a subtype check (that is `tn_t->maybe_java_subtype_of(tinst)` when it was `tinst->klass()->is_subtype_of(tn_t->klass())`) in 2 places for no good reason AFAICT now. The assert used to also test the same condition as the if above but I changed that by mistake. This fix addresses both issues. > > That looks good to me. @chhagedorn @vnkozlov @navyxliu @TobiHartmann thanks for the reviews ------------- PR: https://git.openjdk.java.net/jdk/pull/9054 From roland at openjdk.java.net Wed Jun 8 06:39:39 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Wed, 8 Jun 2022 06:39:39 GMT Subject: Integrated: 8287700: C2 Crash running eclipse benchmark from Dacapo In-Reply-To: References: Message-ID: <4ZTyUAnMSL-qGww_H8MvUG-8kWKXO-BsUOf8b6dyvxI=.b18b0ea1-faf6-4c66-96a6-845279c9ae9f@github.com> On Tue, 7 Jun 2022 07:48:42 GMT, Roland Westrelin wrote: > With 8275201 (C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses), I made the following change to escape.cpp: > > > @@ -2628,7 +2632,7 @@ bool ConnectionGraph::split_AddP(Node *addp, Node *base) { > // this code branch will go away.
> // > if (!t->is_known_instance() && > - !base_t->klass()->is_subtype_of(t->klass())) { > + !t->maybe_java_subtype_of(base_t)) { > return false; // bail out > } > const TypeOopPtr *tinst = base_t->add_offset(t->offset())->is_oopptr(); > @@ -3312,7 +3316,7 @@ void ConnectionGraph::split_unique_types(GrowableArray &alloc_worklist, > } else { > tn_t = tn_type->isa_oopptr(); > } > - if (tn_t != NULL && tinst->klass()->is_subtype_of(tn_t->klass())) { > + if (tn_t != NULL && tn_t->maybe_java_subtype_of(tinst)) { > if (tn_type->isa_narrowoop()) { > tn_type = tinst->make_narrowoop(); > } else { > @@ -3325,7 +3329,7 @@ void ConnectionGraph::split_unique_types(GrowableArray &alloc_worklist, > record_for_optimizer(n); > } else { > assert(tn_type == TypePtr::NULL_PTR || > - tn_t != NULL && !tinst->klass()->is_subtype_of(tn_t->klass()), > + tn_t != NULL && !tinst->is_java_subtype_of(tn_t), > "unexpected type"); > continue; // Skip dead path with different type > } > > > Where I inverted the subtype and supertype in a subtype check (that is `tn_t->maybe_java_subtype_of(tinst)` when it was `tinst->klass()->is_subtype_of(tn_t->klass())`) in 2 places for no good reason AFAICT now. The assert used to also test the same condition as the if above but I changed that by mistake. This fix addresses both issues. This pull request has now been integrated.
Changeset: 0960ecc4 Author: Roland Westrelin URL: https://git.openjdk.java.net/jdk/commit/0960ecc407f8049903e3d183ac75c6a85dcc5b5f Stats: 76 lines in 2 files changed: 73 ins; 0 del; 3 mod 8287700: C2 Crash running eclipse benchmark from Dacapo Reviewed-by: chagedorn, kvn, xliu, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/9054 From roland at openjdk.java.net Wed Jun 8 06:45:22 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Wed, 8 Jun 2022 06:45:22 GMT Subject: Integrated: 8286451: C2: assert(nb == 1) failed: only when the head is not shared In-Reply-To: References: Message-ID: On Mon, 30 May 2022 13:50:08 GMT, Roland Westrelin wrote: > nb counts the number of loops that share a single head. The assert > that fires is in code that handles the case of a self loop (a loop > composed of a single block). There can be a self loop and multiple > loops that share a head: the assert makes little sense and I propose > to simply remove it. > > I think there's another issue with this code: in the case of a self > loop and multiple loops that share a head, the self loop can be any of > the loop for which the head is cloned not only the one that's passed > as argument to ciTypeFlow::clone_loop_head(). As a consequence, I > moved the logic for self loops in the loop that's applied to all loops > that share the loop head. This pull request has now been integrated. 
Changeset: bf0e625f Author: Roland Westrelin URL: https://git.openjdk.java.net/jdk/commit/bf0e625fe0e83c00006f13367a67e9f6175d21e4 Stats: 82 lines in 2 files changed: 65 ins; 14 del; 3 mod 8286451: C2: assert(nb == 1) failed: only when the head is not shared Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.java.net/jdk/pull/8947 From roland at openjdk.java.net Wed Jun 8 06:45:21 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Wed, 8 Jun 2022 06:45:21 GMT Subject: RFR: 8286451: C2: assert(nb == 1) failed: only when the head is not shared [v2] In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 14:51:50 GMT, Christian Hagedorn wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - review >> - Merge branch 'master' into JDK-8286451 >> - fix & test > > That looks good, thanks for doing the updates! @chhagedorn @TobiHartmann thanks for the reviews ------------- PR: https://git.openjdk.java.net/jdk/pull/8947 From kvn at openjdk.java.net Wed Jun 8 07:22:18 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 8 Jun 2022 07:22:18 GMT Subject: RFR: 8287970: riscv: jdk/incubator/vector/*VectorTests failing In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 06:06:08 GMT, Feilong Jiang wrote: > [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added a new vector operation VectorOperations.BIT_COUNT, which needs the support of PopCountV*. 
The following tests failed when enabling `UseRVV`: > > jdk/incubator/vector/Byte256VectorTests.java > jdk/incubator/vector/ByteMaxVectorTests.java > jdk/incubator/vector/Int256VectorTests.java > jdk/incubator/vector/IntMaxVectorTests.java > jdk/incubator/vector/Short256VectorTests.java > jdk/incubator/vector/ShortMaxVectorTests.java > > Tests are failing with "assert(n_type->isa_vect() == __null || lrg._is_vector || ireg == Op_RegD || ireg == Op_RegL || ireg == Op_RegVectMask) failed: vector must be in vector registers" because C2 instruct "vpopcountI" stores the result into a general-purpose register (GPR) instead of a vector register. > > Currently, riscv vector extension `vpopc.m` instruction counts the number of mask elements of the active elements of the vector source mask register that has the value 1 and writes the result to a scalar x register. [1] `PopCountV*` needs to write back the pop counting results to vector registers, there is no single instruction in rvv that can satisfy the requirement. So we decide to remove the vpopcountI instruct for now. > > [1]: https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf > > Additional Tests: > - [x] jdk/incubator/vector (release with UseRVV on QEMU) > - [ ] hotspot:tier1 (release with UseRVV on QEMU) Can you instead add Op_PopCountVI to op_vec_supported()? Current change looks trivial and fine too. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/9079 From jiefu at openjdk.java.net Wed Jun 8 07:43:15 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 8 Jun 2022 07:43:15 GMT Subject: RFR: 8288000: compiler/loopopts/TestOverUnrolling2.java fails with release VMs Message-ID: Hi all, Please review this trivial change which fixes the failure of compiler/loopopts/TestOverUnrolling2.java with release VMs. Only `-XX:+UnlockDiagnosticVMOptions` is added. Thanks. 
Best regards, Jie ------------- Commit messages: - 8288000: compiler/loopopts/TestOverUnrolling2.java fails with release VMs Changes: https://git.openjdk.java.net/jdk/pull/9080/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9080&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288000 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/9080.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9080/head:pull/9080 PR: https://git.openjdk.java.net/jdk/pull/9080 From roland at openjdk.java.net Wed Jun 8 07:50:25 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Wed, 8 Jun 2022 07:50:25 GMT Subject: RFR: 8288000: compiler/loopopts/TestOverUnrolling2.java fails with release VMs In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 07:33:11 GMT, Jie Fu wrote: > Hi all, > > Please review this trivial change which fixes the failure of compiler/loopopts/TestOverUnrolling2.java with release VMs. > Only `-XX:+UnlockDiagnosticVMOptions` is added. > > Thanks. > Best regards, > Jie Looks good to me. Thanks for fixing that. ------------- Marked as reviewed by roland (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/9080 From fjiang at openjdk.java.net Wed Jun 8 07:59:14 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Wed, 8 Jun 2022 07:59:14 GMT Subject: RFR: 8287970: riscv: jdk/incubator/vector/*VectorTests failing [v2] In-Reply-To: References: Message-ID: > [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added a new vector operation VectorOperations.BIT_COUNT, which needs the support of PopCountV*. 
The following tests failed when enabling `UseRVV`: > > jdk/incubator/vector/Byte256VectorTests.java > jdk/incubator/vector/ByteMaxVectorTests.java > jdk/incubator/vector/Int256VectorTests.java > jdk/incubator/vector/IntMaxVectorTests.java > jdk/incubator/vector/Short256VectorTests.java > jdk/incubator/vector/ShortMaxVectorTests.java > > Tests are failing with "assert(n_type->isa_vect() == __null || lrg._is_vector || ireg == Op_RegD || ireg == Op_RegL || ireg == Op_RegVectMask) failed: vector must be in vector registers" because C2 instruct "vpopcountI" stores the result into a general-purpose register (GPR) instead of a vector register. > > Currently, riscv vector extension `vpopc.m` instruction counts the number of mask elements of the active elements of the vector source mask register that has the value 1 and writes the result to a scalar x register. [1] `PopCountV*` needs to write back the pop counting results to vector registers, there is no single instruction in rvv that can satisfy the requirement. So we decide to remove the vpopcountI instruct for now. 
> > [1]: https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf > > Additional Tests: > - [x] jdk/incubator/vector (release with UseRVV on QEMU) > - [ ] hotspot:tier1 (release with UseRVV on QEMU) Feilong Jiang has updated the pull request incrementally with one additional commit since the last revision: disable Op_PopCountV* in op_vec_supported ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/9079/files - new: https://git.openjdk.java.net/jdk/pull/9079/files/6a05b12c..958f1042 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=9079&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=9079&range=00-01 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/9079.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9079/head:pull/9079 PR: https://git.openjdk.java.net/jdk/pull/9079 From thartmann at openjdk.java.net Wed Jun 8 08:00:37 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 8 Jun 2022 08:00:37 GMT Subject: RFR: 8288000: compiler/loopopts/TestOverUnrolling2.java fails with release VMs In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 07:33:11 GMT, Jie Fu wrote: > Hi all, > > Please review this trivial change which fixes the failure of compiler/loopopts/TestOverUnrolling2.java with release VMs. > Only `-XX:+UnlockDiagnosticVMOptions` is added. > > Thanks. > Best regards, > Jie Good and trivial. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/9080 From fjiang at openjdk.java.net Wed Jun 8 08:02:32 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Wed, 8 Jun 2022 08:02:32 GMT Subject: RFR: 8287970: riscv: jdk/incubator/vector/*VectorTests failing [v2] In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 07:19:18 GMT, Vladimir Kozlov wrote: > Can you instead add Op_PopCountVI to op_vec_supported()? Current change looks trivial and fine too. 
done ------------- PR: https://git.openjdk.java.net/jdk/pull/9079 From fyang at openjdk.java.net Wed Jun 8 08:06:28 2022 From: fyang at openjdk.java.net (Fei Yang) Date: Wed, 8 Jun 2022 08:06:28 GMT Subject: RFR: 8287970: riscv: jdk/incubator/vector/*VectorTests failing [v2] In-Reply-To: References: Message-ID: <3LLqbYY2C1ptcSu15S_VARpw1avYt8QVZrG5uf9LISI=.5dcbb628-a9e8-4967-b329-1e8b3e81448e@github.com> On Wed, 8 Jun 2022 07:59:14 GMT, Feilong Jiang wrote: >> [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added a new vector operation VectorOperations.BIT_COUNT, which needs the support of PopCountV*. The following tests failed when enabling `UseRVV`: >> >> jdk/incubator/vector/Byte256VectorTests.java >> jdk/incubator/vector/ByteMaxVectorTests.java >> jdk/incubator/vector/Int256VectorTests.java >> jdk/incubator/vector/IntMaxVectorTests.java >> jdk/incubator/vector/Short256VectorTests.java >> jdk/incubator/vector/ShortMaxVectorTests.java >> >> Tests are failing with "assert(n_type->isa_vect() == __null || lrg._is_vector || ireg == Op_RegD || ireg == Op_RegL || ireg == Op_RegVectMask) failed: vector must be in vector registers" because C2 instruct "vpopcountI" stores the result into a general-purpose register (GPR) instead of a vector register. >> >> Currently, riscv vector extension `vpopc.m` instruction counts the number of mask elements of the active elements of the vector source mask register that has the value 1 and writes the result to a scalar x register. [1] `PopCountV*` needs to write back the pop counting results to vector registers, there is no single instruction in rvv that can satisfy the requirement. So we decide to remove the vpopcountI instruct for now. 
>> >> [1]: https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf >> >> Additional Tests: >> - [x] jdk/incubator/vector (release with UseRVV on QEMU) >> - [ ] hotspot:tier1 (release with UseRVV on QEMU) > > Feilong Jiang has updated the pull request incrementally with one additional commit since the last revision: > > disable Op_PopCountV* in op_vec_supported Changes looks fine and safe to me. ------------- Marked as reviewed by fyang (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/9079 From aph at openjdk.java.net Wed Jun 8 08:06:31 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Wed, 8 Jun 2022 08:06:31 GMT Subject: RFR: 8287903: Reduce runtime of java.math microbenchmarks In-Reply-To: References: Message-ID: <6dJJ1EmuX98GyD1dGI9wedSIgIU8DBqlynli1_Gbpys=.662230b9-a380-4851-8acc-964b8f01e0e8@github.com> On Tue, 7 Jun 2022 12:34:25 GMT, Claes Redestad wrote: > - Reduce runtime by running fewer forks, fewer iterations, less warmup. All micros tested in this group appear to stabilize very quickly. > - Refactor BigIntegers to avoid re-running some (most) micros over and over with parameter values that don't affect them. > > Expected runtime down from 14 hours to 15 minutes. Marked as reviewed by aph (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/9062 From jiefu at openjdk.java.net Wed Jun 8 08:12:31 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 8 Jun 2022 08:12:31 GMT Subject: RFR: 8288000: compiler/loopopts/TestOverUnrolling2.java fails with release VMs In-Reply-To: References: Message-ID: <6qhyAqk37Z5NVaQcw855Mx5yRpHPwgIHkO2gcSWfSoQ=.d57c249e-6745-43f3-a7b7-768e5ce3ad85@github.com> On Wed, 8 Jun 2022 07:46:57 GMT, Roland Westrelin wrote: >> Hi all, >> >> Please review this trivial change which fixes the failure of compiler/loopopts/TestOverUnrolling2.java with release VMs. >> Only `-XX:+UnlockDiagnosticVMOptions` is added. >> >> Thanks. >> Best regards, >> Jie > > Looks good to me. 
Thanks for fixing that. Thanks @rwestrel and @TobiHartmann for the review. ------------- PR: https://git.openjdk.java.net/jdk/pull/9080 From jiefu at openjdk.java.net Wed Jun 8 08:12:32 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 8 Jun 2022 08:12:32 GMT Subject: Integrated: 8288000: compiler/loopopts/TestOverUnrolling2.java fails with release VMs In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 07:33:11 GMT, Jie Fu wrote: > Hi all, > > Please review this trivial change which fixes the failure of compiler/loopopts/TestOverUnrolling2.java with release VMs. > Only `-XX:+UnlockDiagnosticVMOptions` is added. > > Thanks. > Best regards, > Jie This pull request has now been integrated. Changeset: d959c22a Author: Jie Fu URL: https://git.openjdk.java.net/jdk/commit/d959c22a9574359e2d5134ac8365e8a9df4f7cef Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8288000: compiler/loopopts/TestOverUnrolling2.java fails with release VMs Reviewed-by: roland, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/9080 From duke at openjdk.java.net Wed Jun 8 08:13:38 2022 From: duke at openjdk.java.net (Yuta Sato) Date: Wed, 8 Jun 2022 08:13:38 GMT Subject: RFR: 8286990: Add compiler name to warning messages in Compiler Directive [v3] In-Reply-To: References: Message-ID: On Tue, 31 May 2022 03:22:46 GMT, Yuta Sato wrote: >> When using Compiler Directive such as `java -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile= ` , >> it shows totally the same message for c1 and c2 compiler and the user would be confused about >> which compiler is affected by this message. >> This should show messages with their compiler name so that the user knows which compiler shows this message. >> >> My change result would be like the below. 
>> >> >> OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output >> OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output >> >> -> >> >> OpenJDK 64-Bit Server VM warning: c1: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output >> OpenJDK 64-Bit Server VM warning: c2: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > > Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: > > add const to method Could anyone help to be a sponsor for this? ------------- PR: https://git.openjdk.java.net/jdk/pull/8591 From epeter at openjdk.java.net Wed Jun 8 08:17:35 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Wed, 8 Jun 2022 08:17:35 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v29] In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 16:59:53 GMT, Roberto Castañeda Lozano wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply suggestions from code review from @TobiHartmann >> >> Thank you @TobiHartmann >> >> Co-authored-by: Tobias Hartmann > > src/hotspot/share/opto/node.cpp line 1997: > >> 1995: tty->print(" H: display this help info, with examples\n"); >> 1996: tty->print(" +: traverse in-edges (on if neither + nor -)\n"); >> 1997: tty->print(" -: traverse out-edges\n"); > > I would find it more natural if out-edges were also traversed by default, but it is a matter of taste. I think I will keep it, as `dump` also by default only traverses in-edges. Additionally, it can be quite confusing to traverse input and output edges at the same time.
------------- PR: https://git.openjdk.java.net/jdk/pull/8468 From dlong at openjdk.java.net Wed Jun 8 08:21:39 2022 From: dlong at openjdk.java.net (Dean Long) Date: Wed, 8 Jun 2022 08:21:39 GMT Subject: RFR: 8287970: riscv: jdk/incubator/vector/*VectorTests failing [v2] In-Reply-To: References: Message-ID: <_0CxOQF2jmc3FxN3DawcQtqvrFIKCgmfLbyqzmV5Cvs=.07d718a6-0285-44f6-b157-07b27273cfbf@github.com> On Wed, 8 Jun 2022 07:59:14 GMT, Feilong Jiang wrote: >> [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added a new vector operation VectorOperations.BIT_COUNT, which needs the support of PopCountV*. The following tests failed when enabling `UseRVV`: >> >> jdk/incubator/vector/Byte256VectorTests.java >> jdk/incubator/vector/ByteMaxVectorTests.java >> jdk/incubator/vector/Int256VectorTests.java >> jdk/incubator/vector/IntMaxVectorTests.java >> jdk/incubator/vector/Short256VectorTests.java >> jdk/incubator/vector/ShortMaxVectorTests.java >> >> Tests are failing with "assert(n_type->isa_vect() == __null || lrg._is_vector || ireg == Op_RegD || ireg == Op_RegL || ireg == Op_RegVectMask) failed: vector must be in vector registers" because C2 instruct "vpopcountI" stores the result into a general-purpose register (GPR) instead of a vector register. >> >> Currently, riscv vector extension `vpopc.m` instruction counts the number of mask elements of the active elements of the vector source mask register that has the value 1 and writes the result to a scalar x register. [1] `PopCountV*` needs to write back the pop counting results to vector registers, there is no single instruction in rvv that can satisfy the requirement. So we decide to remove the vpopcountI instruct for now. 
>> >> [1]: https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf >> >> Additional Tests: >> - [x] jdk/incubator/vector (release with UseRVV on QEMU) >> - [ ] hotspot:tier1 (release with UseRVV on QEMU) > > Feilong Jiang has updated the pull request incrementally with one additional commit since the last revision: > > disable Op_PopCountV* in op_vec_supported Marked as reviewed by dlong (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/9079 From epeter at openjdk.java.net Wed Jun 8 08:23:32 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Wed, 8 Jun 2022 08:23:32 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v29] In-Reply-To: References: Message-ID: <9yUS-NnoZ6BlPIFLEC3yYme0Muq9Eyut_0EqGGgUubk=.c7ee3b7c-1d90-4529-ad91-99c7da9c9f14@github.com> On Tue, 7 Jun 2022 17:01:52 GMT, Roberto Castañeda Lozano wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply suggestions from code review from @TobiHartmann >> >> Thank you @TobiHartmann >> >> Co-authored-by: Tobias Hartmann > > src/hotspot/share/opto/node.cpp line 2310: > >> 2308: // To find all options, run: >> 2309: // find_node(0)->dump_bfs(0,0,"H") >> 2310: void Node::dump_bfs(const int max_distance, Node* target, char const* options) { > > Would be great to have a short version `Node::dump_bfs(n)` equivalent to `Node::dump_bfs(n, 0, 0)`, for convenience. great idea, I will do that! > src/hotspot/share/opto/node.hpp line 1193: > >> 1191: Node* find(int idx, bool only_ctrl = false); // Search the graph for the given idx. >> 1192: Node* find_ctrl(int idx); // Search control ancestors for the given idx.
>> 1193: void dump_bfs(const int max_distance, Node* target, char const* options); // Print BFS traversal > > Suggestion: `char const* options` -> `const char* options` (same for the other occurrences in the changeset). thanks, somehow I got the order a little messed up ------------- PR: https://git.openjdk.java.net/jdk/pull/8468 From yadongwang at openjdk.java.net Wed Jun 8 09:12:35 2022 From: yadongwang at openjdk.java.net (Yadong Wang) Date: Wed, 8 Jun 2022 09:12:35 GMT Subject: RFR: 8287970: riscv: jdk/incubator/vector/*VectorTests failing [v2] In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 07:59:14 GMT, Feilong Jiang wrote: >> [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added a new vector operation VectorOperations.BIT_COUNT, which needs the support of PopCountV*. The following tests failed when enabling `UseRVV`: >> >> jdk/incubator/vector/Byte256VectorTests.java >> jdk/incubator/vector/ByteMaxVectorTests.java >> jdk/incubator/vector/Int256VectorTests.java >> jdk/incubator/vector/IntMaxVectorTests.java >> jdk/incubator/vector/Short256VectorTests.java >> jdk/incubator/vector/ShortMaxVectorTests.java >> >> Tests are failing with "assert(n_type->isa_vect() == __null || lrg._is_vector || ireg == Op_RegD || ireg == Op_RegL || ireg == Op_RegVectMask) failed: vector must be in vector registers" because C2 instruct "vpopcountI" stores the result into a general-purpose register (GPR) instead of a vector register. >> >> Currently, riscv vector extension `vpopc.m` instruction counts the number of mask elements of the active elements of the vector source mask register that has the value 1 and writes the result to a scalar x register. [1] `PopCountV*` needs to write back the pop counting results to vector registers, there is no single instruction in rvv that can satisfy the requirement. So we decide to remove the vpopcountI instruct for now. 
>> >> [1]: https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf >> >> Additional Tests: >> - [x] jdk/incubator/vector (release with UseRVV on QEMU) >> - [x] hotspot:tier1 (release with UseRVV on QEMU without new failure) > > Feilong Jiang has updated the pull request incrementally with one additional commit since the last revision: > > disable Op_PopCountV* in op_vec_supported Please make sure SuperWord works properly without the match rule of PopCountVI and PopCountVL. See hotspot/jtreg/compiler/vectorization/TestPopCountVector.java and hotspot/jtreg/compiler/vectorization/TestPopCountVectorLong.java. ------------- PR: https://git.openjdk.java.net/jdk/pull/9079 From epeter at openjdk.java.net Wed Jun 8 09:23:39 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Wed, 8 Jun 2022 09:23:39 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v30] In-Reply-To: References: Message-ID: > **What this gives you for the debugger** > - BFS traversal (inputs / outputs) > - node filtering by category > - shortest path between nodes > - all paths between nodes > - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional) > - and more > > **Some usecases** > - more readable `dump` > - follow only nodes of some categories (only control, only data, etc) > - find which control nodes depend on data node (visit data nodes, include control in boundary) > - how two nodes relate (shortest / all paths, following input/output nodes, or both) > - find loops (control / memory / data: call all paths with node as start and target) > > **Description** > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. 
With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`). To find all paths between two nodes, include the letter `A` in the options string. > > `void Node::dump_bfs(const int max_distance, Node* target, char const* options)` > > To get familiar with the many options, run this to get help: > `find_node(0)->dump_bfs(0,0,"h")` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > PS: I do plan to refactor the `dump` code in `node.cpp` to use my new infrastructure. I will also remove `Node::related` and `dump_related,` since it has not been properly extended and maintained. But that refactoring would risk messing with tools that depend on `dump`, which I would like to avoid for now, and do that in a second step. > > **Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. > 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`. > 3. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 4. Separate visit / boundary filters by node type: traverse graph visiting only some node types (eg. data). On the boundary, also display but do not traverse nodes allowed by boundary filter (eg. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for visit filter, and `DCMXO` for boundary filter. > 5. 
Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors. > 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. Use `@` in options string. > 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string. > 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string! 
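The separate visit and boundary filters from point 4 can be sketched on a toy graph as follows. This is a simplified illustration under stated assumptions, not the code from the patch: `ToyNode` and `bfs_inputs` are hypothetical names, only input edges are followed, and both filters are passed as lowercase category letters (the real options string uses uppercase letters for the boundary).

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an IR node (hypothetical): a category letter such as
// 'c' (control), 'd' (data) or 'm' (memory), plus indices of input nodes.
struct ToyNode {
  char category;
  std::vector<int> inputs;
};

// Breadth-first traversal over input edges starting at `start`.
// Nodes whose category is in `visit` are displayed and expanded.
// Nodes whose category is only in `boundary` are displayed but not expanded.
// Returns (node index, distance to start) pairs in BFS order.
std::vector<std::pair<int, int>> bfs_inputs(const std::vector<ToyNode>& graph,
                                            int start, int max_distance,
                                            const std::string& visit,
                                            const std::string& boundary) {
  std::vector<std::pair<int, int>> shown;
  std::vector<bool> seen(graph.size(), false);
  std::queue<std::pair<int, int>> worklist;
  seen[start] = true;
  worklist.push(std::make_pair(start, 0));
  while (!worklist.empty()) {
    const int idx = worklist.front().first;
    const int dist = worklist.front().second;
    worklist.pop();
    shown.push_back(std::make_pair(idx, dist));
    // The start node is always expanded; other nodes only if visit allows it.
    const bool expand = idx == start ||
                        visit.find(graph[idx].category) != std::string::npos;
    if (!expand || dist == max_distance) {
      continue;  // boundary node or distance limit reached
    }
    for (int in : graph[idx].inputs) {
      if (seen[in]) {
        continue;
      }
      const char cat = graph[in].category;
      const bool allowed = visit.find(cat) != std::string::npos ||
                           boundary.find(cat) != std::string::npos;
      if (allowed) {
        seen[in] = true;
        worklist.push(std::make_pair(in, dist + 1));
      }
    }
  }
  return shown;
}
```

Calling `bfs_inputs(graph, start, 5, "c", "d")` would walk the control inputs and list the data nodes they touch without descending into them, which mirrors the "visit some categories, show others on the boundary" idea.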
> > Example (BFS inputs): > > (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") > dist dump > --------------------------------------------- > 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > > > Example (BFS control inputs): > > (rr) p find_node(163)->dump_bfs(5,0,"c+") > dist dump > --------------------------------------------- > 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 4 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 0 163 OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 > > We see the control flow of a strip mined loop. 
> > > Experiment (BFS only data, but display all nodes on boundary) > > (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") > dist dump > --------------------------------------------- > 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) > 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] > > We see the dependent output nodes of the data-phi 102, we see that a SafePoint and the Return depend on it. Here colors are really helpful, as it makes it easy to separate the data-nodes (blue) from the boundary-nodes (other colors). 
> > Example with Mach nodes: > > (rr) p find_node(112)->dump_bfs(2,0,"cdmxo+#@B") > dist [head idom d] old dump > --------------------------------------------- > 2 534 505 6 o1871 109 addI_rReg_imm === _ 44 [[ 110 102 113 230 327 ]] #-3/0xfffffffd > 2 536 537 15 o186 139 addI_rReg_imm === _ 137 [[ 140 137 113 144 ]] #4/0x00000004 !jvms: StringLatin1::replace @ bci:13 (line 303) > 2 537 538 14 o179 114 IfTrue === 115 [[ 536 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 536 537 15 o739 113 compI_rReg === _ 139 109 [[ 112 ]] > 1 536 537 15 _ 536 Region === 536 114 [[ 536 112 ]] > 0 536 537 15 o741 112 jmpLoopEnd === 536 113 [[ 134 111 ]] P=0.993611, C=7200.000000 !jvms: StringLatin1::replace @ bci:19 (line 303) > > And the query on the old nodes: > > (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#") > dist dump > --------------------------------------------- > 2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]] > 2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]] > 2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000 > 1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]] > 1 o740 Bool === _ o739 [[ o741 ]] [lt] > 1 o179 IfTrue === o178 [[ o741 ]] #1 > 0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000 > > > **Exploring loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "c+")` > This provides us with a shortest control path, given this path has a distance of at most 20. 
> > Example (shortest path over control nodes): > > (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+") > dist dump > --------------------------------------------- > 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > > Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lay on a path between the start and target node, with at most the specified path length. 
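One way to think about the all paths query is as two BFS passes. The following is a minimal sketch on a toy adjacency list, assuming only input edges are followed and that start and target differ; `bfs_distances`, `PathEntry` and `all_paths_nodes` are hypothetical names and this is not the implementation in `node.cpp`. A node is reported when its shortest distance from the start plus its shortest distance to the target stays within the limit.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

using Adjacency = std::vector<std::vector<int>>;

// Plain BFS: shortest edge count from `source` to every node, -1 if unreachable.
std::vector<int> bfs_distances(const Adjacency& adj, int source) {
  std::vector<int> dist(adj.size(), -1);
  std::queue<int> worklist;
  dist[source] = 0;
  worklist.push(source);
  while (!worklist.empty()) {
    const int n = worklist.front();
    worklist.pop();
    for (int next : adj[n]) {
      if (dist[next] == -1) {
        dist[next] = dist[n] + 1;
        worklist.push(next);
      }
    }
  }
  return dist;
}

// One reported row: node index, distance to start, and the sum of the
// distance to start and the distance to target.
struct PathEntry {
  int idx;
  int dist;
  int apd;
};

// Reports every node that lies on some input-edge path from `start` to
// `target` whose total length is at most `max_distance`.
std::vector<PathEntry> all_paths_nodes(const Adjacency& inputs, int start,
                                       int target, int max_distance) {
  // Reverse the input edges so a BFS from the target measures how far
  // each node is from the target along input edges.
  Adjacency outputs(inputs.size());
  for (std::size_t n = 0; n < inputs.size(); n++) {
    for (int in : inputs[n]) {
      outputs[in].push_back(static_cast<int>(n));
    }
  }
  const std::vector<int> from_start = bfs_distances(inputs, start);
  const std::vector<int> to_target = bfs_distances(outputs, target);
  std::vector<PathEntry> result;
  for (std::size_t n = 0; n < inputs.size(); n++) {
    if (from_start[n] < 0 || to_target[n] < 0) {
      continue;  // not on any path between start and target
    }
    const int apd = from_start[n] + to_target[n];
    if (apd <= max_distance) {
      result.push_back({static_cast<int>(n), from_start[n], apd});
    }
  }
  return result;
}
```

With this formulation the distance from a reported node to the target is simply `apd - dist`, and raising `max_distance` gradually reveals longer detours through the graph.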
> > Example (all paths between two nodes): > > (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") > dist apd dump > --------------------------------------------- > 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) > 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) > 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) > 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) > 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) > 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all paths distance `apd` is the sum of the shortest path from the current node to the start plus the shortest path to the target node. Thus, we can easily compute the distance to the target node with `apd - d`. > > An alternative way to detect loops quickly is to run an all paths query from a node to itself: > > Example (loop detection with all paths): > > (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A") > dist apd dump > --------------------------------------------- > 6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303) > 5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > We get the loop control, plus the loop-back `190 IfTrue`. > > Example (loop detection with all paths for phi): > > (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A") > dist apd dump > --------------------------------------------- > 1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],...
!jvms: StringLatin1::replace @ bci:13 (line 303) > 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > > > **Color examples** > Colors are especially useful to see chains between nodes (options character `#`). > The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. > Tip: it can be worth it to configure the colors of your terminal to be more appealing. > > Example (find control dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) > We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. > > Example (find memory dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) > > Example (loop detection): > ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) > We find the control and some data loop paths. 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: response to review feedback ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/abd53b02..f86320a9 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=29 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=28-29 Stats: 33 lines in 2 files changed: 19 ins; 0 del; 14 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From epeter at openjdk.java.net Wed Jun 8 09:29:37 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Wed, 8 Jun 2022 09:29:37 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v29] In-Reply-To: References: Message-ID: <4F7ZbiTW9XWNwqZCbFzPmi40mNwHwsyuUFo8gGK4dGI=.62c29bfc-26fb-4130-8f2c-56bf9bf37ecc@github.com> On Tue, 7 Jun 2022 17:07:26 GMT, Roberto Castañeda Lozano wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply suggestions from code review from @TobiHartmann >> >> Thank you @TobiHartmann >> >> Co-authored-by: Tobias Hartmann > > src/hotspot/share/opto/node.cpp line 2240: > >> 2238: tty->print(" _"); >> 2239: } else { >> 2240: print_node_idx(b->head()); > > I think it would also be useful to print the block identifier, i.e. `print("B%d", b->_pre_order)`. added it, thanks for the suggestion! > src/hotspot/share/opto/node.cpp line 2278: > >> 2276: } >> 2277: if (_print_blocks) { >> 2278: tty->print(" [head idom d]"); // block > > Perhaps rename the third block column `d` to something more descriptive (`depth` or similar) for clarity, that would also break the ambiguity with `d` which is already used for "distance".
I renamed d -> dist, and d -> depth. That should make things clearer. ------------- PR: https://git.openjdk.java.net/jdk/pull/8468 From duke at openjdk.java.net Wed Jun 8 09:39:23 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 8 Jun 2022 09:39:23 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long [v2] In-Reply-To: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: > Hi, > > This patch implements intrinsics for `Integer/Long::compareUnsigned` using the same approach as the JVM does for long and floating-point comparisons. This allows efficient and reliable usage of unsigned comparison in Java, which is a basic operation and is important for range checks such as discussed in #8620 . > > Thank you very much. Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: - remove comments - review comments ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/9068/files - new: https://git.openjdk.java.net/jdk/pull/9068/files/01c0a07c..b5627135 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=9068&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=9068&range=00-01 Stats: 44 lines in 3 files changed: 32 ins; 12 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/9068.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9068/head:pull/9068 PR: https://git.openjdk.java.net/jdk/pull/9068 From duke at openjdk.java.net Wed Jun 8 09:39:23 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 8 Jun 2022 09:39:23 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long [v2] In-Reply-To: References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Tue, 7 Jun 2022 
17:41:13 GMT, Vladimir Kozlov wrote: >> Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: >> >> - remove comments >> - review comments > > src/hotspot/share/opto/subnode.hpp line 217: > >> 215: //------------------------------CmpU3Node-------------------------------------- >> 216: // Compare 2 unsigned values, returning integer value (-1, 0 or 1). >> 217: class CmpU3Node : public CmpUNode { > > Place it after `CmpUNode` class. Done ------------- PR: https://git.openjdk.java.net/jdk/pull/9068 From duke at openjdk.java.net Wed Jun 8 09:42:32 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 8 Jun 2022 09:42:32 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long In-Reply-To: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Tue, 7 Jun 2022 17:14:18 GMT, Quan Anh Mai wrote: > Hi, > > This patch implements intrinsics for `Integer/Long::compareUnsigned` using the same approach as the JVM does for long and floating-point comparisons. This allows efficient and reliable usage of unsigned comparison in Java, which is a basic operation and is important for range checks such as discussed in #8620 . > > Thank you very much. I have added a benchmark for the intrinsic. The result is as follows, thanks a lot: Before After Benchmark (size) Mode Cnt Score Error Score Error Units Integers.compareUnsigned 500 avgt 15 0.527 ± 0.002 0.498 ± 0.011 us/op Longs.compareUnsigned 500 avgt 15 0.677 ± 0.014 0.561 ±
0.006 us/op ------------- PR: https://git.openjdk.java.net/jdk/pull/9068 From yadongwang at openjdk.java.net Wed Jun 8 09:51:15 2022 From: yadongwang at openjdk.java.net (Yadong Wang) Date: Wed, 8 Jun 2022 09:51:15 GMT Subject: RFR: 8287970: riscv: jdk/incubator/vector/*VectorTests failing [v2] In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 07:59:14 GMT, Feilong Jiang wrote: >> [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added a new vector operation VectorOperations.BIT_COUNT, which needs the support of PopCountV*. The following tests failed when enabling `UseRVV`: >> >> jdk/incubator/vector/Byte256VectorTests.java >> jdk/incubator/vector/ByteMaxVectorTests.java >> jdk/incubator/vector/Int256VectorTests.java >> jdk/incubator/vector/IntMaxVectorTests.java >> jdk/incubator/vector/Short256VectorTests.java >> jdk/incubator/vector/ShortMaxVectorTests.java >> >> Tests are failing with "assert(n_type->isa_vect() == __null || lrg._is_vector || ireg == Op_RegD || ireg == Op_RegL || ireg == Op_RegVectMask) failed: vector must be in vector registers" because C2 instruct "vpopcountI" stores the result into a general-purpose register (GPR) instead of a vector register. >> >> Currently, riscv vector extension `vpopc.m` instruction counts the number of mask elements of the active elements of the vector source mask register that has the value 1 and writes the result to a scalar x register. [1] `PopCountV*` needs to write back the pop counting results to vector registers, there is no single instruction in rvv that can satisfy the requirement. So we decide to remove the vpopcountI instruct for now. 
>> >> [1]: https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf >> >> Additional Tests: >> - [x] jdk/incubator/vector (release with UseRVV on QEMU) >> - [x] hotspot:tier1 (release with UseRVV on QEMU without new failure) > > Feilong Jiang has updated the pull request incrementally with one additional commit since the last revision: > > disable Op_PopCountV* in op_vec_supported lgtm ------------- Marked as reviewed by yadongwang (Author). PR: https://git.openjdk.java.net/jdk/pull/9079 From rcastanedalo at openjdk.java.net Wed Jun 8 10:10:36 2022 From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano) Date: Wed, 8 Jun 2022 10:10:36 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v30] In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 09:23:39 GMT, Emanuel Peter wrote: >> **What this gives you for the debugger** >> - BFS traversal (inputs / outputs) >> - node filtering by category >> - shortest path between nodes >> - all paths between nodes >> - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional) >> - and more >> >> **Some usecases** >> - more readable `dump` >> - follow only nodes of some categories (only control, only data, etc) >> - find which control nodes depend on data node (visit data nodes, include control in boundary) >> - how two nodes relate (shortest / all paths, following input/output nodes, or both) >> - find loops (control / memory / data: call all paths with node as start and target) >> >> **Description** >> I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`).
To find all paths between two nodes, include the letter `A` in the options string. >> >> `void Node::dump_bfs(const int max_distance, Node* target, char const* options)` >> >> To get familiar with the many options, run this to get help: >> `find_node(0)->dump_bfs(0,0,"h")` >> >> While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. >> >> Please let me know if you would find this helpful, or if you have any feedback to improve it. >> Thanks, Emanuel >> >> PS: I do plan to refactor the `dump` code in `node.cpp` to use my new infrastructure. I will also remove `Node::related` and `dump_related,` since it has not been properly extended and maintained. But that refactoring would risk messing with tools that depend on `dump`, which I would like to avoid for now, and do that in a second step. >> >> **Better dump()** >> The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: >> >> 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. >> 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`. >> 3. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. >> 4. Separate visit / boundary filters by node type: traverse graph visiting only some node types (eg. data). On the boundary, also display but do not traverse nodes allowed by boundary filter (eg. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for visit filter, and `DCMXO` for boundary filter. >> 5. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. 
But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors. >> 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. Use `@` in options string. >> 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string. >> 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string! >> >> Example (BFS inputs): >> >> (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") >> dist dump >> --------------------------------------------- >> 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] >> 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> >> >> Example (BFS control inputs): >> >> (rr) p find_node(163)->dump_bfs(5,0,"c+") >> dist dump >> --------------------------------------------- >> 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] >> 4 166 
CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 1 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) >> 0 163 OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 >> >> We see the control flow of a strip mined loop. >> >> >> Experiment (BFS only data, but display all nodes on boundary) >> >> (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") >> dist dump >> --------------------------------------------- >> 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) >> 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) >> 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) >> 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) >> 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) >> 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) >> 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] >> >> We see the dependent output nodes of the data-phi 102, we see that a SafePoint and the Return depend on it. Here colors are really helpful, as it makes it easy to separate the data-nodes (blue) from the boundary-nodes (other colors). 
>> >> Example with Mach nodes: >> >> (rr) p find_node(280)->dump_bfs(2,0,"cdmxo+ at B") >> dist [block head idom depth] old dump >> --------------------------------------------- >> 2 B6 379 377 4 o118 38 sarI_rReg_CL === _ 39 40 [[ 41 36 31 31 71 75 66 82 86 103 116 148 152 161 119 119 184 186 170 281 268 ]] !jvms: String::length @ bci:9 (line 1487) ByteVector::putUTF8 @ bci:1 (line 285) >> 2 B52 441 277 23 o738 283 incI_rReg === _ 285 [[ 284 285 281 ]] #1/0x00000001 !jvms: ByteVector::putUTF8 @ bci:131 (line 300) >> 2 B50 277 439 22 o756 282 IfTrue === 273 [[ 441 ]] #1 !jvms: ByteVector::putUTF8 @ bci:100 (line 302) >> 1 B52 441 277 23 o737 281 compI_rReg === _ 283 38 [[ 280 ]] >> 1 B52 441 277 23 _ 441 Region === 441 282 [[ 441 280 290 ]] >> 0 B52 441 277 23 o757 280 jmpLoopEnd === 441 281 [[ 279 347 ]] P=0.500000, C=21462.000000 !jvms: ByteVector::putUTF8 @ bci:79 (line 300) >> >> And the query on the old nodes: >> >> (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#") >> dist dump >> --------------------------------------------- >> 2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]] >> 2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]] >> 2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000 >> 1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]] >> 1 o740 Bool === _ o739 [[ o741 ]] [lt] >> 1 o179 IfTrue === o178 [[ o741 ]] #1 >> 0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000 >> >> >> **Exploring loop body** >> When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. >> `loop_end->print_bfs(20, loop_head, "c+")` >> This provides us with a shortest control path, given this path has a distance of at most 20. 
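A minimal sketch of the idea behind such a bounded shortest-path query, not the actual `dump_bfs` implementation: a breadth-first search along input edges that stops expanding at a maximum distance and keeps one parent pointer per node, so that a single shortest path can be read back. The class `BoundedBfs`, the method names and the small adjacency map are hypothetical; the node numbers are only loosely modeled on a control chain like the ones dumped in this thread.

```java
import java.util.*;

class BoundedBfs {
    // Returns one shortest path from start to target along the given input edges,
    // or an empty list if the target is not reachable within maxDist steps.
    static List<Integer> shortestPath(Map<Integer, List<Integer>> inputs,
                                      int start, int target, int maxDist) {
        Map<Integer, Integer> dist = new HashMap<>();   // node -> distance from start
        Map<Integer, Integer> parent = new HashMap<>(); // node -> node it was reached from
        Deque<Integer> queue = new ArrayDeque<>();
        dist.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            int n = queue.poll();
            if (n == target) break;            // BFS reaches the target on a shortest path
            int d = dist.get(n);
            if (d == maxDist) continue;        // do not expand beyond the distance limit
            for (int in : inputs.getOrDefault(n, List.of())) {
                if (!dist.containsKey(in)) {
                    dist.put(in, d + 1);
                    parent.put(in, n);
                    queue.add(in);
                }
            }
        }
        if (!dist.containsKey(target)) return List.of();
        LinkedList<Integer> path = new LinkedList<>();
        for (Integer n = target; n != null; n = parent.get(n)) path.addFirst(n);
        return path;                           // ordered from start to target
    }

    // Hypothetical control-only input edges (illustrative, not taken from a real VM).
    static Map<Integer, List<Integer>> exampleControlInputs() {
        return Map.of(741, List.of(179), 179, List.of(178), 178, List.of(149),
                      149, List.of(148), 148, List.of(746), 746, List.of(745, 190));
    }

    public static void main(String[] args) {
        System.out.println(shortestPath(exampleControlInputs(), 741, 746, 20));
        System.out.println(shortestPath(exampleControlInputs(), 741, 746, 4));
    }
}
```

With the limit lowered to 4 the sketch returns an empty path, which mirrors the caveat that the path is only found if its distance is at most the given bound.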
>> >> Example (shortest path over control nodes): >> >> (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+") >> dist dump >> --------------------------------------------- >> 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> >> Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lay on a path between the start and target node, with at most the specified path length. 
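One way such an all-paths query could be computed is sketched below; this is an assumption for illustration, not a description of how `dump_bfs` does it. The sketch runs one BFS from the start node along input edges and one BFS from the target node along the reversed edges, then keeps every node whose two shortest distances add up to at most the limit. The class `AllPathsQuery`, its methods and the example edge map are hypothetical.

```java
import java.util.*;

class AllPathsQuery {
    // Plain BFS: shortest distance from 'from' to every reachable node.
    static Map<Integer, Integer> bfs(Map<Integer, List<Integer>> edges, int from) {
        Map<Integer, Integer> dist = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        dist.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            int n = queue.poll();
            for (int m : edges.getOrDefault(n, List.of())) {
                if (dist.putIfAbsent(m, dist.get(n) + 1) == null) queue.add(m);
            }
        }
        return dist;
    }

    // Builds the reversed graph (input edges become output edges).
    static Map<Integer, List<Integer>> reverse(Map<Integer, List<Integer>> edges) {
        Map<Integer, List<Integer>> rev = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> e : edges.entrySet()) {
            for (int in : e.getValue()) {
                rev.computeIfAbsent(in, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return rev;
    }

    // node -> (distance from start + distance to target), for all nodes lying on
    // some start-to-target path of length at most 'limit'.
    static SortedMap<Integer, Integer> allPaths(Map<Integer, List<Integer>> inputs,
                                                int start, int target, int limit) {
        Map<Integer, Integer> fromStart = bfs(inputs, start);
        Map<Integer, Integer> toTarget = bfs(reverse(inputs), target);
        SortedMap<Integer, Integer> result = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : fromStart.entrySet()) {
            Integer t = toTarget.get(e.getKey());
            if (t != null && e.getValue() + t <= limit) result.put(e.getKey(), e.getValue() + t);
        }
        return result;
    }

    // Hypothetical input edges, loosely modeled on a counted loop body.
    static Map<Integer, List<Integer>> exampleInputs() {
        Map<Integer, List<Integer>> g = new HashMap<>();
        g.put(741, List.of(179, 740)); g.put(179, List.of(178));
        g.put(178, List.of(149, 177)); g.put(149, List.of(148));
        g.put(148, List.of(746, 147)); g.put(740, List.of(739));
        g.put(739, List.of(186, 79));  g.put(186, List.of(141, 51));
        g.put(141, List.of(746, 36, 186)); g.put(177, List.of(176));
        g.put(176, List.of(166, 169)); g.put(166, List.of(149, 7, 164));
        g.put(147, List.of(146));      g.put(146, List.of(141, 79));
        g.put(746, List.of(745, 190));
        return g;
    }

    public static void main(String[] args) {
        System.out.println(allPaths(exampleInputs(), 741, 746, 8));
    }
}
```

Nodes that are reachable from the start but have no path to the target (for example pure loop-invariant inputs) are dropped, because they have no distance in the reversed search.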
>> >> Example (all paths between two nodes): >> >> (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") >> dist apd dump >> --------------------------------------------- >> 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) >> 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) >> 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) >> 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) >> 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) >> 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) >> 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all paths distance `apd` is the sum of the shortest path from the current node to the start plus the shortest path to the target node. Thus, we can easily compute the distance to the target node with `apd - d`. >> >> An alternative way to detect loops quickly is to run an all paths query from a node to itself: >> >> Example (loop detection with all paths): >> >> (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A") >> dist apd dump >> --------------------------------------------- >> 6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303) >> 5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) >> 4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) >> 2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) >> 0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) >> >> We get the loop control, plus the loop-back `190 IfTrue`. >> >> Example (loop detection with all paths for phi): >> >> (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A") >> dist apd dump >> --------------------------------------------- >> 1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... 
!jvms: StringLatin1::replace @ bci:13 (line 303) >> 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) >> >> >> **Color examples** >> Colors are especially useful to see chains between nodes (options character `#`). >> The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. >> Tip: it can be worth it to configure the colors of your terminal to be more appealing. >> >> Example (find control dependency of data node): >> ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) >> We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. >> >> Example (find memory dependency of data node): >> ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) >> >> Example (loop detection): >> ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) >> We find the control and some data loop paths. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > response to review feedback Thanks for addressing my comments, looks good! ------------- Marked as reviewed by rcastanedalo (Committer). PR: https://git.openjdk.java.net/jdk/pull/8468 From chagedorn at openjdk.java.net Wed Jun 8 10:45:33 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Wed, 8 Jun 2022 10:45:33 GMT Subject: Integrated: 8285965: TestScenarios.java does not check for "" correctly In-Reply-To: References: Message-ID: <2fPNY2Tn9Pk30Fr2IDXyzcd3FdQGlKIWK4TkgOBFSWY=.d3e6ba8a-b102-490e-9b72-62ddfce7fffe@github.com> On Wed, 11 May 2022 06:13:12 GMT, Christian Hagedorn wrote: > This is another rare occurrence of `` that is not handled correctly by `TestScenarios.java`. 
> > We wrongly search for this safepoint message in the test VM output with `getTestVMOutput()`: > > https://github.com/openjdk/jdk/blob/9c2548414c71b4caaad6ad9e1b122f474e705300/test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/Utils.java#L44-L53 > > But this does not help since the IR matcher is parsing the `hotspot_pid` file for IR matching and not the test VM output. We could therefore find this safepoint message in the `hotspot_pid` file and bail out of IR matching while the test VM output does not contain it. This makes `TestScenarios.java` fail. > > The fix we did for other IR framework tests is to redirect the output of the JTreg test VM itself to a stream in order to search it for ``. We are dumping this message as part of a warning when the IR matcher bails out: > > https://github.com/openjdk/jdk/blob/9c2548414c71b4caaad6ad9e1b122f474e705300/test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/IRMatcher.java#L86-L96 > > Output for the reported failure: > > Scenario #3 - [-XX:TLABRefillWasteFraction=53]: > [...] > Found , bail out of IR matching > > > I suggest using the same fix for `TestScenarios`. > > Thanks, > Christian This pull request has now been integrated. Changeset: 6e3e470d Author: Christian Hagedorn URL: https://git.openjdk.java.net/jdk/commit/6e3e470dac80d3b6c3a0f4845ce4115858178dd3 Stats: 38 lines in 2 files changed: 11 ins; 16 del; 11 mod 8285965: TestScenarios.java does not check for "" correctly Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8647 From fjiang at openjdk.java.net Wed Jun 8 10:50:31 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Wed, 8 Jun 2022 10:50:31 GMT Subject: RFR: 8287970: riscv: jdk/incubator/vector/*VectorTests failing [v2] In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 09:08:52 GMT, Yadong Wang wrote: > Please make sure SuperWord works properly without the match rule of PopCountVI and PopCountVL. 
See hotspot/jtreg/compiler/vectorization/TestPopCountVector.java and hotspot/jtreg/compiler/vectorization/TestPopCountVectorLong.java. TestPopCountVector.java and TestPopCountVectorLong.java still passed with this change. ------------- PR: https://git.openjdk.java.net/jdk/pull/9079 From fjiang at openjdk.java.net Wed Jun 8 10:50:32 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Wed, 8 Jun 2022 10:50:32 GMT Subject: RFR: 8287970: riscv: jdk/incubator/vector/*VectorTests failing [v2] In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 07:19:18 GMT, Vladimir Kozlov wrote: >> Feilong Jiang has updated the pull request incrementally with one additional commit since the last revision: >> >> disable Op_PopCountV* in op_vec_supported > > Can you instead add Op_PopCountVI to op_vec_supported()? > Current change looks trivial and fine too. @vnkozlov @dean-long @RealFYang @yadongw -- Thanks for the review! ------------- PR: https://git.openjdk.java.net/jdk/pull/9079 From jbhateja at openjdk.java.net Wed Jun 8 11:53:34 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 8 Jun 2022 11:53:34 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v3] In-Reply-To: References: Message-ID: On Wed, 18 May 2022 07:01:00 GMT, Haomin wrote: >> static void test_fun(byte[] a0, int[] b0, byte[] c0) { >> for (int i=0; i> c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); >> } >> } >> >> >> when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. >> >> It's executed on x86 would create an assert error. >> >> >> # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 >> # assert(false) failed: not supported: byte >> >> >> RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. 
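The difference described above can be reproduced in plain Java. The sketch below is only an illustration with hypothetical names (`ByteRotateDemo`, `javaSemantics`, `rotate8`); it compares what the Java expression computes after promoting the byte to `int` with what a rotate confined to the 8-bit lane would compute.

```java
class ByteRotateDemo {
    // What the Java source expression means: the byte is promoted to int first,
    // so both shifts act on 32 bits (the shift distance -7 is masked to 25).
    static byte javaSemantics(byte a) {
        return (byte) (a << 7 | a >>> -7);
    }

    // A rotate left by 7 performed within the 8-bit lane only, which is what a
    // byte-lane vector rotate would compute (assumption for illustration).
    static byte rotate8(byte a) {
        int u = a & 0xFF;                       // zero-extended lane value
        return (byte) ((u << 7 | u >>> 1) & 0xFF);
    }

    public static void main(String[] args) {
        byte a = (byte) 0x81;                   // negative, so sign extension matters
        System.out.println("int-promoted: " + javaSemantics(a));
        System.out.println("8-bit rotate: " + rotate8(a));
    }
}
```

For non-negative inputs such as `0x01` both variants agree, so a test that only uses small positive values would not notice the mismatch.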
> > Haomin has updated the pull request incrementally with one additional commit since the last revision: > > merge the two cases into one You are seeing this problem in auto-vectorizer flow which is folding (byte)(a0[i] << (7) | a0[i] >>> (-7)); into scalar rotate IR. I think we can remove unsupported type assertion from VectorNode::is_vector_rotate_supported since there is no direct x86 vector rotate instruction. There are ways to handle it by unpacking sub-word lanes to integer lanes followed by vector rotation but it's currently not supported and also we saw better performance on dismantling rotate into shifts and OR operations. src/hotspot/share/opto/vectornode.cpp line 157: > 155: case T_LONG: return Op_RotateLeftV; > 156: default: return 0; // RotateLeftV for byte, short values produces incorrect Java result. > 157: // Because java code should convert a byte, short value into int value, Current handling creates vector rotate IR nodes for all integral types and later on dismantles them into constituent operations (SHIFTs and ORs). The idea behind this is to intrinsify lanewise vector operation for any integral type. Your above change will prevent intrinsification and test may still pass. ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Wed Jun 8 12:41:39 2022 From: duke at openjdk.java.net (Swati Sharma) Date: Wed, 8 Jun 2022 12:41:39 GMT Subject: RFR: 8287525: Extend IR annotation with new options to test specific target feature. [v2] In-Reply-To: References: Message-ID: <8gCc37Pvuj_As6kgJgBNw6evHP7YC88kEDP8Sxa1Uf8=.475b9602-9568-48e7-8458-f3a24a8b1ed5@github.com> > Hi All, > > Currently test invocations are guarded by @requires vm.cpu.feature tags which are specified as the part of test tag specifications. This results into generating multiple test cases if some test points in a test file needs to be guarded by a specific features while others should still be executed in absence of missing target feature. 
> > This is especially important for IR-check-based validation since C2 IR node creation may heavily rely on the existence of a specific target feature. Also, the test harness executes test points only if all the constraints specified in tag specifications are met, thus imposing an OR semantics between @requires tag based CPU features becomes tricky. > > Patch extends existing @IR annotation with following two new options:- > > - applyIfTargetFeatureAnd: > Accepts a list of feature pairs where each pair is composed of target feature string followed by a true/false value where a true value necessitates existence of target feature and vice-versa. IR verification checks are enforced only if all the specified feature constraints are met. > - applyIfTargetFeatureOr: Accepts similar arguments as above option but IR verification checks are enforced only when at least one of the specified feature constraints is met. > > Example usage: > @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureOr = {"avx512bw", "true", "avx512f", "true"}) > @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureAnd = {"avx512bw", "true", "avx512f", "true"}) > > Please review and share your feedback. > > Thanks, > Swati Swati Sharma has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'openjdk:master' into JDK-8287525 - 8287525: Extend IR annotation with new options to test specific target feature. 
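The AND/OR semantics of the two proposed options can be modeled with a few lines of Java. This is a sketch under stated assumptions, not the IR framework's code: the class `TargetFeatureRules`, its methods, and the representation of the CPU features as a `Set<String>` are hypothetical; only the flat "feature, value, feature, value" array layout is taken from the example usage in the PR description.

```java
import java.util.*;

class TargetFeatureRules {
    // A single constraint holds when the presence of the feature on the CPU
    // equals the requested "true"/"false" value.
    static boolean holds(String feature, String value, Set<String> cpuFeatures) {
        return cpuFeatures.contains(feature) == Boolean.parseBoolean(value);
    }

    // AND semantics: every (feature, value) pair must hold.
    static boolean applyIfAnd(String[] pairs, Set<String> cpuFeatures) {
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            if (!holds(pairs[i], pairs[i + 1], cpuFeatures)) return false;
        }
        return true;
    }

    // OR semantics: at least one (feature, value) pair must hold.
    static boolean applyIfOr(String[] pairs, Set<String> cpuFeatures) {
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            if (holds(pairs[i], pairs[i + 1], cpuFeatures)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> cpu = Set.of("avx512f");   // hypothetical machine without avx512bw
        String[] pairs = {"avx512bw", "true", "avx512f", "true"};
        System.out.println("Or:  " + applyIfOr(pairs, cpu));
        System.out.println("And: " + applyIfAnd(pairs, cpu));
    }
}
```

On the hypothetical machine in `main`, the OR rule would enable the IR check while the AND rule would skip it, which is the distinction the two options are meant to express.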
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8999/files - new: https://git.openjdk.java.net/jdk/pull/8999/files/88e24ec3..87454472 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8999&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8999&range=00-01 Stats: 15722 lines in 583 files changed: 12093 ins; 1975 del; 1654 mod Patch: https://git.openjdk.java.net/jdk/pull/8999.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8999/head:pull/8999 PR: https://git.openjdk.java.net/jdk/pull/8999 From fjiang at openjdk.java.net Wed Jun 8 12:42:34 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Wed, 8 Jun 2022 12:42:34 GMT Subject: Integrated: 8287970: riscv: jdk/incubator/vector/*VectorTests failing In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 06:06:08 GMT, Feilong Jiang wrote: > [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added a new vector operation VectorOperations.BIT_COUNT, which needs the support of PopCountV*. The following tests failed when enabling `UseRVV`: > > jdk/incubator/vector/Byte256VectorTests.java > jdk/incubator/vector/ByteMaxVectorTests.java > jdk/incubator/vector/Int256VectorTests.java > jdk/incubator/vector/IntMaxVectorTests.java > jdk/incubator/vector/Short256VectorTests.java > jdk/incubator/vector/ShortMaxVectorTests.java > > Tests are failing with "assert(n_type->isa_vect() == __null || lrg._is_vector || ireg == Op_RegD || ireg == Op_RegL || ireg == Op_RegVectMask) failed: vector must be in vector registers" because C2 instruct "vpopcountI" stores the result into a general-purpose register (GPR) instead of a vector register. > > Currently, riscv vector extension `vpopc.m` instruction counts the number of mask elements of the active elements of the vector source mask register that has the value 1 and writes the result to a scalar x register. 
[1] `PopCountV*` needs to write back the pop counting results to vector registers, there is no single instruction in rvv that can satisfy the requirement. So we decide to remove the vpopcountI instruct for now. > > [1]: https://github.com/riscv/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf > > Additional Tests: > - [x] jdk/incubator/vector (release with UseRVV on QEMU) > - [x] hotspot:tier1 (release with UseRVV on QEMU without new failure) This pull request has now been integrated. Changeset: 5ad6286b Author: Feilong Jiang Committer: Fei Yang URL: https://git.openjdk.java.net/jdk/commit/5ad6286b73889e47f40d0051a96ef91137faa25c Stats: 14 lines in 1 file changed: 2 ins; 12 del; 0 mod 8287970: riscv: jdk/incubator/vector/*VectorTests failing Reviewed-by: kvn, fyang, dlong, yadongwang ------------- PR: https://git.openjdk.java.net/jdk/pull/9079 From duke at openjdk.java.net Wed Jun 8 12:57:31 2022 From: duke at openjdk.java.net (Swati Sharma) Date: Wed, 8 Jun 2022 12:57:31 GMT Subject: RFR: 8287525: Extend IR annotation with new options to test specific target feature. [v3] In-Reply-To: References: Message-ID: > Hi All, > > Currently test invocations are guarded by @requires vm.cpu.feature tags which are specified as the part of test tag specifications. This results into generating multiple test cases if some test points in a test file needs to be guarded by a specific features while others should still be executed in absence of missing target feature. > > This is specially important for IR checks based validation since C2 IR nodes creation may heavily rely on existence of specific target feature. Also, test harness executes test points only if all the constraints specified in tag specifications are met, thus imposing an OR semantics b/w @requires tag based CPU features becomes tricky. 
> > Patch extends existing @IR annotation with following two new options:- > > - applyIfTargetFeatureAnd: > Accepts a list of feature pairs where each pair is composed of target feature string followed by a true/false value where a true value necessitates existence of target feature and vice-versa. IR verification checks are enforced only if all the specified feature constraints are met. > - applyIfTargetFeatureOr: Accepts similar arguments as above option but IR verification checks are enforced only when at least one of the specified feature constraints is met. > > Example usage: > @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureOr = {"avx512bw", "true", "avx512f", "true"}) > @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureAnd = {"avx512bw", "true", "avx512f", "true"}) > > Please review and share your feedback. > > Thanks, > Swati Swati Sharma has updated the pull request incrementally with two additional commits since the last revision: - Merge branch 'JDK-8287525' of https://github.com/swati-sha/jdk into JDK-8287525 - 8287525: Review comments resolved. 
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8999/files - new: https://git.openjdk.java.net/jdk/pull/8999/files/87454472..c7fd621b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8999&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8999&range=01-02 Stats: 268 lines in 5 files changed: 144 ins; 90 del; 34 mod Patch: https://git.openjdk.java.net/jdk/pull/8999.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8999/head:pull/8999 PR: https://git.openjdk.java.net/jdk/pull/8999 From jbhateja at openjdk.java.net Wed Jun 8 13:25:40 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 8 Jun 2022 13:25:40 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v5] In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 23:27:23 GMT, Sandhya Viswanathan wrote: >> Currently the C2 JIT only supports float -> int and double -> long conversion for x86. >> This PR adds the support for following conversions in the c2 JIT: >> float -> long, short, byte >> double -> int, short, byte >> >> The performance gain is as follows. >> Before the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 32367.971 ? 6161.118 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 25825.251 ? 5417.104 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59641.958 ? 17307.177 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 29641.505 ? 12023.015 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 16271.224 ? 1523.083 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59199.994 ? 14357.959 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 17169.197 ? 1738.273 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 14934.139 ? 
2329.253 ops/ms >> >> After the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 115436.659 ? 21282.364 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 87194.395 ? 9443.106 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59652.356 ? 7240.721 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 110570.719 ? 10401.620 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ? 11113.137 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ? 18272.495 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ? 7249.268 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ? 8253.232 ops/ms >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Fix extra space src/hotspot/cpu/x86/x86.ad line 1892: > 1890: // Conversion to long in addition needs avx512dq > 1891: // Need avx512vl for size_in_bits < 512 > 1892: if (is_integral_type(bt) && (bt != T_INT)) { Why the special check for bt != T_INT? src/hotspot/cpu/x86/x86.ad line 7349: > 7347: assert(to_elem_bt == T_BYTE, "required"); > 7348: __ evpmovdb($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc); > 7349: } We do support F2I cast on AVX2 and that can be extended for sub-word types using signed saturated lane packing instructions (PACKSSDW and PACKSSWB). src/hotspot/cpu/x86/x86.ad line 7388: > 7386: case T_BYTE: > 7387: __ evpmovsqd($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc); > 7388: __ evpmovdb($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc); Sub-word handling can be extended for AVX2 using a packing instruction sequence similar to VectorStoreMask for quad word lanes. 
src/hotspot/cpu/x86/x86.ad line 7391: > 7389: break; > 7390: default: assert(false, "%s", type2name(to_elem_bt)); > 7391: } Please move this to a macro assembly routine named vector_castD2X_evex test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java line 45: > 43: private static final int COUNT = 16; > 44: private static final VectorSpecies fspec512 = FloatVector.SPECIES_512; > 45: private static final VectorSpecies dspec512 = DoubleVector.SPECIES_512; Unused declarations. test/micro/org/openjdk/bench/jdk/incubator/vector/VectorFPtoIntCastOperations.java line 59: > 57: @Benchmark > 58: public IntVector microFloat2Int() { > 59: return (IntVector)fvec512.convertShape(VectorOperators.F2I, IntVector.SPECIES_512, 0); We can remove explicit cast by setting return type to Vector Applicable to all cases. ------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From chagedorn at openjdk.java.net Wed Jun 8 14:15:56 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Wed, 8 Jun 2022 14:15:56 GMT Subject: RFR: 8287432: C2: assert(tn->in(0) != __null) failed: must have live top node [v2] In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 20:58:18 GMT, Devin Smith wrote: >> Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/hotspot/jtreg/compiler/c2/TestRemoveMemBarPrecEdge.java >> >> Co-authored-by: Tobias Hartmann > > I'm not qualified to review this, but I can confirm the test is triggering the `PhaseAggressiveCoalesce::coalesce` SIGSEGV against the 11/17 LTS versions I'm running. Thanks for your investigation and fix! Thanks @devinrsmith for confirming this! 
------------- PR: https://git.openjdk.java.net/jdk/pull/9060 From chagedorn at openjdk.java.net Wed Jun 8 14:15:58 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Wed, 8 Jun 2022 14:15:58 GMT Subject: Integrated: 8287432: C2: assert(tn->in(0) != __null) failed: must have live top node In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 12:16:00 GMT, Christian Hagedorn wrote: > When intrinsifying `java.lang.Thread::currentThread()`, we are creating an `AddP` node that has the `top` node as base to indicate that we do not have an oop (using `NULL` instead leads to crashes as it does not seem to be expected to have a `NULL` base): > https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/library_call.cpp#L904 > > This node is used on a chain of data nodes into two `MemBarAcquire` nodes as precedence edge in the test case: > ![Screenshot from 2022-06-07 11-12-38](https://user-images.githubusercontent.com/17833009/172344751-5338b72f-baa5-4e9e-a44c-6d970798d9f2.png) > > Later, in `final_graph_reshaping_impl()`, we are removing the precedence edge of both `MemBarAcquire` nodes and cleaning up all now dead nodes as a result of the removal: > https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/compile.cpp#L3655-L3679 > > We iteratively call `disconnect_inputs()` for all nodes that have no output anymore (i.e. dead nodes). This code, however, also treats the `top` node as dead since `outcnt()` of `top` is always zero: > https://github.com/openjdk/jdk/blob/6ff2d89ea11934bb13c8a419e7bad4fd40f76759/src/hotspot/share/opto/node.hpp#L495-L500 > > And we end up disconnecting `top` which results in the assertion failure. > > The code misses a check for `top()`. I suggest adding this check before processing a node for which `outcnt()` is zero. This is a pattern which can also be found in other places in the code. 
I've checked all other usages of `outcnt() == 0` and could not find a case where this additional `top()` check is missing. Maybe we should refactor these two checks into a single method at some point to not need to worry about `top` anymore in the future when checking if a node is dead based on the outputs. > > Thanks, > Christian This pull request has now been integrated. Changeset: 78d37126 Author: Christian Hagedorn URL: https://git.openjdk.java.net/jdk/commit/78d371266ae8a629db8176ced4d48e9521702cce Stats: 56 lines in 2 files changed: 55 ins; 0 del; 1 mod 8287432: C2: assert(tn->in(0) != __null) failed: must have live top node Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/9060 From epeter at openjdk.java.net Wed Jun 8 14:53:00 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Wed, 8 Jun 2022 14:53:00 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump Message-ID: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> **Goal** Refactor `Node::find`, allow not just searching for `node->_idx`, but also matching for `node->Name()` and even `node->dump()`. **Proposal** Refactor `Node::find` into `visit_nodes`, which visits all nodes and calls a `callback` on them. This callback can be used to filter by `idx` (`find_node_by_idx`, `Node::find`, `find_node` etc.). It can also be used to match node names (`find_node_by_name`) and even node dump (`find_node_by_dump`). Thus, I present these additional functions: `Node* find_node_by_name(const char* name)`: find all nodes matching the `name` pattern. `Node* find_node_by_dump(const char* pattern)`: find all nodes matching the `pattern`. The nodes are sorted by node idx, and then dumped. Patterns can contain `*` characters to match any characters (e.g. `Con*L` matches both `ConL` and `ConvI2L`). **Usecase** Find all `CastII` nodes. Find all `Loop` nodes. Use `find_node_by_name`. 
Find all `CastII` nodes that depend on a range check. Use `find_node_by_dump("CastII*range check dependency")`. Find all `Bool` nodes that perform a `[ne]` check. Use `find_node_by_dump("Bool*[ne]")`. Find all `Phi` nodes that are `tripcount`. Use `find_node_by_dump("Phi*tripcount")`. Find all `Load` nodes that are associated with line 301 in some file. Use `find_node_by_dump("Load*line 301")`. You can probably find more use cases yourself ;) ------------- Commit messages: - style fixes, and implemented case insensitive strstr - make matching case insensitive - missing include - ensure null termination - guard against long pattern, and fix array/pointer issues - small logic fix - refactoring of Node::find - sort find results by idx - 8287647: VM debug support: find node by name or substring in dump text Changes: https://git.openjdk.java.net/jdk/pull/8988/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8988&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287647 Stats: 242 lines in 1 file changed: 194 ins; 47 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8988.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8988/head:pull/8988 PR: https://git.openjdk.java.net/jdk/pull/8988 From duke at openjdk.java.net Wed Jun 8 16:09:33 2022 From: duke at openjdk.java.net (Yuta Sato) Date: Wed, 8 Jun 2022 16:09:33 GMT Subject: Integrated: 8286990: Add compiler name to warning messages in Compiler Directive In-Reply-To: References: Message-ID: On Mon, 9 May 2022 05:23:14 GMT, Yuta Sato wrote: > When using Compiler Directive such as `java -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile= ` , > it shows exactly the same message for the c1 and c2 compilers, and the user would be confused about > which compiler is affected by this message. > This should show messages with their compiler name so that the user knows which compiler shows this message. > > With my change, the result would look like the following. 
> > > OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > > -> > > OpenJDK 64-Bit Server VM warning: c1: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > OpenJDK 64-Bit Server VM warning: c2: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output This pull request has now been integrated. Changeset: c68419f2 Author: yuu1127 Committer: Vladimir Kozlov URL: https://git.openjdk.java.net/jdk/commit/c68419f2f778f796d410ba3d27e916ae47700af5 Stats: 23 lines in 2 files changed: 18 ins; 0 del; 5 mod 8286990: Add compiler name to warning messages in Compiler Directive Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8591 From duke at openjdk.java.net Wed Jun 8 16:17:39 2022 From: duke at openjdk.java.net (Devin Smith) Date: Wed, 8 Jun 2022 16:17:39 GMT Subject: RFR: 8287432: C2: assert(tn->in(0) != __null) failed: must have live top node [v2] In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 14:10:55 GMT, Christian Hagedorn wrote: >> I'm not qualified to review this, but I can confirm the test is triggering the `PhaseAggressiveCoalesce::coalesce` SIGSEGV against the 11/17 LTS versions I'm running. Thanks for your investigation and fix! > > Thanks @devinrsmith for confirming this! @chhagedorn, thanks again! I'm hoping this can be backported to 11/17 - let me know if there is anything I can do to make that happen. 
------------- PR: https://git.openjdk.java.net/jdk/pull/9060 From sviswanathan at openjdk.java.net Wed Jun 8 16:26:58 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 8 Jun 2022 16:26:58 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v6] In-Reply-To: References: Message-ID: > Currently the C2 JIT only supports float -> int and double -> long conversion for x86. > This PR adds the support for following conversions in the c2 JIT: > float -> long, short, byte > double -> int, short, byte > > The performance gain is as follows. > Before the patch: > Benchmark Mode Cnt Score Error Units > VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 32367.971 ± 6161.118 ops/ms > VectorFPtoIntCastOperations.microDouble2Int thrpt 3 25825.251 ± 5417.104 ops/ms > VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59641.958 ± 17307.177 ops/ms > VectorFPtoIntCastOperations.microDouble2Short thrpt 3 29641.505 ± 12023.015 ops/ms > VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 16271.224 ± 1523.083 ops/ms > VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59199.994 ± 14357.959 ops/ms > VectorFPtoIntCastOperations.microFloat2Long thrpt 3 17169.197 ± 1738.273 ops/ms > VectorFPtoIntCastOperations.microFloat2Short thrpt 3 14934.139 ± 2329.253 ops/ms > > After the patch: > Benchmark Mode Cnt Score Error Units > VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 115436.659 ± 21282.364 ops/ms > VectorFPtoIntCastOperations.microDouble2Int thrpt 3 87194.395 ± 9443.106 ops/ms > VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59652.356 ± 7240.721 ops/ms > VectorFPtoIntCastOperations.microDouble2Short thrpt 3 110570.719 ± 10401.620 ops/ms > VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ± 11113.137 ops/ms > VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ± 18272.495 ops/ms > VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ±
7249.268 ops/ms > VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ± 8253.232 ops/ms > > Please review. > > Best Regards, > Sandhya Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: Review commit resolution ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/9032/files - new: https://git.openjdk.java.net/jdk/pull/9032/files/996ee049..443e861f Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=9032&range=05 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=9032&range=04-05 Stats: 63 lines in 5 files changed: 27 ins; 19 del; 17 mod Patch: https://git.openjdk.java.net/jdk/pull/9032.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9032/head:pull/9032 PR: https://git.openjdk.java.net/jdk/pull/9032 From sviswanathan at openjdk.java.net Wed Jun 8 16:38:42 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 8 Jun 2022 16:38:42 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v6] In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 16:26:58 GMT, Sandhya Viswanathan wrote: >> Currently the C2 JIT only supports float -> int and double -> long conversion for x86. >> This PR adds the support for following conversions in the c2 JIT: >> float -> long, short, byte >> double -> int, short, byte >> >> The performance gain is as follows. >> Before the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 32367.971 ± 6161.118 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 25825.251 ± 5417.104 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59641.958 ± 17307.177 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 29641.505 ± 12023.015 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 16271.224 ± 1523.083 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59199.994 ±
14357.959 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 17169.197 ± 1738.273 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 14934.139 ± 2329.253 ops/ms >> >> After the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 115436.659 ± 21282.364 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 87194.395 ± 9443.106 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59652.356 ± 7240.721 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 110570.719 ± 10401.620 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ± 11113.137 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ± 18272.495 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ± 7249.268 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ± 8253.232 ops/ms >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Review commit resolution @jatin-bhateja I have implemented your review comments. Please take a look.
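For readers following this thread, the scalar Java semantics that the vectorized float/double to integral casts have to reproduce can be sketched roughly as below. This is an illustration only and not code from the PR: the class `FpToIntegralCastDemo` and its helper methods are hypothetical names made up for this sketch. It relies on the Java language rule that a cast from float or double to byte or short first converts to int (round toward zero, NaN becomes 0, out-of-range values saturate) and then narrows by truncation.

```java
public class FpToIntegralCastDemo {
    // Hypothetical helper: the kind of scalar loop the auto-vectorizer would see.
    static byte[] floatToByte(float[] src) {
        byte[] dst = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            // float -> int (saturating, NaN -> 0), then int -> byte (keeps low 8 bits)
            dst[i] = (byte) src[i];
        }
        return dst;
    }

    // Hypothetical helper: double -> short goes through int in the same way.
    static short[] doubleToShort(double[] src) {
        short[] dst = new short[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = (short) src[i];
        }
        return dst;
    }

    // Hypothetical helper: float -> long converts directly and saturates at the long range.
    static long[] floatToLong(float[] src) {
        long[] dst = new long[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = (long) src[i];
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] b = floatToByte(new float[] {300.7f, Float.NaN, 1e10f, -1.5f});
        System.out.println(java.util.Arrays.toString(b));   // [44, 0, -1, -1]

        short[] s = doubleToShort(new double[] {70000.9, -1e300, Double.NaN});
        System.out.println(java.util.Arrays.toString(s));   // [4464, 0, 0]

        long[] l = floatToLong(new float[] {1e30f, -2.9f});
        System.out.println(l[0] == Long.MAX_VALUE);          // true
        System.out.println(l[1]);                            // -2
    }
}
```

The saturate-then-truncate behaviour in the sub-word cases is why a vector implementation cannot simply narrow lanes with a saturating pack in every situation without checking that the results still match these scalar values.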
------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From sviswanathan at openjdk.java.net Wed Jun 8 16:38:45 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 8 Jun 2022 16:38:45 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v5] In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 12:47:37 GMT, Jatin Bhateja wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix extra space > > src/hotspot/cpu/x86/x86.ad line 7349: > >> 7347: assert(to_elem_bt == T_BYTE, "required"); >> 7348: __ evpmovdb($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc); >> 7349: } > > We do support F2I cast on AVX2 and that can be extended for sub-word types using > signed saturated lane packing instructions (PACKSSDW and PACKSSWB). I will file a separate RFE for this. > src/hotspot/cpu/x86/x86.ad line 7388: > >> 7386: case T_BYTE: >> 7387: __ evpmovsqd($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc); >> 7388: __ evpmovdb($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc); > > Sub-word handling can be extended for AVX2 using a packing instruction sequence similar to VectorStoreMask for quadword lanes. D2X in general needs AVX-512 due to evcvttpd2qq.
------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From sviswanathan at openjdk.java.net Wed Jun 8 16:38:46 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 8 Jun 2022 16:38:46 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v5] In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 16:31:06 GMT, Sandhya Viswanathan wrote: >> src/hotspot/cpu/x86/x86.ad line 7349: >> >>> 7347: assert(to_elem_bt == T_BYTE, "required"); >>> 7348: __ evpmovdb($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc); >>> 7349: } >> >> We do support F2I cast on AVX2 and that can be extended for sub-word types using >> signed saturated lane packing instructions (PACKSSDW and PACKSSWB). > > I will file a separate RFE for this. Link to RFE: https://bugs.openjdk.org/browse/JDK-8288043 ------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From shade at openjdk.java.net Wed Jun 8 16:59:13 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 8 Jun 2022 16:59:13 GMT Subject: RFR: 8287493: 32-bit Windows build failure in codeBlob.cpp after JDK-8283689 Message-ID: <70DP3JRe1qkz7bsQudwx0bMcacs1N7IVm1uVXpdgoaQ=.87a30276-6775-4fb8-bf9b-21f2dc2788a1@github.com> See the bug report for example build failure. This breakage is widely seen on many build servers. The fix follows what [https://bugs.openjdk.org/browse/JDK-8210803](JDK-8210803) did for other blobs. 
Additional testing: - [ ] Windows x86_32 build ------------- Commit messages: - Fix Changes: https://git.openjdk.java.net/jdk/pull/9088/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9088&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287493 Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/9088.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9088/head:pull/9088 PR: https://git.openjdk.java.net/jdk/pull/9088 From kvn at openjdk.java.net Wed Jun 8 17:22:40 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 8 Jun 2022 17:22:40 GMT Subject: RFR: 8287493: 32-bit Windows build failure in codeBlob.cpp after JDK-8283689 In-Reply-To: <70DP3JRe1qkz7bsQudwx0bMcacs1N7IVm1uVXpdgoaQ=.87a30276-6775-4fb8-bf9b-21f2dc2788a1@github.com> References: <70DP3JRe1qkz7bsQudwx0bMcacs1N7IVm1uVXpdgoaQ=.87a30276-6775-4fb8-bf9b-21f2dc2788a1@github.com> Message-ID: On Wed, 8 Jun 2022 16:50:22 GMT, Aleksey Shipilev wrote: > See the bug report for example build failure. This breakage is widely seen on many build servers. The fix follows what [https://bugs.openjdk.org/browse/JDK-8210803](JDK-8210803) did for other blobs. > > Additional testing: > - [x] Windows x86_32 fastdebug build > - [x] Linux x86_64 fastdebug `java/foreign` I would say it is trivial. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/9088 From kvn at openjdk.java.net Wed Jun 8 17:55:25 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 8 Jun 2022 17:55:25 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long [v2] In-Reply-To: References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Wed, 8 Jun 2022 09:39:23 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch implements intrinsics for `Integer/Long::compareUnsigned` using the same approach as the JVM does for long and floating-point comparisons. This allows efficient and reliable usage of unsigned comparison in Java, which is a basic operation and is important for range checks such as discussed in #8620 . >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: > > - remove comments > - review comments Good. I submitted testing. You need second review. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/9068 From zgu at openjdk.java.net Wed Jun 8 18:25:36 2022 From: zgu at openjdk.java.net (Zhengyu Gu) Date: Wed, 8 Jun 2022 18:25:36 GMT Subject: RFR: 8287493: 32-bit Windows build failure in codeBlob.cpp after JDK-8283689 In-Reply-To: <70DP3JRe1qkz7bsQudwx0bMcacs1N7IVm1uVXpdgoaQ=.87a30276-6775-4fb8-bf9b-21f2dc2788a1@github.com> References: <70DP3JRe1qkz7bsQudwx0bMcacs1N7IVm1uVXpdgoaQ=.87a30276-6775-4fb8-bf9b-21f2dc2788a1@github.com> Message-ID: On Wed, 8 Jun 2022 16:50:22 GMT, Aleksey Shipilev wrote: > See the bug report for example build failure. This breakage is widely seen on many build servers. The fix follows what [https://bugs.openjdk.org/browse/JDK-8210803](JDK-8210803) did for other blobs. > > Additional testing: > - [x] Windows x86_32 fastdebug build > - [x] Linux x86_64 fastdebug `java/foreign` Changes requested by zgu (Reviewer). 
Marked as reviewed by zgu (Reviewer). src/hotspot/share/code/codeBlob.hpp line 773: > 771: // below two-argument operator delete will be treated as a placement > 772: // delete rather than an ordinary sized delete; see C++14 3.7.4.2/p2. > 773: void operator delete(void* p); I suggest to move this delete operator to RuntimeBlob to avoid adding this method in potential new subclasses. ------------- PR: https://git.openjdk.java.net/jdk/pull/9088 From shade at openjdk.java.net Wed Jun 8 18:25:37 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 8 Jun 2022 18:25:37 GMT Subject: RFR: 8287493: 32-bit Windows build failure in codeBlob.cpp after JDK-8283689 In-Reply-To: References: <70DP3JRe1qkz7bsQudwx0bMcacs1N7IVm1uVXpdgoaQ=.87a30276-6775-4fb8-bf9b-21f2dc2788a1@github.com> Message-ID: On Wed, 8 Jun 2022 18:18:58 GMT, Zhengyu Gu wrote: >> See the bug report for example build failure. This breakage is widely seen on many build servers. The fix follows what [https://bugs.openjdk.org/browse/JDK-8210803](JDK-8210803) did for other blobs. >> >> Additional testing: >> - [x] Windows x86_32 fastdebug build >> - [x] Linux x86_64 fastdebug `java/foreign` > > src/hotspot/share/code/codeBlob.hpp line 773: > >> 771: // below two-argument operator delete will be treated as a placement >> 772: // delete rather than an ordinary sized delete; see C++14 3.7.4.2/p2. >> 773: void operator delete(void* p); > > I suggest to move this delete operator to RuntimeBlob to avoid adding this method in potential new subclasses. Not sure how well that works with non-virtual operators. We can consider that as the follow-up cleanup. Meanwhile, I'd like to push this one trivial build fix, OK? 
------------- PR: https://git.openjdk.java.net/jdk/pull/9088 From zgu at openjdk.java.net Wed Jun 8 18:25:37 2022 From: zgu at openjdk.java.net (Zhengyu Gu) Date: Wed, 8 Jun 2022 18:25:37 GMT Subject: RFR: 8287493: 32-bit Windows build failure in codeBlob.cpp after JDK-8283689 In-Reply-To: References: <70DP3JRe1qkz7bsQudwx0bMcacs1N7IVm1uVXpdgoaQ=.87a30276-6775-4fb8-bf9b-21f2dc2788a1@github.com> Message-ID: On Wed, 8 Jun 2022 18:20:38 GMT, Aleksey Shipilev wrote: >> src/hotspot/share/code/codeBlob.hpp line 773: >> >>> 771: // below two-argument operator delete will be treated as a placement >>> 772: // delete rather than an ordinary sized delete; see C++14 3.7.4.2/p2. >>> 773: void operator delete(void* p); >> >> I suggest to move this delete operator to RuntimeBlob to avoid adding this method in potential new subclasses. > > Not sure how well that works with non-virtual operators. We can consider that as the follow-up cleanup. Meanwhile, I'd like to push this one trivial build fix, OK? Okay then. ------------- PR: https://git.openjdk.java.net/jdk/pull/9088 From alanb at openjdk.java.net Wed Jun 8 18:31:33 2022 From: alanb at openjdk.java.net (Alan Bateman) Date: Wed, 8 Jun 2022 18:31:33 GMT Subject: RFR: 8287493: 32-bit Windows build failure in codeBlob.cpp after JDK-8283689 In-Reply-To: <70DP3JRe1qkz7bsQudwx0bMcacs1N7IVm1uVXpdgoaQ=.87a30276-6775-4fb8-bf9b-21f2dc2788a1@github.com> References: <70DP3JRe1qkz7bsQudwx0bMcacs1N7IVm1uVXpdgoaQ=.87a30276-6775-4fb8-bf9b-21f2dc2788a1@github.com> Message-ID: On Wed, 8 Jun 2022 16:50:22 GMT, Aleksey Shipilev wrote: > See the bug report for example build failure. This breakage is widely seen on many build servers. The fix follows what [https://bugs.openjdk.org/browse/JDK-8210803](JDK-8210803) did for other blobs. > > Additional testing: > - [x] Windows x86_32 fastdebug build > - [x] Linux x86_64 fastdebug `java/foreign` Marked as reviewed by alanb (Reviewer). 
------------- PR: https://git.openjdk.java.net/jdk/pull/9088 From jvernee at openjdk.java.net Wed Jun 8 18:58:31 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 8 Jun 2022 18:58:31 GMT Subject: RFR: 8287493: 32-bit Windows build failure in codeBlob.cpp after JDK-8283689 In-Reply-To: <70DP3JRe1qkz7bsQudwx0bMcacs1N7IVm1uVXpdgoaQ=.87a30276-6775-4fb8-bf9b-21f2dc2788a1@github.com> References: <70DP3JRe1qkz7bsQudwx0bMcacs1N7IVm1uVXpdgoaQ=.87a30276-6775-4fb8-bf9b-21f2dc2788a1@github.com> Message-ID: On Wed, 8 Jun 2022 16:50:22 GMT, Aleksey Shipilev wrote: > See the bug report for example build failure. This breakage is widely seen on many build servers. The fix follows what [https://bugs.openjdk.org/browse/JDK-8210803](JDK-8210803) did for other blobs. > > Additional testing: > - [x] Windows x86_32 fastdebug build > - [x] Linux x86_64 fastdebug `java/foreign` > - [x] Windows x86_32 fastdebug `java/foreign` LGTM. I thought this wasn't needed since the comment talks about a two-argument operator delete, which we don't have in this class. When I tried removing it everything seemed fine so I went with that. Maybe good to update the comment here as well... (and maybe even in other places in codeBlob.hpp as well). Since the issue only seems to occur with MSVC on 32-bit platforms (?), I think it would be good to mention that in the comment in case people wonder if removing this operator will cause issues in the future (like me). ------------- Marked as reviewed by jvernee (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/9088 From kvn at openjdk.java.net Wed Jun 8 19:40:33 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 8 Jun 2022 19:40:33 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump In-Reply-To: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: <31kbCZuU6EzZCELnZLbs4BvbFFz2n8u2CfXKt4PLrIs=.c3bc76db-5e86-4d97-8ffe-e94d4b5a220b@github.com> On Thu, 2 Jun 2022 09:16:28 GMT, Emanuel Peter wrote: > **Goal** > Refactor `Node::find`, allow not just searching for `node->_idx`, but also matching for `node->Name()` and even `node->dump()`. > > **Proposal** > Refactor `Node::find` into `visit_nodes`, which visits all nodes and calls a `callback` on them. This callback can be used to filter by `idx` (`find_node_by_idx`, `Node::find`, `find_node` etc.). It can also be used to match node names (`find_node_by_name`) and even node dump (`find_node_by_dump`). > > Thus, I present these additional functions: > `Node* find_node_by_name(const char* name)`: find all nodes matching the `name` pattern. > `Node* find_node_by_dump(const char* pattern)`: find all nodes matching the `pattern`. > The nodes are sorted by node idx, and then dumped. > > Patterns can contain `*` characters to match any characters (e.g. `Con*L` matches both `ConL` and `ConvI2L`) > > **Usecase** > Find all `CastII` nodes. Find all `Loop` nodes. Use `find_node_by_name`. > > Find all `CastII` nodes that depend on a rangecheck. Use `find_node_by_dump("CastII*range check dependency")`. > Find all `Bool` nodes that perform a `[ne]` check. Use `find_node_by_dump("Bool*[ne]")`. > Find all `Phi` nodes that are `tripcount`. Use `find_node_by_dump("Phi*tripcount")`. > > Find all `Load` nodes that are associated with line 301 in some file. Use `find_node_by_dump("Load*line 301")`.
> > You can probably find more usecases yourself ;) Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8988 From kvn at openjdk.java.net Wed Jun 8 23:57:32 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 8 Jun 2022 23:57:32 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long [v2] In-Reply-To: References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Wed, 8 Jun 2022 09:39:23 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch implements intrinsics for `Integer/Long::compareUnsigned` using the same approach as the JVM does for long and floating-point comparisons. This allows efficient and reliable usage of unsigned comparison in Java, which is a basic operation and is important for range checks such as discussed in #8620 . >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: > > - remove comments > - review comments Tier1-4 testing passed - no new failures. I suggest to push it into JDK 20 after fork and after you get second review. ------------- PR: https://git.openjdk.java.net/jdk/pull/9068 From xgong at openjdk.java.net Thu Jun 9 01:27:38 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 9 Jun 2022 01:27:38 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 02:56:52 GMT, Xiaohong Gong wrote: >> And I don't see in(2)->Opcode() == Op_VectorMaskGen check. >Yes, the Op_VectorMaskGen is not generated for MaskAll when its input is a constant. We directly transform the MaskAll to VectorMaskGen here, since they two have the same meanings. Thanks! I'm sorry that my comment in line-1819 is not right which misunderstood you. I will change this later. Thanks! 
------------- PR: https://git.openjdk.java.net/jdk/pull/9037 From duke at openjdk.java.net Thu Jun 9 01:31:50 2022 From: duke at openjdk.java.net (Haomin) Date: Thu, 9 Jun 2022 01:31:50 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v4] In-Reply-To: References: Message-ID: <1GiKCDIfJmLCG_QEwYNsd75Y2IfXxrdR8vs0iHiDc8Y=.e2a170bd-5626-4c79-9fd9-c627f3cd387f@github.com> > static void test_fun(byte[] a0, int[] b0, byte[] c0) { > for (int i=0; i c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); > } > } > > > when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. > > It's executed on x86 would create an assert error. > > > # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 > # assert(false) failed: not supported: byte > > > RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. 
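To make the mismatch described above concrete, here is a minimal sketch, not taken from the PR. The class `ByteRotateDemo` and both method names are hypothetical. `javaSemantics` is what the loop body of `test_fun` computes under Java's rules (the byte is promoted to int, rotated as a 32-bit value, then narrowed), while `laneRotate` is an assumed model of a rotate carried out inside an 8-bit vector lane, which is presumably what a byte RotateLeftV would do.

```java
public class ByteRotateDemo {
    // What the Java expression computes: b is promoted to int, so this is a
    // 32-bit rotate (the shift count -7 is masked to 25), then narrowed to byte.
    static byte javaSemantics(byte b) {
        return (byte) (b << 7 | b >>> -7);
    }

    // Assumed model of a rotate-left by 7 performed within a single 8-bit lane.
    static byte laneRotate(byte b) {
        int u = b & 0xFF;                  // treat the lane as 8 unsigned bits
        return (byte) (u << 7 | u >>> 1);  // rotl by 7 within 8 bits
    }

    public static void main(String[] args) {
        byte b = 2;
        System.out.println(javaSemantics(b)); // 0
        System.out.println(laneRotate(b));    // 1
    }
}
```

For the input 2 the two results differ (0 versus 1), which is the kind of interpreter/C2 disagreement reported here; for some inputs, such as 1, both happen to give -128, so a test needs inputs with bits that wrap around inside the byte.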
Haomin has updated the pull request incrementally with one additional commit since the last revision: delete assert(false, "not supported: %s", type2name(bt)) ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8740/files - new: https://git.openjdk.java.net/jdk/pull/8740/files/ba5428de..bbc65c8e Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8740&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8740&range=02-03 Stats: 15 lines in 1 file changed: 0 ins; 13 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8740.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8740/head:pull/8740 PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Thu Jun 9 01:43:32 2022 From: duke at openjdk.java.net (Haomin) Date: Thu, 9 Jun 2022 01:43:32 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v4] In-Reply-To: <1GiKCDIfJmLCG_QEwYNsd75Y2IfXxrdR8vs0iHiDc8Y=.e2a170bd-5626-4c79-9fd9-c627f3cd387f@github.com> References: <1GiKCDIfJmLCG_QEwYNsd75Y2IfXxrdR8vs0iHiDc8Y=.e2a170bd-5626-4c79-9fd9-c627f3cd387f@github.com> Message-ID: On Thu, 9 Jun 2022 01:31:50 GMT, Haomin wrote: >> static void test_fun(byte[] a0, int[] b0, byte[] c0) { >> for (int i=0; i> c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); >> } >> } >> >> >> when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. >> >> It's executed on x86 would create an assert error. >> >> >> # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 >> # assert(false) failed: not supported: byte >> >> >> RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. 
> > Haomin has updated the pull request incrementally with one additional commit since the last revision: > > delete assert(false, "not supported: %s", type2name(bt)) > Since `"VectorNode::implemented"` is only used for auto-vect, I think we can do the check in `"VectorNode::is_vector_rotate_supported" `and return false for byte and short type. >I think we can remove unsupported type assertion from VectorNode::is_vector_rotate_supported since there is no direct x86 vector rotate instruction. Thanks for your suggestion. I just delete the assertion and not change auto-vectorizer. Please review. ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From eliu at openjdk.java.net Thu Jun 9 02:00:37 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Thu, 9 Jun 2022 02:00:37 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v4] In-Reply-To: <1GiKCDIfJmLCG_QEwYNsd75Y2IfXxrdR8vs0iHiDc8Y=.e2a170bd-5626-4c79-9fd9-c627f3cd387f@github.com> References: <1GiKCDIfJmLCG_QEwYNsd75Y2IfXxrdR8vs0iHiDc8Y=.e2a170bd-5626-4c79-9fd9-c627f3cd387f@github.com> Message-ID: On Thu, 9 Jun 2022 01:31:50 GMT, Haomin wrote: >> static void test_fun(byte[] a0, int[] b0, byte[] c0) { >> for (int i=0; i> c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); >> } >> } >> >> >> when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. >> >> It's executed on x86 would create an assert error. >> >> >> # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 >> # assert(false) failed: not supported: byte >> >> >> RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. > > Haomin has updated the pull request incrementally with one additional commit since the last revision: > > delete assert(false, "not supported: %s", type2name(bt)) Marked as reviewed by eliu (Author). 
------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From xgong at openjdk.java.net Thu Jun 9 02:00:38 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 9 Jun 2022 02:00:38 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v4] In-Reply-To: <1GiKCDIfJmLCG_QEwYNsd75Y2IfXxrdR8vs0iHiDc8Y=.e2a170bd-5626-4c79-9fd9-c627f3cd387f@github.com> References: <1GiKCDIfJmLCG_QEwYNsd75Y2IfXxrdR8vs0iHiDc8Y=.e2a170bd-5626-4c79-9fd9-c627f3cd387f@github.com> Message-ID: On Thu, 9 Jun 2022 01:31:50 GMT, Haomin wrote: >> static void test_fun(byte[] a0, int[] b0, byte[] c0) { >> for (int i=0; i> c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); >> } >> } >> >> >> when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. >> >> It's executed on x86 would create an assert error. >> >> >> # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 >> # assert(false) failed: not supported: byte >> >> >> RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. > > Haomin has updated the pull request incrementally with one additional commit since the last revision: > > delete assert(false, "not supported: %s", type2name(bt)) LGTM! Thanks! ------------- Marked as reviewed by xgong (Committer). PR: https://git.openjdk.java.net/jdk/pull/8740 From jbhateja at openjdk.java.net Thu Jun 9 02:12:37 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 9 Jun 2022 02:12:37 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v6] In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 16:26:58 GMT, Sandhya Viswanathan wrote: >> Currently the C2 JIT only supports float -> int and double -> long conversion for x86. 
>> This PR adds the support for following conversions in the c2 JIT: >> float -> long, short, byte >> double -> int, short, byte >> >> The performance gain is as follows. >> Before the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 32367.971 ± 6161.118 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 25825.251 ± 5417.104 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59641.958 ± 17307.177 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 29641.505 ± 12023.015 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 16271.224 ± 1523.083 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59199.994 ± 14357.959 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 17169.197 ± 1738.273 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 14934.139 ± 2329.253 ops/ms >> >> After the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 115436.659 ± 21282.364 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 87194.395 ± 9443.106 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59652.356 ± 7240.721 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 110570.719 ± 10401.620 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ± 11113.137 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ± 18272.495 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ± 7249.268 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ± 8253.232 ops/ms >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Review commit resolution Marked as reviewed by jbhateja (Committer).
------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From jbhateja at openjdk.java.net Thu Jun 9 02:12:38 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 9 Jun 2022 02:12:38 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v6] In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 16:31:55 GMT, Sandhya Viswanathan wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> Review commit resolution > > @jatin-bhateja I have implemented your review comments. Please take a look. Thanks @sviswa7 ------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From jbhateja at openjdk.java.net Thu Jun 9 02:12:41 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 9 Jun 2022 02:12:41 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v5] In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 16:28:54 GMT, Sandhya Viswanathan wrote: >> src/hotspot/cpu/x86/x86.ad line 7388: >> >>> 7386: case T_BYTE: >>> 7387: __ evpmovsqd($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc); >>> 7388: __ evpmovdb($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc); >> >> Sub-word handling can be extended for AVX2 using a packing instruction sequence similar to VectorStoreMask for quadword lanes. > > D2X in general needs AVX-512 due to evcvttpd2qq. Yes, but can't we take the D2F route to cast to the integer and subword ranges? D2X -> D2F(AVX2) -> F2X.
------------- PR: https://git.openjdk.java.net/jdk/pull/9032 From jiefu at openjdk.java.net Thu Jun 9 02:31:35 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 9 Jun 2022 02:31:35 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v4] In-Reply-To: <1GiKCDIfJmLCG_QEwYNsd75Y2IfXxrdR8vs0iHiDc8Y=.e2a170bd-5626-4c79-9fd9-c627f3cd387f@github.com> References: <1GiKCDIfJmLCG_QEwYNsd75Y2IfXxrdR8vs0iHiDc8Y=.e2a170bd-5626-4c79-9fd9-c627f3cd387f@github.com> Message-ID: <9bQmd5imua1mEnyhI9ocewI60VVFuPVtx-FlbtiIAdQ=.0fa360bf-1b7d-4e30-96c7-bd22787e3100@github.com> On Thu, 9 Jun 2022 01:31:50 GMT, Haomin wrote: >> static void test_fun(byte[] a0, int[] b0, byte[] c0) { >> for (int i=0; i> c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); >> } >> } >> >> >> when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. >> >> It's executed on x86 would create an assert error. >> >> >> # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 >> # assert(false) failed: not supported: byte >> >> >> RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. > > Haomin has updated the pull request incrementally with one additional commit since the last revision: > > delete assert(false, "not supported: %s", type2name(bt)) test/hotspot/jtreg/compiler/vectorization/TestRotateByteAndShortVector.java line 2: > 1: /* > 2: * Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved. Why you need this copyright? Maybe, we can remove it. test/hotspot/jtreg/compiler/vectorization/TestRotateByteAndShortVector.java line 3: > 1: /* > 2: * Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved. > 3: * Copyright (c) 2022, Loongson Technology. All rights reserved. 
The copyright line isn't the same with the one in `compiler/compilercontrol/CompilationModeHighOnlyTest.java`. ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From jbhateja at openjdk.java.net Thu Jun 9 02:35:31 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 9 Jun 2022 02:35:31 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v4] In-Reply-To: <1GiKCDIfJmLCG_QEwYNsd75Y2IfXxrdR8vs0iHiDc8Y=.e2a170bd-5626-4c79-9fd9-c627f3cd387f@github.com> References: <1GiKCDIfJmLCG_QEwYNsd75Y2IfXxrdR8vs0iHiDc8Y=.e2a170bd-5626-4c79-9fd9-c627f3cd387f@github.com> Message-ID: On Thu, 9 Jun 2022 01:31:50 GMT, Haomin wrote: >> static void test_fun(byte[] a0, int[] b0, byte[] c0) { >> for (int i=0; i> c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); >> } >> } >> >> >> when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. >> >> It's executed on x86 would create an assert error. >> >> >> # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 >> # assert(false) failed: not supported: byte >> >> >> RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. 
> > Haomin has updated the pull request incrementally with one additional commit since the last revision: > > delete assert(false, "not supported: %s", type2name(bt)) test/hotspot/jtreg/compiler/vectorization/TestRotateByteAndShortVector.java line 27: > 25: /** > 26: * @test > 27: * @bug 8286847 Kindly mark the test with @key randomness ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From fgao at openjdk.java.net Thu Jun 9 02:54:15 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 9 Jun 2022 02:54:15 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v8] In-Reply-To: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: > After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like: > int <-> double > float <-> long > int <-> long > float <-> double > > A typical test case: > > int[] a; > double[] b; > for (int i = start; i < limit; i++) { > b[i] = (double) a[i]; > } > > Our expected OptoAssembly code for one iteration is like below: > > add R12, R2, R11, LShiftL #2 > vector_load V16,[R12, #16] > vectorcast_i2d V16, V16 # convert I to D vector > add R11, R1, R11, LShiftL #3 # ptr > add R13, R11, #16 # ptr > vector_store [R13], V16 > > To enable the vectorization, the patch solves the following problems in the SLP. > > There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. 
However, if we put these packed nodes into a vector node sequence, such as loading 4 elements into a vector, then typecasting 2 elements and lastly storing these 2 elements, the sequence becomes invalid. As a result, we should look through the whole def-use chain
> and then pick the minimum of these element counts, as the function SuperWord::max_vector_size_in_ud_chain() does in superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate a valid vector node sequence: loading 2 elements, converting the 2 elements to another type and storing the 2 elements with the new type.
>
> After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to this situation.
>
> In SLP, we calculate a kind of alignment as a position trace for each scalar node in the whole vector. In this case, the alignments for the 2 LoadI nodes are 0 and 4, while the alignments for the 2 ConvI2D nodes are 0 and 8. Sometimes, 4 for LoadI and 8 for ConvI2D mean the same thing: both mark the node as the second node in the whole vector, and the difference between 4 and 8 is only due to their own data sizes. In this situation, we should try to remove the impact caused by different data sizes in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining whether it is possible to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data sizes by transforming the target alignment from the use node. We believe that, assuming the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well.
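
The two ideas described above (one lane count for the whole def-use chain, and alignments rescaled by element size) can be sketched in plain Java. This is only an illustrative model under stated assumptions: the class and method names are hypothetical and are not taken from superword.cpp, and the arithmetic mirrors the examples in the description rather than the actual C2 code.

```java
import java.util.List;

// Hypothetical model; none of these names exist in HotSpot.
final class SlpPackSketch {
    // Number of lanes a vector of vectorBytes can hold for one element size.
    static int lanesFor(int vectorBytes, int elemBytes) {
        return vectorBytes / elemBytes;
    }

    // Smallest lane count over every element size seen on the def-use chain,
    // so that every node in the chain can be packed with the same lane count.
    static int lanesForChain(int vectorBytes, List<Integer> elemBytesOnChain) {
        int lanes = Integer.MAX_VALUE;
        for (int elemBytes : elemBytesOnChain) {
            lanes = Math.min(lanes, lanesFor(vectorBytes, elemBytes));
        }
        return lanes;
    }

    // Rescale a use node's byte alignment to the def node's element size,
    // so that both alignments describe the same lane position.
    static int defAlignmentFor(int useAlignment, int useElemBytes, int defElemBytes) {
        return useAlignment / useElemBytes * defElemBytes;
    }

    public static void main(String[] args) {
        // LoadI (4 bytes), ConvI2D (8-byte result), StoreD (8 bytes), 128-bit vector.
        System.out.println(lanesForChain(16, List.of(4, 8, 8))); // prints 2
        // ConvI2D alignments 16 and 24 correspond to LoadI alignments 8 and 12.
        System.out.println(defAlignmentFor(16, 8, 4));           // prints 8
        System.out.println(defAlignmentFor(24, 8, 4));           // prints 12
    }
}
```

With a 128-bit vector the chain is limited to 2 lanes by the 8-byte elements, which matches the "2 LoadI, 2 ConvI2D and 2 StoreD" packing in the description.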
>
> Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use().
>
> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.
>
> Here is the test data (-XX:+UseSuperWord) on NEON:
>
> Before the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  216.431 ±  0.131  ns/op
> convertD2I       523  avgt   15  220.522 ±  0.311  ns/op
> convertF2D       523  avgt   15  217.034 ±  0.292  ns/op
> convertF2L       523  avgt   15  231.634 ±  1.881  ns/op
> convertI2D       523  avgt   15  229.538 ±  0.095  ns/op
> convertI2L       523  avgt   15  214.822 ±  0.131  ns/op
> convertL2F       523  avgt   15  230.188 ±  0.217  ns/op
> convertL2I       523  avgt   15  162.234 ±  0.235  ns/op
>
> After the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  124.352 ±  1.079  ns/op
> convertD2I       523  avgt   15  557.388 ±  8.166  ns/op
> convertF2D       523  avgt   15  118.082 ±  4.026  ns/op
> convertF2L       523  avgt   15  225.810 ± 11.180  ns/op
> convertI2D       523  avgt   15  166.247 ±  0.120  ns/op
> convertI2L       523  avgt   15  119.699 ±  2.925  ns/op
> convertL2F       523  avgt   15  220.847 ±  0.053  ns/op
> convertL2I       523  avgt   15  122.339 ±  2.738  ns/op
>
> perf data on X86:
> Before the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  279.466 ±  0.069  ns/op
> convertD2I       523  avgt   15  551.009 ±  7.459  ns/op
> convertF2D       523  avgt   15  276.066 ±  0.117  ns/op
> convertF2L       523  avgt   15  545.108 ±  5.697  ns/op
> convertI2D       523  avgt   15  745.303 ±  0.185  ns/op
> convertI2L       523  avgt   15  260.878 ±  0.044  ns/op
> convertL2F       523  avgt   15  502.016 ±  0.172  ns/op
> convertL2I       523  avgt   15  261.654 ±  3.326  ns/op
>
> After the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  106.975 ±  0.045  ns/op
> convertD2I       523  avgt   15  546.866 ±  9.287  ns/op
> convertF2D       523  avgt   15   82.414 ±  0.340  ns/op
> convertF2L       523  avgt   15  542.235 ±  2.785  ns/op
> convertI2D       523  avgt   15   92.966 ±  1.400  ns/op
> convertI2L       523  avgt   15   79.960 ±  0.528  ns/op
> convertL2F       523  avgt   15  504.712 ±  4.794  ns/op
> convertL2I       523  avgt   15  129.753 ±  0.094  ns/op
>
> perf data on AVX512:
> Before the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  282.984 ±  4.022  ns/op
> convertD2I       523  avgt   15  543.080 ±  3.873  ns/op
> convertF2D       523  avgt   15  273.950 ±  0.131  ns/op
> convertF2L       523  avgt   15  539.568 ±  2.747  ns/op
> convertI2D       523  avgt   15  745.238 ±  0.069  ns/op
> convertI2L       523  avgt   15  260.935 ±  0.169  ns/op
> convertL2F       523  avgt   15  501.870 ±  0.359  ns/op
> convertL2I       523  avgt   15  257.508 ±  0.174  ns/op
>
> After the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15   76.687 ±  0.530  ns/op
> convertD2I       523  avgt   15  545.408 ±  4.657  ns/op
> convertF2D       523  avgt   15  273.935 ±  0.099  ns/op
> convertF2L       523  avgt   15  540.534 ±  3.032  ns/op
> convertI2D       523  avgt   15  745.234 ±  0.053  ns/op
> convertI2L       523  avgt   15  260.865 ±  0.104  ns/op
> convertL2F       523  avgt   15   63.834 ±  4.777  ns/op
> convertL2I       523  avgt   15   48.183 ±  0.990  ns/op

Fei Gao has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 11 commits: - Update to the latest JDK and fix the function name Change-Id: Ie1907f86e2df7051aa2ddb7e5b05a371e887d1bc - Merge branch 'master' into fg8283091 Change-Id: I3ef746178c07004cc34c22081a3044fb40e87702 - Add assertion line for opcode() and withdraw some common code as a function Change-Id: I7b5dbe60fec6979de454f347d074e6fc01126dfe - Merge branch 'master' into fg8283091 Change-Id: I42bec08da55e86fb1f049bb691138f3fcf6dbed6 - Implement an interface for auto-vectorization to consult supported match rules Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701 - Merge branch 'master' into fg8283091 Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd - Merge branch 'master' into fg8283091 Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f - Merge branch 'master' into fg8283091 Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 - Add micro-benchmark cases Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8 - Merge branch 'master' into fg8283091 Change-Id: I674581135fd0844accc65520574fcef161eededa - ... and 1 more: https://git.openjdk.java.net/jdk/compare/230726ea...0d731bb2 ------------- Changes: https://git.openjdk.java.net/jdk/pull/7806/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7806&range=07 Stats: 1282 lines in 22 files changed: 1223 ins; 13 del; 46 mod Patch: https://git.openjdk.java.net/jdk/pull/7806.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7806/head:pull/7806 PR: https://git.openjdk.java.net/jdk/pull/7806 From fgao at openjdk.java.net Thu Jun 9 02:54:16 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 9 Jun 2022 02:54:16 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v7] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: On Mon, 6 Jun 2022 19:38:14 GMT, Vladimir Kozlov wrote: > May be we should rename it to `match_rule_supported_superword`. 
That is how we call auto-vectorizer in C2. Done. And update after https://github.com/openjdk/jdk/pull/8877. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From duke at openjdk.java.net Thu Jun 9 02:59:39 2022 From: duke at openjdk.java.net (Haomin) Date: Thu, 9 Jun 2022 02:59:39 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v5] In-Reply-To: References: Message-ID: > static void test_fun(byte[] a0, int[] b0, byte[] c0) { > for (int i=0; i c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); > } > } > > > when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. > > It's executed on x86 would create an assert error. > > > # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 > # assert(false) failed: not supported: byte > > > RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. 
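
The point about int promotion in the description above can be checked with a small standalone Java sketch. It compares what the Java expression in the report computes with a rotate applied directly to an 8-bit lane. The class name, method names and the 8-bit lane model are hypothetical illustrations and are not taken from the patch or from any backend.

```java
// Hypothetical illustration; not code from the pull request.
final class ByteRotateSketch {
    // What the Java source computes: the byte is promoted to int, the int is
    // rotated (int shift counts are masked to 5 bits, so ">>> -7" is ">>> 25"),
    // and the result is narrowed back to byte.
    static byte javaSemantics(byte b) {
        return (byte) (b << 7 | b >>> -7);
    }

    // Assumed model of a rotate performed directly on an 8-bit lane.
    static byte laneRotateLeft(byte b, int n) {
        int v = b & 0xFF; // treat the lane as 8 unsigned bits
        n &= 7;           // rotate distance modulo the lane width
        return (byte) ((v << n) | (v >>> (8 - n)));
    }

    public static void main(String[] args) {
        byte b = 2;
        System.out.println(javaSemantics(b));                // prints 0
        System.out.println((byte) Integer.rotateLeft(b, 7)); // prints 0
        System.out.println(laneRotateLeft(b, 7));            // prints 1
    }
}
```

For the input 2 the promoted int rotate narrows to 0 while the 8-bit lane rotate gives 1, which is the kind of mismatch between interpreter and vectorized results that the report describes.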
Haomin has updated the pull request incrementally with one additional commit since the last revision: change copyright, add @key ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8740/files - new: https://git.openjdk.java.net/jdk/pull/8740/files/bbc65c8e..c26d8782 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8740&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8740&range=03-04 Stats: 3 lines in 1 file changed: 1 ins; 1 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8740.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8740/head:pull/8740 PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Thu Jun 9 03:01:59 2022 From: duke at openjdk.java.net (Haomin) Date: Thu, 9 Jun 2022 03:01:59 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v4] In-Reply-To: <9bQmd5imua1mEnyhI9ocewI60VVFuPVtx-FlbtiIAdQ=.0fa360bf-1b7d-4e30-96c7-bd22787e3100@github.com> References: <1GiKCDIfJmLCG_QEwYNsd75Y2IfXxrdR8vs0iHiDc8Y=.e2a170bd-5626-4c79-9fd9-c627f3cd387f@github.com> <9bQmd5imua1mEnyhI9ocewI60VVFuPVtx-FlbtiIAdQ=.0fa360bf-1b7d-4e30-96c7-bd22787e3100@github.com> Message-ID: <3xRWUhjhs6vL49AqBgjiqebnhosR5rixKtLm_eycfHA=.442e94f1-e24d-4a81-9c5c-b0e58e57bf80@github.com> On Thu, 9 Jun 2022 02:27:43 GMT, Jie Fu wrote: >> Haomin has updated the pull request incrementally with one additional commit since the last revision: >> >> delete assert(false, "not supported: %s", type2name(bt)) > > test/hotspot/jtreg/compiler/vectorization/TestRotateByteAndShortVector.java line 2: > >> 1: /* >> 2: * Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved. > > Why you need this copyright? > Maybe, we can remove it. Yes, just remove it. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Thu Jun 9 03:02:00 2022 From: duke at openjdk.java.net (Haomin) Date: Thu, 9 Jun 2022 03:02:00 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v4] In-Reply-To: References: <1GiKCDIfJmLCG_QEwYNsd75Y2IfXxrdR8vs0iHiDc8Y=.e2a170bd-5626-4c79-9fd9-c627f3cd387f@github.com> Message-ID: On Thu, 9 Jun 2022 02:32:03 GMT, Jatin Bhateja wrote: >> Haomin has updated the pull request incrementally with one additional commit since the last revision: >> >> delete assert(false, "not supported: %s", type2name(bt)) > > test/hotspot/jtreg/compiler/vectorization/TestRotateByteAndShortVector.java line 27: > >> 25: /** >> 26: * @test >> 27: * @bug 8286847 > > Kindly mark the test with @key randomness DONE ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From jiefu at openjdk.java.net Thu Jun 9 03:08:34 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 9 Jun 2022 03:08:34 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v5] In-Reply-To: References: Message-ID: <5iXKKZoxx6dnewvySxW2aCevxJFrbH76Ig6mT4-p6Xo=.446391a0-a2f5-40e8-8a5d-10e1addefeb0@github.com> On Thu, 9 Jun 2022 02:59:39 GMT, Haomin wrote: >> static void test_fun(byte[] a0, int[] b0, byte[] c0) { >> for (int i=0; i> c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); >> } >> } >> >> >> when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. >> >> It's executed on x86 would create an assert error. >> >> >> # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 >> # assert(false) failed: not supported: byte >> >> >> RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. 
>
> Haomin has updated the pull request incrementally with one additional commit since the last revision:
>
>   change copyright, add @key

LGTM
Thanks for the update.

-------------

Marked as reviewed by jiefu (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8740

From duke at openjdk.java.net Thu Jun 9 03:08:36 2022
From: duke at openjdk.java.net (Haomin)
Date: Thu, 9 Jun 2022 03:08:36 GMT
Subject: RFR: 8286847: Rotate vectors don't support byte or short [v4]
In-Reply-To: <9bQmd5imua1mEnyhI9ocewI60VVFuPVtx-FlbtiIAdQ=.0fa360bf-1b7d-4e30-96c7-bd22787e3100@github.com>
References: <1GiKCDIfJmLCG_QEwYNsd75Y2IfXxrdR8vs0iHiDc8Y=.e2a170bd-5626-4c79-9fd9-c627f3cd387f@github.com> <9bQmd5imua1mEnyhI9ocewI60VVFuPVtx-FlbtiIAdQ=.0fa360bf-1b7d-4e30-96c7-bd22787e3100@github.com>
Message-ID:

On Thu, 9 Jun 2022 02:26:49 GMT, Jie Fu wrote:

>> Haomin has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   delete assert(false, "not supported: %s", type2name(bt))
>
> test/hotspot/jtreg/compiler/vectorization/TestRotateByteAndShortVector.java line 3:
>
>> 1: /*
>> 2:  * Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved.
>> 3:  * Copyright (c) 2022, Loongson Technology. All rights reserved.
>
> The copyright line isn't the same as the one in `compiler/compilercontrol/CompilationModeHighOnlyTest.java`.

I have changed it.
------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From jbhateja at openjdk.java.net Thu Jun 9 03:19:29 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 9 Jun 2022 03:19:29 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v5] In-Reply-To: References: Message-ID: On Thu, 9 Jun 2022 02:59:39 GMT, Haomin wrote: >> static void test_fun(byte[] a0, int[] b0, byte[] c0) { >> for (int i=0; i> c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); >> } >> } >> >> >> when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. >> >> It's executed on x86 would create an assert error. >> >> >> # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 >> # assert(false) failed: not supported: byte >> >> >> RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. > > Haomin has updated the pull request incrementally with one additional commit since the last revision: > > change copyright, add @key Thanks @haominw ------------- Marked as reviewed by jbhateja (Committer). PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Thu Jun 9 04:04:30 2022 From: duke at openjdk.java.net (Haomin) Date: Thu, 9 Jun 2022 04:04:30 GMT Subject: Integrated: 8286847: Rotate vectors don't support byte or short In-Reply-To: References: Message-ID: <8pfvKieQH55kA3Phnihd2Px0yGY4vFy9FOx6wee49K4=.058d1571-b074-4cc7-aaf1-3f2c316c064d@github.com> On Tue, 17 May 2022 03:09:12 GMT, Haomin wrote: > static void test_fun(byte[] a0, int[] b0, byte[] c0) { > for (int i=0; i c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); > } > } > > > when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. > > It's executed on x86 would create an assert error. 
> > > # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 > # assert(false) failed: not supported: byte > > > RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. This pull request has now been integrated. Changeset: 3419beec Author: wanghaomin Committer: Jie Fu URL: https://git.openjdk.java.net/jdk/commit/3419beec7fa646ab30f55ac27fdb47c4c1e1e764 Stats: 157 lines in 2 files changed: 156 ins; 1 del; 0 mod 8286847: Rotate vectors don't support byte or short Reviewed-by: eliu, xgong, jiefu, jbhateja ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Thu Jun 9 04:39:35 2022 From: duke at openjdk.java.net (Yuta Sato) Date: Thu, 9 Jun 2022 04:39:35 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries [v4] In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 01:10:12 GMT, Yuta Sato wrote: >> When failing to load hsdis(Hot Spot Disassembler) library (because there is no library or hsdis.so is old and so on), >> there is no warning message (only can see info level messages if put -Xlog:os=info). >> This should show a warning message to tell the user that you failed to load libraries for hsdis. >> So I put a warning message to notify this. >> >> e.g. >> ` > > Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: > > change warning message PING: Could anyone review this? 
------------- PR: https://git.openjdk.java.net/jdk/pull/8782 From shade at openjdk.java.net Thu Jun 9 05:50:32 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 9 Jun 2022 05:50:32 GMT Subject: RFR: 8287493: 32-bit Windows build failure in codeBlob.cpp after JDK-8283689 In-Reply-To: <70DP3JRe1qkz7bsQudwx0bMcacs1N7IVm1uVXpdgoaQ=.87a30276-6775-4fb8-bf9b-21f2dc2788a1@github.com> References: <70DP3JRe1qkz7bsQudwx0bMcacs1N7IVm1uVXpdgoaQ=.87a30276-6775-4fb8-bf9b-21f2dc2788a1@github.com> Message-ID: On Wed, 8 Jun 2022 16:50:22 GMT, Aleksey Shipilev wrote: > See the bug report for example build failure. This breakage is widely seen on many build servers. The fix follows what [https://bugs.openjdk.org/browse/JDK-8210803](JDK-8210803) did for other blobs. > > Additional testing: > - [x] Windows x86_32 fastdebug build > - [x] Linux x86_64 fastdebug `java/foreign` > - [x] Windows x86_32 fastdebug `java/foreign` Thank you for reviews! ------------- PR: https://git.openjdk.java.net/jdk/pull/9088 From shade at openjdk.java.net Thu Jun 9 05:53:26 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 9 Jun 2022 05:53:26 GMT Subject: Integrated: 8287493: 32-bit Windows build failure in codeBlob.cpp after JDK-8283689 In-Reply-To: <70DP3JRe1qkz7bsQudwx0bMcacs1N7IVm1uVXpdgoaQ=.87a30276-6775-4fb8-bf9b-21f2dc2788a1@github.com> References: <70DP3JRe1qkz7bsQudwx0bMcacs1N7IVm1uVXpdgoaQ=.87a30276-6775-4fb8-bf9b-21f2dc2788a1@github.com> Message-ID: <5lg_zQ5FNplb5nhPUZhhgx5O-joZ8w0Tqa6qe39-RZY=.b2522f78-b115-4e22-af76-f1e303eb22bd@github.com> On Wed, 8 Jun 2022 16:50:22 GMT, Aleksey Shipilev wrote: > See the bug report for example build failure. This breakage is widely seen on many build servers. The fix follows what [https://bugs.openjdk.org/browse/JDK-8210803](JDK-8210803) did for other blobs. 
> > Additional testing: > - [x] Windows x86_32 fastdebug build > - [x] Linux x86_64 fastdebug `java/foreign` > - [x] Windows x86_32 fastdebug `java/foreign` This pull request has now been integrated. Changeset: aa2fc54b Author: Aleksey Shipilev URL: https://git.openjdk.java.net/jdk/commit/aa2fc54b61ad84cc6faa80efa3bd3097adbbc422 Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod 8287493: 32-bit Windows build failure in codeBlob.cpp after JDK-8283689 Reviewed-by: kvn, zgu, alanb, jvernee ------------- PR: https://git.openjdk.java.net/jdk/pull/9088 From xgong at openjdk.java.net Thu Jun 9 06:52:38 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 9 Jun 2022 06:52:38 GMT Subject: RFR: 8287028: AArch64: [vectorapi] Backend implementation of VectorMask.fromLong with SVE2 [v2] In-Reply-To: References: <9f4FuUVXKxeO6tC6so96ydn3nss81T7s0KvV03XlnCc=.75152f52-5b9f-4a84-bd36-0547899fa061@github.com> Message-ID: <34goOfG3-pLQKS0rzD8fCRhzbmlgkW4TlM_920F8BYo=.dc3d5f8f-a8b1-4ecd-a75c-9196a4e9f7c4@github.com> On Thu, 2 Jun 2022 08:24:56 GMT, Eric Liu wrote: >> This patch implements AArch64 codegen for VectorLongToMask using the >> SVE2 BitPerm feature. With this patch, the final code (generated on an >> SVE vector reg size of 512-bit QEMU emulator) is shown as below: >> >> mov z17.b, #0 >> mov v17.d[0], x13 >> sunpklo z17.h, z17.b >> sunpklo z17.s, z17.h >> sunpklo z17.d, z17.s >> mov z16.b, #1 >> bdep z17.d, z17.d, z16.d >> cmpne p0.b, p7/z, z17.b, #0 > > Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - Merge jdk:master > > Change-Id: I7cea9b028f60c447f7cc24a00d38f59e0f07ecd3 > - AArch64: [vectorapi] Backend implementation of VectorMask.fromLong with SVE2 > > This patch implements AArch64 codegen for VectorLongToMask using the > SVE2 BitPerm feature. 
With this patch, the final code (generated on an > SVE vector reg size of 512-bit QEMU emulator) is shown as below: > > mov z17.b, #0 > mov v17.d[0], x13 > sunpklo z17.h, z17.b > sunpklo z17.s, z17.h > sunpklo z17.d, z17.s > mov z16.b, #1 > bdep z17.d, z17.d, z16.d > cmpne p0.b, p7/z, z17.b, #0 > > Change-Id: I9135fce39c8a08c72b757c78b258f5d968baa7ff LGTM! ------------- Marked as reviewed by xgong (Committer). PR: https://git.openjdk.java.net/jdk/pull/8789 From bulasevich at openjdk.java.net Thu Jun 9 08:20:27 2022 From: bulasevich at openjdk.java.net (Boris Ulasevich) Date: Thu, 9 Jun 2022 08:20:27 GMT Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls In-Reply-To: References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> Message-ID: On Wed, 8 Jun 2022 16:22:07 GMT, Evgeny Astigeevich wrote: >> It is need for `mark_off`. I'll move it there. > > Or we can keep it and the code will be the same as in x86_64.ad. > What do you think? agree ------------- PR: https://git.openjdk.java.net/jdk/pull/8816 From chagedorn at openjdk.java.net Thu Jun 9 08:49:39 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 9 Jun 2022 08:49:39 GMT Subject: RFR: 8287525: Extend IR annotation with new options to test specific target feature. [v3] In-Reply-To: References: Message-ID: <2daUhb4sxxt35gxJAhWhJZJJfuYg3Z-srPgtIgafRqM=.ab110021-f967-4142-a22e-7d98e0c235ee@github.com> On Wed, 8 Jun 2022 12:57:31 GMT, Swati Sharma wrote: >> Hi All, >> >> Currently test invocations are guarded by @requires vm.cpu.feature tags which are specified as the part of test tag specifications. This results into generating multiple test cases if some test points in a test file needs to be guarded by a specific features while others should still be executed in absence of missing target feature. 
>>
>> This is especially important for IR-check-based validation, since C2 IR node creation may rely heavily on the existence of a specific target feature. Also, the test harness executes test points only if all the constraints specified in the tag specifications are met, so imposing OR semantics between @requires-tag-based CPU features becomes tricky.
>>
>> The patch extends the existing @IR annotation with the following two new options:
>>
>> - applyIfTargetFeatureAnd:
>> Accepts a list of feature pairs, where each pair is composed of a target feature string followed by a true/false value; a true value requires the target feature to be present, and a false value requires it to be absent. IR verification checks are enforced only if all the specified feature constraints are met.
>> - applyIfTargetFeatureOr: Accepts the same kind of arguments as the option above, but IR verification checks are enforced when at least one of the specified feature constraints is met.
>>
>> Example usage:
>> @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureOr = {"avx512bw", "true", "avx512f", "true"})
>> @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureAnd = {"avx512bw", "true", "avx512f", "true"})
>>
>> Please review and share your feedback.
>>
>> Thanks,
>> Swati
>
> Swati Sharma has updated the pull request incrementally with two additional commits since the last revision:
>
>  - Merge branch 'JDK-8287525' of https://github.com/swati-sha/jdk into JDK-8287525
>  - 8287525: Review comments resolved.

Changes requested by chagedorn (Reviewer).

test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 191:

> 189:     applyRules++;
> 190:     TestFormat.checkNoThrow((irAnno.applyIfCPUFeature().length % 2) == 0,
> 191:         "Argument count for applyIfCPUFeature should be multiple of two" + failAt());

You should also check that `applyIfCPUFeature` is only applied for one constraint and `applyIfCPUFeatureAnd` for two or more constraints. This follows the format checks for the flag constraints.
This flag constraint design, however, is not very satisfying and has some redundancy. I'm planning to rework this with JDK-8280120.

test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 201:

> 199:     applyRules++;
> 200:     TestFormat.checkNoThrow((irAnno.applyIfCPUFeatureOr().length % 2) == 0,
> 201:         "Argument count for applyIfCPUFeatureOr should be multiple of two" + failAt());

The format check on L208 cannot be used like that anymore (for some reason I cannot comment on lines that are hidden):

    TestFormat.checkNoThrow(applyRules <= 1, "Can only specify one apply constraint " + failAt());

We should be allowed to specify one `applyIf*` flag constraint together with one `applyIfCPUFeature*` constraint, while not being allowed to specify multiple `applyIf*` flag constraints and/or multiple `applyIfCPUFeature*` constraints.

test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 242:

> 240:     }
> 241:
> 242:     private boolean hasAllRequiredCPUFeature(String[] andRules, String ruleType) {

The parameter `ruleType` is not used anymore:
Suggestion:

    private boolean hasAllRequiredCPUFeature(String[] andRules) {

test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 253:

> 251:     }
> 252:
> 253:     private boolean hasAnyRequiredCPUFeature(String[] orRules, String ruleType) {

The parameter `ruleType` is not used anymore:
Suggestion:

    private boolean hasAnyRequiredCPUFeature(String[] orRules) {

test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 302:

> 300:     // feature is supported by KNL target.
> 301: if (isKNLFlagEnabled == null || > 302: (isKNLFlagEnabled.booleanValue() && (!knlFeatureSet.contains(feature.toUpperCase()) || falseValue))) { Boxing is not required: Suggestion: (isKNLFlagEnabled && (!knlFeatureSet.contains(feature.toUpperCase()) || falseValue))) { test/hotspot/jtreg/testlibrary_tests/ir_framework/examples/TestCPUFeatureCheck.java line 38: > 36: */ > 37: > 38: public class TestCPUFeatureCheck { Since this is a full correctness test on emitting `AddVI`, I suggest to move it to the other existing vector IR tests. The tests in `ir_framework.examples` and `ir_framework.tests` are only executed in tier5 and 6 and are more about testing the framework implementation than the actual correctness of C2 transformations. ------------- PR: https://git.openjdk.java.net/jdk/pull/8999 From aph at openjdk.java.net Thu Jun 9 09:43:06 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Thu, 9 Jun 2022 09:43:06 GMT Subject: RFR: 8288078: linux-aarch64-optimized build fails in Tier5 after JDK-8287567 Message-ID: Because PRODUCT is defined in 'optimized' builds, this needs to be surrounded by ASSERT ifdefs, not PRODUCT. 
------------- Commit messages: - 8288078: AArch64: linux-aarch64-optimized build fails in Tier5 after JDK-8287567 - Merge https://github.com/openjdk/jdk into new-master - Merge https://github.com/openjdk/jdk into new-master - Merge branch 'master' of https://github.com/openjdk/jdk into new-master - Merge https://github.com/openjdk/jdk into new-master - Revert "8287091: aarch64 : guarantee(val < (1ULL << nbits)) failed: Field too big for insn" - 8287091: aarch64 : guarantee(val < (1ULL << nbits)) failed: Field too big for insn Changes: https://git.openjdk.java.net/jdk/pull/9103/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9103&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288078 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/9103.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9103/head:pull/9103 PR: https://git.openjdk.java.net/jdk/pull/9103 From mdoerr at openjdk.java.net Thu Jun 9 09:53:24 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Thu, 9 Jun 2022 09:53:24 GMT Subject: RFR: 8287738: [PPC64] jdk/incubator/vector/*VectorTests failing [v2] In-Reply-To: References: Message-ID: > `PopCountVI` needs to handle several types after [JDK-8284960](https://bugs.openjdk.java.net/browse/JDK-8284960). > (See https://github.com/openjdk/jdk/commit/6f6486e97743eadfb20b4175e1b4b2b05b59a17a.) Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Support T_LONG instead of T_DOUBLE. Neither of these types are currently used. 
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8998/files - new: https://git.openjdk.java.net/jdk/pull/8998/files/e439bf3d..b0301536 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8998&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8998&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8998.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8998/head:pull/8998 PR: https://git.openjdk.java.net/jdk/pull/8998 From goetz at openjdk.java.net Thu Jun 9 09:53:24 2022 From: goetz at openjdk.java.net (Goetz Lindenmaier) Date: Thu, 9 Jun 2022 09:53:24 GMT Subject: RFR: 8287738: [PPC64] jdk/incubator/vector/*VectorTests failing [v2] In-Reply-To: References: Message-ID: On Thu, 9 Jun 2022 09:48:33 GMT, Martin Doerr wrote: >> `PopCountVI` needs to handle several types after [JDK-8284960](https://bugs.openjdk.java.net/browse/JDK-8284960). >> (See https://github.com/openjdk/jdk/commit/6f6486e97743eadfb20b4175e1b4b2b05b59a17a.) > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Support T_LONG instead of T_DOUBLE. Neither of these types are currently used. LGTM src/hotspot/cpu/ppc/ppc.ad line 14117: > 14115: __ vpopcntw($dst$$VectorSRegister->to_vr(), $src$$VectorSRegister->to_vr()); > 14116: break; > 14117: case T_DOUBLE: I think this should be T_LONG. ------------- Marked as reviewed by goetz (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8998 From mdoerr at openjdk.java.net Thu Jun 9 09:53:25 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Thu, 9 Jun 2022 09:53:25 GMT Subject: RFR: 8287738: [PPC64] jdk/incubator/vector/*VectorTests failing [v2] In-Reply-To: References: Message-ID: On Thu, 9 Jun 2022 09:23:36 GMT, Goetz Lindenmaier wrote: >> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: >> >> Support T_LONG instead of T_DOUBLE. Neither of these types are currently used. > > src/hotspot/cpu/ppc/ppc.ad line 14117: > >> 14115: __ vpopcntw($dst$$VectorSRegister->to_vr(), $src$$VectorSRegister->to_vr()); >> 14116: break; >> 14117: case T_DOUBLE: > > I think this should be T_LONG. Fixed. Note that neither T_DOUBLE nor T_LONG are currently needed. But I'd like to support T_LONG for future enhancements. ------------- PR: https://git.openjdk.java.net/jdk/pull/8998 From adinn at openjdk.java.net Thu Jun 9 10:10:36 2022 From: adinn at openjdk.java.net (Andrew Dinn) Date: Thu, 9 Jun 2022 10:10:36 GMT Subject: RFR: 8288078: linux-aarch64-optimized build fails in Tier5 after JDK-8287567 In-Reply-To: References: Message-ID: On Thu, 9 Jun 2022 09:33:24 GMT, Andrew Haley wrote: > Because PRODUCT is defined in 'optimized' builds, this needs to be surrounded by ASSERT ifdefs, not PRODUCT. Patch looks good. @theRealAph > Because PRODUCT is defined in 'optimized' builds, this needs to be surrounded by ASSERT ifdefs, not PRODUCT. I am not sure that comment is correct even though I agree the fix is right. I'll provide my explanation just to be sure I have understood the fix. The error is happening because: 1. `PRODUCT` is *not* defined in 'optimized' builds 2. `assert` is defined as an empty macro in 'optimized' builds This means that when doing an 'optimized' build the declaration of `is_movk_to_zr` is included but the uses of it inside the call to `assert` gets elided. The fix works because: 1. 
ASSERT is not defined in 'optimized' builds 2. `assert` is defined as an empty macro in 'optimized' builds 3. A fortiori the call to `assert` is not compiled in anyway. So, with the patch the net result is that both the declaration of `is_movk_to_zr` and its use are omitted in 'optimized' builds. I'm approving the patch on that basis. can you confirm I have understood it correctly? ------------- Marked as reviewed by adinn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/9103 From mdoerr at openjdk.java.net Thu Jun 9 10:15:33 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Thu, 9 Jun 2022 10:15:33 GMT Subject: RFR: 8287738: [PPC64] jdk/incubator/vector/*VectorTests failing [v2] In-Reply-To: References: Message-ID: <3RW5KQfes3mJlZN7--O7ZWA7CJxXkHBCGUEcV8JVbZI=.6d860146-1f6b-436e-8c2c-ee9170c3a620@github.com> On Thu, 9 Jun 2022 09:53:24 GMT, Martin Doerr wrote: >> `PopCountVI` needs to handle several types after [JDK-8284960](https://bugs.openjdk.java.net/browse/JDK-8284960). >> (See https://github.com/openjdk/jdk/commit/6f6486e97743eadfb20b4175e1b4b2b05b59a17a.) > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Support T_LONG instead of T_DOUBLE. Neither of these types are currently used. Thanks for the reviews! ------------- PR: https://git.openjdk.java.net/jdk/pull/8998 From mdoerr at openjdk.java.net Thu Jun 9 10:17:52 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Thu, 9 Jun 2022 10:17:52 GMT Subject: Integrated: 8287738: [PPC64] jdk/incubator/vector/*VectorTests failing In-Reply-To: References: Message-ID: <-qCcTIT8jIXxuR3W9yGeUyWop4Vu8Uo_JrwpopZnll8=.30bffb03-a871-42a7-832d-84006bb08c9a@github.com> On Thu, 2 Jun 2022 16:08:11 GMT, Martin Doerr wrote: > `PopCountVI` needs to handle several types after [JDK-8284960](https://bugs.openjdk.java.net/browse/JDK-8284960). > (See https://github.com/openjdk/jdk/commit/6f6486e97743eadfb20b4175e1b4b2b05b59a17a.) 
This pull request has now been integrated. Changeset: 560e2927 Author: Martin Doerr URL: https://git.openjdk.java.net/jdk/commit/560e2927e380a372effdfe4a7260c3606bf74c8b Stats: 29 lines in 3 files changed: 25 ins; 1 del; 3 mod 8287738: [PPC64] jdk/incubator/vector/*VectorTests failing Reviewed-by: kvn, goetz ------------- PR: https://git.openjdk.java.net/jdk/pull/8998 From aph at openjdk.java.net Thu Jun 9 10:20:44 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Thu, 9 Jun 2022 10:20:44 GMT Subject: RFR: 8288078: linux-aarch64-optimized build fails in Tier5 after JDK-8287567 In-Reply-To: References: Message-ID: On Thu, 9 Jun 2022 09:33:24 GMT, Andrew Haley wrote: > Because neither ASSERT nor PRODUCT are defined in 'optimized' builds , this needs to be surrounded by ASSERT ifdefs, not PRODUCT. > @theRealAph > > > Because PRODUCT is defined in 'optimized' builds, this needs to be surrounded by ASSERT ifdefs, not PRODUCT. > > I am not sure that comment is correct even though I agree the fix is right. I'll provide my explanation just to be sure I have understood the fix. The error is happening because: > 1. `PRODUCT` is _not_ defined in 'optimized' builds > > 2. `assert` is defined as an empty macro in 'optimized' builds Exactly, thanks. I changed my summary text to fit. ------------- PR: https://git.openjdk.java.net/jdk/pull/9103 From aph at openjdk.java.net Thu Jun 9 10:58:02 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Thu, 9 Jun 2022 10:58:02 GMT Subject: RFR: 8287926: AArch64: intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long Message-ID: That's all. 
------------- Commit messages: - 8287926: AArch64: Unsigned div and mod instructions Changes: https://git.openjdk.java.net/jdk/pull/9104/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9104&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287926 Stats: 64 lines in 1 file changed: 64 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/9104.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/9104/head:pull/9104 PR: https://git.openjdk.java.net/jdk/pull/9104 From duke at openjdk.java.net Thu Jun 9 11:07:35 2022 From: duke at openjdk.java.net (Swati Sharma) Date: Thu, 9 Jun 2022 11:07:35 GMT Subject: RFR: 8287525: Extend IR annotation with new options to test specific target feature. [v3] In-Reply-To: <2daUhb4sxxt35gxJAhWhJZJJfuYg3Z-srPgtIgafRqM=.ab110021-f967-4142-a22e-7d98e0c235ee@github.com> References: <2daUhb4sxxt35gxJAhWhJZJJfuYg3Z-srPgtIgafRqM=.ab110021-f967-4142-a22e-7d98e0c235ee@github.com> Message-ID: On Thu, 9 Jun 2022 07:24:12 GMT, Christian Hagedorn wrote: >> Swati Sharma has updated the pull request incrementally with two additional commits since the last revision: >> >> - Merge branch 'JDK-8287525' of https://github.com/swati-sha/jdk into JDK-8287525 >> - 8287525: Review comments resolved. > > test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 201: > >> 199: applyRules++; >> 200: TestFormat.checkNoThrow((irAnno.applyIfCPUFeatureOr().length % 2) == 0, >> 201: "Argument count for applyIfCPUFeatureOr should be multiple of two" + failAt()); > > The format check on L208 cannot be used like that anymore (I can somehow not comment on lines that are hidden): > > TestFormat.checkNoThrow(applyRules <= 1, > "Can only specify one apply constraint " + failAt()); > > We should be allowed to specify one `applyIf*` flag constraint together with one `applyIfCPUFeature*` constraint while not being allowed to specify multiple `applyIf*` flag constraints and/or multiple `applyIfCPUFeature*` constraints. 
Do you suggest commenting out check on L208 to enable application of multiple applyIf* constraints OR should that be handled in a separate RFE. ------------- PR: https://git.openjdk.java.net/jdk/pull/8999 From thartmann at openjdk.java.net Thu Jun 9 11:11:32 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 9 Jun 2022 11:11:32 GMT Subject: RFR: 8288078: linux-aarch64-optimized build fails in Tier5 after JDK-8287567 In-Reply-To: References: Message-ID: On Thu, 9 Jun 2022 09:33:24 GMT, Andrew Haley wrote: > Because neither ASSERT nor PRODUCT are defined in 'optimized' builds , this needs to be surrounded by ASSERT ifdefs, not PRODUCT. Looks reasonable. Would be nice to get this in before the fork later today. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/9103 From aph at openjdk.java.net Thu Jun 9 11:20:24 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Thu, 9 Jun 2022 11:20:24 GMT Subject: RFR: 8288078: linux-aarch64-optimized build fails in Tier5 after JDK-8287567 In-Reply-To: References: Message-ID: On Thu, 9 Jun 2022 11:08:53 GMT, Tobias Hartmann wrote: > Looks reasonable. Would be nice to get this in before the fork later today. I know, but tests are still running. It looks safe to me. ------------- PR: https://git.openjdk.java.net/jdk/pull/9103 From chagedorn at openjdk.java.net Thu Jun 9 11:28:37 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 9 Jun 2022 11:28:37 GMT Subject: RFR: 8287525: Extend IR annotation with new options to test specific target feature. 
[v3] In-Reply-To: References: <2daUhb4sxxt35gxJAhWhJZJJfuYg3Z-srPgtIgafRqM=.ab110021-f967-4142-a22e-7d98e0c235ee@github.com> Message-ID: On Thu, 9 Jun 2022 11:03:54 GMT, Swati Sharma wrote: >> test/hotspot/jtreg/compiler/lib/ir_framework/test/IREncodingPrinter.java line 201: >> >>> 199: applyRules++; >>> 200: TestFormat.checkNoThrow((irAnno.applyIfCPUFeatureOr().length % 2) == 0, >>> 201: "Argument count for applyIfCPUFeatureOr should be multiple of two" + failAt()); >> >> The format check on L208 cannot be used like that anymore (I can somehow not comment on lines that are hidden): >> >> TestFormat.checkNoThrow(applyRules <= 1, >> "Can only specify one apply constraint " + failAt()); >> >> We should be allowed to specify one `applyIf*` flag constraint together with one `applyIfCPUFeature*` constraint while not being allowed to specify multiple `applyIf*` flag constraints and/or multiple `applyIfCPUFeature*` constraints. > > Do you suggest commenting out check on L208 to enable application of multiple applyIf* constraints OR should that be handled in a separate RFE. I think we should keep it as specifying multiple constraint attributes of the same kind (flag or CPU feature) should not be allowed. But we need separate the count (`applyRules`) for the flag based constraints and the CPU feature based constraints. You could rename `applyRules` -> `flagConstraints` and change the ones belonging to the new CPU features to `cpuFeatureConstraints`. 
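The two-counter idea can be sketched in isolation as follows. This is only a minimal illustration, not the IREncodingPrinter code: the class `ConstraintCountSketch`, the helpers `countSpecified` and `check`, and the flag name `SomeFlag` are hypothetical, and an attribute is assumed to count as "specified" when its string array is non-empty.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, self-contained sketch; not part of the IR framework.
final class ConstraintCountSketch {

    /** Counts how many of the given attribute arrays are non-empty, i.e. were specified. */
    static int countSpecified(String[][] attributes) {
        int count = 0;
        for (String[] attribute : attributes) {
            if (attribute.length > 0) {
                count++;
            }
        }
        return count;
    }

    /** Checks flag constraints and CPU feature constraints with separate counters. */
    static List<String> check(String[][] flagAttributes, String[][] cpuFeatureAttributes) {
        int flagConstraints = countSpecified(flagAttributes);
        int cpuFeatureConstraints = countSpecified(cpuFeatureAttributes);
        List<String> violations = new ArrayList<>();
        if (flagConstraints > 1) {
            violations.add("Can only specify one flag constraint");
        }
        if (cpuFeatureConstraints > 1) {
            violations.add("Can only specify one CPU feature constraint");
        }
        return violations;
    }

    public static void main(String[] args) {
        String[] unset = {};
        String[] flag = {"SomeFlag", "true"};      // hypothetical flag constraint
        String[] feature = {"avx512f", "true"};    // CPU feature pair as in the PR description

        // One flag constraint plus one CPU feature constraint: accepted.
        System.out.println(check(new String[][] {flag, unset}, new String[][] {feature, unset}));
        // Two CPU feature constraints on the same rule: rejected.
        System.out.println(check(new String[][] {unset}, new String[][] {feature, feature}));
    }
}
```

With a single shared counter, the first combination in `main` would be rejected as well, which is the behavior the review comment asks to avoid.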
Then you can have these checks here, for example:

TestFormat.checkNoThrow(flagConstraints <= 1, "Can only specify one flag constraint" + failAt());
TestFormat.checkNoThrow(cpuFeatureConstraints <= 1, "Can only specify one CPU feature constraint" + failAt());

-------------

PR: https://git.openjdk.java.net/jdk/pull/8999

From duke at openjdk.java.net Thu Jun 9 11:48:42 2022
From: duke at openjdk.java.net (Swati Sharma)
Date: Thu, 9 Jun 2022 11:48:42 GMT
Subject: RFR: 8287525: Extend IR annotation with new options to test specific target feature. [v4]
In-Reply-To:
References:
Message-ID: <8uc_yjQ281Km7c6u-bjaVY_xG4N6c7L06CV76sGsPC0=.f0ffa0ef-9189-47f2-9b97-a75306d1920a@github.com>

> Hi All,
>
> Currently test invocations are guarded by @requires vm.cpu.feature tags which are specified as part of the test tag specifications. This results in generating multiple test cases if some test points in a test file need to be guarded by specific features while others should still be executed when the target feature is missing.
>
> This is especially important for IR-check-based validation since C2 IR node creation may heavily rely on the existence of a specific target feature. Also, the test harness executes test points only if all the constraints specified in the tag specifications are met, thus imposing an OR semantics between @requires tag based CPU features becomes tricky.
>
> Patch extends existing @IR annotation with following two new options:
>
> - applyIfTargetFeatureAnd:
>   Accepts a list of feature pairs where each pair is composed of a target feature string followed by a true/false value, where a true value necessitates existence of the target feature and vice versa. IR verification checks are enforced only if all the specified feature constraints are met.
> - applyIfTargetFeatureOr: Accepts similar arguments as the above option, but IR verification checks are enforced only when at least one of the specified feature constraints is met.
>
> Example usage:
> @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureOr = {"avx512bw", "true", "avx512f", "true"})
> @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureAnd = {"avx512bw", "true", "avx512f", "true"})
>
> Please review and share your feedback.
>
> Thanks,
> Swati

Swati Sharma has updated the pull request incrementally with one additional commit since the last revision:

  8287525: Resolved review comments.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8999/files
  - new: https://git.openjdk.java.net/jdk/pull/8999/files/c7fd621b..5f02f608

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8999&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8999&range=02-03

  Stats: 7 lines in 2 files changed: 0 ins; 0 del; 7 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8999.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8999/head:pull/8999

PR: https://git.openjdk.java.net/jdk/pull/8999

From redestad at openjdk.java.net Thu Jun 9 13:14:38 2022
From: redestad at openjdk.java.net (Claes Redestad)
Date: Thu, 9 Jun 2022 13:14:38 GMT
Subject: RFR: 8287903: Reduce runtime of java.math microbenchmarks
In-Reply-To:
References:
Message-ID:

On Tue, 7 Jun 2022 12:34:25 GMT, Claes Redestad wrote:

> - Reduce runtime by running fewer forks, fewer iterations, less warmup. All micros tested in this group appear to stabilize very quickly.
> - Refactor BigIntegers to avoid re-running some (most) micros over and over with parameter values that don't affect them.
>
> Expected runtime down from 14 hours to 15 minutes.

Thanks for reviews!
------------- PR: https://git.openjdk.java.net/jdk/pull/9062 From redestad at openjdk.java.net Thu Jun 9 13:14:40 2022 From: redestad at openjdk.java.net (Claes Redestad) Date: Thu, 9 Jun 2022 13:14:40 GMT Subject: Integrated: 8287903: Reduce runtime of java.math microbenchmarks In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 12:34:25 GMT, Claes Redestad wrote: > - Reduce runtime by running fewer forks, fewer iterations, less warmup. All micros tested in this group appear to stabilize very quickly. > - Refactor BigIntegers to avoid re-running some (most) micros over and over with parameter values that don't affect them. > > Expected runtime down from 14 hours to 15 minutes. This pull request has now been integrated. Changeset: 7e948f7c Author: Claes Redestad URL: https://git.openjdk.java.net/jdk/commit/7e948f7ccbb4b9be04f5ecb65cc8dd72e3b495f4 Stats: 104 lines in 4 files changed: 66 ins; 33 del; 5 mod 8287903: Reduce runtime of java.math microbenchmarks Reviewed-by: ecaspole, aph ------------- PR: https://git.openjdk.java.net/jdk/pull/9062 From smonteith at openjdk.java.net Thu Jun 9 13:35:31 2022 From: smonteith at openjdk.java.net (Stuart Monteith) Date: Thu, 9 Jun 2022 13:35:31 GMT Subject: RFR: 8287926: AArch64: intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long In-Reply-To: References: Message-ID: <3nNWjgL7gB2CMoFl9YLB-e4shVn1FUOYJcTipPlhadA=.cf5b78d5-cca1-49c8-97e2-bc07ebafe03c@github.com> On Thu, 9 Jun 2022 10:47:42 GMT, Andrew Haley wrote: > That's all. Apart from the curious unbalanced "(", looks OK to me. src/hotspot/cpu/aarch64/aarch64.ad line 11415: > 11413: ins_cost(INSN_COST * 22); > 11414: format %{ "udivw rscratch1, $src1, $src2\n\t" > 11415: "msubw($dst, rscratch1, $src2, $src1" %} These new strings and the existing signed have this unbalanced "(" . Is there any reason why this is here? 
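For readers unfamiliar with the instruction pair in the quoted format string: `udivw` followed by `msubw` computes the unsigned remainder as src1 - (src1 / src2) * src2. A minimal Java sketch of that identity follows; the class and method names are hypothetical, it relies on the existing `Integer.divideUnsigned` and `Long.divideUnsigned` library methods, and it says nothing about how the intrinsic handles a zero divisor.

```java
// Hypothetical illustration of the divide-then-multiply-subtract remainder identity.
final class UnsignedRemSketch {

    /** 32-bit case: quotient via unsigned divide, then a - q * b (wrapping arithmetic). */
    static int remainderViaDivide(int a, int b) {
        int q = Integer.divideUnsigned(a, b); // corresponds to udivw
        return a - q * b;                     // corresponds to msubw
    }

    /** 64-bit case, same identity. */
    static long remainderViaDivide(long a, long b) {
        long q = Long.divideUnsigned(a, b);
        return a - q * b;
    }

    public static void main(String[] args) {
        int a = -7; // interpreted as an unsigned value, this is 2^32 - 7
        int b = 3;
        // Both lines should print the same value.
        System.out.println(Integer.toUnsignedString(remainderViaDivide(a, b)));
        System.out.println(Integer.toUnsignedString(Integer.remainderUnsigned(a, b)));
    }
}
```

The multiplication may overflow, but because the true remainder is smaller than the divisor, the wrapped result of `a - q * b` still equals it.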
------------- PR: https://git.openjdk.java.net/jdk/pull/9104 From adinn at openjdk.java.net Thu Jun 9 14:07:26 2022 From: adinn at openjdk.java.net (Andrew Dinn) Date: Thu, 9 Jun 2022 14:07:26 GMT Subject: RFR: 8287926: AArch64: intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long In-Reply-To: References: Message-ID: On Thu, 9 Jun 2022 10:47:42 GMT, Andrew Haley wrote: > That's all. Looks ok apart from the paren issue spotted by Stuart ------------- Marked as reviewed by adinn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9104 From aph at openjdk.java.net Thu Jun 9 14:17:39 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Thu, 9 Jun 2022 14:17:39 GMT Subject: Integrated: 8288078: linux-aarch64-optimized build fails in Tier5 after JDK-8287567 In-Reply-To: References: Message-ID: On Thu, 9 Jun 2022 09:33:24 GMT, Andrew Haley wrote: > Because neither ASSERT nor PRODUCT are defined in 'optimized' builds , this needs to be surrounded by ASSERT ifdefs, not PRODUCT. This pull request has now been integrated. Changeset: db4405d0 Author: Andrew Haley URL: https://git.openjdk.org/jdk/commit/db4405d0f880dd43dc7da0b81bc2da2619d315b0 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8288078: linux-aarch64-optimized build fails in Tier5 after JDK-8287567 Reviewed-by: adinn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/9103 From aph at openjdk.java.net Thu Jun 9 14:38:35 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Thu, 9 Jun 2022 14:38:35 GMT Subject: RFR: 8287926: AArch64: intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long In-Reply-To: <3nNWjgL7gB2CMoFl9YLB-e4shVn1FUOYJcTipPlhadA=.cf5b78d5-cca1-49c8-97e2-bc07ebafe03c@github.com> References: <3nNWjgL7gB2CMoFl9YLB-e4shVn1FUOYJcTipPlhadA=.cf5b78d5-cca1-49c8-97e2-bc07ebafe03c@github.com> Message-ID: On Thu, 9 Jun 2022 13:31:21 GMT, Stuart Monteith wrote: >> That's all. 
> > src/hotspot/cpu/aarch64/aarch64.ad line 11415: > >> 11413: ins_cost(INSN_COST * 22); >> 11414: format %{ "udivw rscratch1, $src1, $src2\n\t" >> 11415: "msubw($dst, rscratch1, $src2, $src1" %} > > These new strings and the existing signed have this unbalanced "(" . Is there any reason why this is here? Oops, pasto. Will fix. ------------- PR: https://git.openjdk.org/jdk/pull/9104 From aph at openjdk.java.net Thu Jun 9 14:53:42 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Thu, 9 Jun 2022 14:53:42 GMT Subject: RFR: 8287926: AArch64: intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v2] In-Reply-To: References: Message-ID: > That's all. Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: Fix opto assembly in integer mod patterns. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9104/files - new: https://git.openjdk.org/jdk/pull/9104/files/9af63d08..7fe1cbd3 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=9104&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=9104&range=00-01 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/9104.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9104/head:pull/9104 PR: https://git.openjdk.org/jdk/pull/9104 From kvn at openjdk.java.net Thu Jun 9 18:03:26 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 9 Jun 2022 18:03:26 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v8] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: On Thu, 9 Jun 2022 02:54:15 GMT, Fei Gao wrote: >> After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. 
We can also support conversions between different data sizes like: >> int <-> double >> float <-> long >> int <-> long >> float <-> double >> >> A typical test case: >> >> int[] a; >> double[] b; >> for (int i = start; i < limit; i++) { >> b[i] = (double) a[i]; >> } >> >> Our expected OptoAssembly code for one iteration is like below: >> >> add R12, R2, R11, LShiftL #2 >> vector_load V16,[R12, #16] >> vectorcast_i2d V16, V16 # convert I to D vector >> add R11, R1, R11, LShiftL #3 # ptr >> add R13, R11, #16 # ptr >> vector_store [R13], V16 >> >> To enable the vectorization, the patch solves the following problems in the SLP. >> >> There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain >> and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with new type. >> >> After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. >> >> In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. 
Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 are just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's potential to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well.
>>
>> Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use().
>>
>> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.
>>
>> Here is the test data (-XX:+UseSuperWord) on NEON:
>>
>> Before the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15  216.431 ±  0.131  ns/op
>> convertD2I       523  avgt   15  220.522 ±  0.311  ns/op
>> convertF2D       523  avgt   15  217.034 ±  0.292  ns/op
>> convertF2L       523  avgt   15  231.634 ±  1.881  ns/op
>> convertI2D       523  avgt   15  229.538 ±  0.095  ns/op
>> convertI2L       523  avgt   15  214.822 ±  0.131  ns/op
>> convertL2F       523  avgt   15  230.188 ±  0.217  ns/op
>> convertL2I       523  avgt   15  162.234 ±  0.235  ns/op
>>
>> After the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15  124.352 ±  1.079  ns/op
>> convertD2I       523  avgt   15  557.388 ±  8.166  ns/op
>> convertF2D       523  avgt   15  118.082 ±  4.026  ns/op
>> convertF2L       523  avgt   15  225.810 ± 11.180  ns/op
>> convertI2D       523  avgt   15  166.247 ±  0.120  ns/op
>> convertI2L       523  avgt   15  119.699 ±  2.925  ns/op
>> convertL2F       523  avgt   15  220.847 ±  0.053  ns/op
>> convertL2I       523  avgt   15  122.339 ±  2.738  ns/op
>>
>> perf data on X86:
>> Before the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15  279.466 ±  0.069  ns/op
>> convertD2I       523  avgt   15  551.009 ±  7.459  ns/op
>> convertF2D       523  avgt   15  276.066 ±  0.117  ns/op
>> convertF2L       523  avgt   15  545.108 ±  5.697  ns/op
>> convertI2D       523  avgt   15  745.303 ±  0.185  ns/op
>> convertI2L       523  avgt   15  260.878 ±  0.044  ns/op
>> convertL2F       523  avgt   15  502.016 ±  0.172  ns/op
>> convertL2I       523  avgt   15  261.654 ±  3.326  ns/op
>>
>> After the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15  106.975 ±  0.045  ns/op
>> convertD2I       523  avgt   15  546.866 ±  9.287  ns/op
>> convertF2D       523  avgt   15   82.414 ±  0.340  ns/op
>> convertF2L       523  avgt   15  542.235 ±  2.785  ns/op
>> convertI2D       523  avgt   15   92.966 ±  1.400  ns/op
>> convertI2L       523  avgt   15   79.960 ±  0.528  ns/op
>> convertL2F       523  avgt   15  504.712 ±  4.794  ns/op
>> convertL2I       523  avgt   15  129.753 ±  0.094  ns/op
>>
>> perf data on AVX512:
>> Before the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15  282.984 ±  4.022  ns/op
>> convertD2I       523  avgt   15  543.080 ±  3.873  ns/op
>> convertF2D       523  avgt   15  273.950 ±  0.131  ns/op
>> convertF2L       523  avgt   15  539.568 ±  2.747  ns/op
>> convertI2D       523  avgt   15  745.238 ±  0.069  ns/op
>> convertI2L       523  avgt   15  260.935 ±  0.169  ns/op
>> convertL2F       523  avgt   15  501.870 ±  0.359  ns/op
>> convertL2I       523  avgt   15  257.508 ±  0.174  ns/op
>>
>> After the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15   76.687 ±  0.530  ns/op
>> convertD2I       523  avgt   15  545.408 ±  4.657  ns/op
>> convertF2D       523  avgt   15  273.935 ±  0.099  ns/op
>> convertF2L       523  avgt   15  540.534 ±  3.032  ns/op
>> convertI2D       523  avgt   15  745.234 ±  0.053  ns/op
>> convertI2L       523  avgt   15  260.865 ±  0.104  ns/op
>> convertL2F       523  avgt   15   63.834 ±  4.777  ns/op
>> convertL2I       523  avgt   15   48.183 ±  0.990  ns/op
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits:
>
> - Update to the latest JDK and fix the function name
>
>   Change-Id: Ie1907f86e2df7051aa2ddb7e5b05a371e887d1bc
> - Merge branch 'master' into fg8283091
>
>   Change-Id: I3ef746178c07004cc34c22081a3044fb40e87702
> - Add assertion line for opcode() and withdraw some common code as a function
>
>   Change-Id: I7b5dbe60fec6979de454f347d074e6fc01126dfe
> - Merge branch 'master' into fg8283091
>
>   Change-Id: I42bec08da55e86fb1f049bb691138f3fcf6dbed6
> - Implement an interface for auto-vectorization to consult supported match rules
>
>   Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701
> - Merge branch 'master' into fg8283091
>
>   Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd
> - Merge branch 'master' into fg8283091
>
>   Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f
> - Merge branch 'master' into fg8283091
>
>   Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83
> - Add micro-benchmark cases
>
>   Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8
> - Merge branch 'master' into fg8283091
>
>   Change-Id: I674581135fd0844accc65520574fcef161eededa
> - ... and 1 more: https://git.openjdk.org/jdk/compare/230726ea...0d731bb2

Good. I submitted testing.

Please consider adding IR framework test to make sure expected vector nodes are generated.
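The pack-size and alignment reasoning in the quoted description can be illustrated with a little arithmetic. The sketch below is not the superword.cpp logic: the class and helper names are hypothetical, and it only reproduces the numbers from the description (a 128-bit vector holding 4 ints or 2 doubles, and byte alignments that map to the same lane once divided by the element size).

```java
// Hypothetical arithmetic sketch of the pack-size reasoning; not C2 code.
final class PackSizeSketch {

    /** Number of elements of the given size that fit in one vector. */
    static int lanes(int vectorBytes, int elementBytes) {
        return vectorBytes / elementBytes;
    }

    /** Smallest lane count over all element sizes that occur in a def-use chain. */
    static int commonLanes(int vectorBytes, int... elementBytes) {
        int min = Integer.MAX_VALUE;
        for (int size : elementBytes) {
            min = Math.min(min, lanes(vectorBytes, size));
        }
        return min;
    }

    /** Position of a node within its pack, independent of the element size. */
    static int laneIndex(int alignmentBytes, int elementBytes) {
        return alignmentBytes / elementBytes;
    }

    public static void main(String[] args) {
        // LoadI (4 bytes), ConvI2D and StoreD (8 bytes) in a 16-byte vector.
        System.out.println("pack size: " + commonLanes(16, 4, 8, 8));
        // Alignment 4 for LoadI and alignment 8 for ConvI2D name the same lane.
        System.out.println("LoadI lane: " + laneIndex(4, 4) + ", ConvI2D lane: " + laneIndex(8, 8));
    }
}
```

Normalizing the alignment by the element size is what lets the 512-bit example in the description match ConvI2D alignments 16 and 24 with LoadI alignments 8 and 12.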
-------------

PR: https://git.openjdk.org/jdk/pull/7806

From sviswanathan at openjdk.java.net Thu Jun 9 18:18:05 2022
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Thu, 9 Jun 2022 18:18:05 GMT
Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v5]
In-Reply-To:
References:
Message-ID:

On Tue, 7 Jun 2022 03:03:47 GMT, Vladimir Kozlov wrote:

>> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Fix extra space
>
> Results are good.
> You need second review.

@vnkozlov Could I go ahead and integrate? There were some minor changes and code rearrangement after your last test. Please let me know.

-------------

PR: https://git.openjdk.org/jdk/pull/9032

From kvn at openjdk.java.net Thu Jun 9 18:44:07 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Thu, 9 Jun 2022 18:44:07 GMT
Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v6]
In-Reply-To:
References:
Message-ID:

On Wed, 8 Jun 2022 16:26:58 GMT, Sandhya Viswanathan wrote:

>> Currently the C2 JIT only supports float -> int and double -> long conversion for x86.
>> This PR adds the support for following conversions in the c2 JIT:
>> float -> long, short, byte
>> double -> int, short, byte
>>
>> The performance gain is as follows.
>> Before the patch:
>> Benchmark                                      Mode  Cnt       Score       Error   Units
>> VectorFPtoIntCastOperations.microDouble2Byte   thrpt    3   32367.971 ±  6161.118  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Int    thrpt    3   25825.251 ±  5417.104  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Long   thrpt    3   59641.958 ± 17307.177  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Short  thrpt    3   29641.505 ± 12023.015  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Byte    thrpt    3   16271.224 ±  1523.083  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Int     thrpt    3   59199.994 ± 14357.959  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Long    thrpt    3   17169.197 ±  1738.273  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Short   thrpt    3   14934.139 ±  2329.253  ops/ms
>>
>> After the patch:
>> Benchmark                                      Mode  Cnt       Score       Error   Units
>> VectorFPtoIntCastOperations.microDouble2Byte   thrpt    3  115436.659 ± 21282.364  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Int    thrpt    3   87194.395 ±  9443.106  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Long   thrpt    3   59652.356 ±  7240.721  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Short  thrpt    3  110570.719 ± 10401.620  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Byte    thrpt    3  110028.539 ± 11113.137  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Int     thrpt    3   59469.193 ± 18272.495  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Long    thrpt    3   59897.101 ±  7249.268  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Short   thrpt    3   86167.554 ±  8253.232  ops/ms
>>
>> Please review.
>>
>> Best Regards,
>> Sandhya
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
> Review commit resolution

I submitted new testing.

-------------

PR: https://git.openjdk.org/jdk/pull/9032

From sviswanathan at openjdk.java.net Thu Jun 9 20:25:02 2022
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Thu, 9 Jun 2022 20:25:02 GMT
Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v6]
In-Reply-To:
References:
Message-ID:

On Wed, 8 Jun 2022 16:26:58 GMT, Sandhya Viswanathan wrote:

>> Currently the C2 JIT only supports float -> int and double -> long conversion for x86.
>> This PR adds the support for following conversions in the c2 JIT:
>> float -> long, short, byte
>> double -> int, short, byte
>>
>> The performance gain is as follows.
>> Before the patch:
>> Benchmark                                      Mode  Cnt       Score       Error   Units
>> VectorFPtoIntCastOperations.microDouble2Byte   thrpt    3   32367.971 ±  6161.118  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Int    thrpt    3   25825.251 ±  5417.104  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Long   thrpt    3   59641.958 ± 17307.177  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Short  thrpt    3   29641.505 ± 12023.015  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Byte    thrpt    3   16271.224 ±  1523.083  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Int     thrpt    3   59199.994 ± 14357.959  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Long    thrpt    3   17169.197 ±  1738.273  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Short   thrpt    3   14934.139 ±  2329.253  ops/ms
>>
>> After the patch:
>> Benchmark                                      Mode  Cnt       Score       Error   Units
>> VectorFPtoIntCastOperations.microDouble2Byte   thrpt    3  115436.659 ± 21282.364  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Int    thrpt    3   87194.395 ±  9443.106  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Long   thrpt    3   59652.356 ±  7240.721  ops/ms
>> VectorFPtoIntCastOperations.microDouble2Short  thrpt    3  110570.719 ± 10401.620  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Byte    thrpt    3  110028.539 ± 11113.137  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Int     thrpt    3   59469.193 ± 18272.495  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Long    thrpt    3   59897.101 ±  7249.268  ops/ms
>> VectorFPtoIntCastOperations.microFloat2Short   thrpt    3   86167.554 ±  8253.232  ops/ms
>>
>> Please review.
>>
>> Best Regards,
>> Sandhya
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
> Review commit resolution

Thanks a lot Vladimir!
------------- PR: https://git.openjdk.org/jdk/pull/9032 From kvn at openjdk.java.net Thu Jun 9 23:51:16 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 9 Jun 2022 23:51:16 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v8] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: On Thu, 9 Jun 2022 02:54:15 GMT, Fei Gao wrote: >> After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like: >> int <-> double >> float <-> long >> int <-> long >> float <-> double >> >> A typical test case: >> >> int[] a; >> double[] b; >> for (int i = start; i < limit; i++) { >> b[i] = (double) a[i]; >> } >> >> Our expected OptoAssembly code for one iteration is like below: >> >> add R12, R2, R11, LShiftL #2 >> vector_load V16,[R12, #16] >> vectorcast_i2d V16, V16 # convert I to D vector >> add R11, R1, R11, LShiftL #3 # ptr >> add R13, R11, #16 # ptr >> vector_store [R13], V16 >> >> To enable the vectorization, the patch solves the following problems in the SLP. >> >> There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain >> and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. 
In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate a valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with the new type. >> >> After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. >> >> In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignments for 2 ConvI2D nodes are 0, 8. Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 is just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining whether it's possible to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. We believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well. >> >> Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use(). >> >> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes. >> >> Here is the test data (-XX:+UseSuperWord) on NEON: >> >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 216.431 ±
0.131 ns/op >> convertD2I 523 avgt 15 220.522 ± 0.311 ns/op >> convertF2D 523 avgt 15 217.034 ± 0.292 ns/op >> convertF2L 523 avgt 15 231.634 ± 1.881 ns/op >> convertI2D 523 avgt 15 229.538 ± 0.095 ns/op >> convertI2L 523 avgt 15 214.822 ± 0.131 ns/op >> convertL2F 523 avgt 15 230.188 ± 0.217 ns/op >> convertL2I 523 avgt 15 162.234 ± 0.235 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 124.352 ± 1.079 ns/op >> convertD2I 523 avgt 15 557.388 ± 8.166 ns/op >> convertF2D 523 avgt 15 118.082 ± 4.026 ns/op >> convertF2L 523 avgt 15 225.810 ± 11.180 ns/op >> convertI2D 523 avgt 15 166.247 ± 0.120 ns/op >> convertI2L 523 avgt 15 119.699 ± 2.925 ns/op >> convertL2F 523 avgt 15 220.847 ± 0.053 ns/op >> convertL2I 523 avgt 15 122.339 ± 2.738 ns/op >> >> perf data on X86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 279.466 ± 0.069 ns/op >> convertD2I 523 avgt 15 551.009 ± 7.459 ns/op >> convertF2D 523 avgt 15 276.066 ± 0.117 ns/op >> convertF2L 523 avgt 15 545.108 ± 5.697 ns/op >> convertI2D 523 avgt 15 745.303 ± 0.185 ns/op >> convertI2L 523 avgt 15 260.878 ± 0.044 ns/op >> convertL2F 523 avgt 15 502.016 ± 0.172 ns/op >> convertL2I 523 avgt 15 261.654 ± 3.326 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 106.975 ± 0.045 ns/op >> convertD2I 523 avgt 15 546.866 ± 9.287 ns/op >> convertF2D 523 avgt 15 82.414 ± 0.340 ns/op >> convertF2L 523 avgt 15 542.235 ± 2.785 ns/op >> convertI2D 523 avgt 15 92.966 ± 1.400 ns/op >> convertI2L 523 avgt 15 79.960 ± 0.528 ns/op >> convertL2F 523 avgt 15 504.712 ± 4.794 ns/op >> convertL2I 523 avgt 15 129.753 ± 0.094 ns/op >> >> perf data on AVX512: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 282.984 ± 4.022 ns/op >> convertD2I 523 avgt 15 543.080 ± 3.873 ns/op >> convertF2D 523 avgt 15 273.950 ±
0.131 ns/op >> convertF2L 523 avgt 15 539.568 ± 2.747 ns/op >> convertI2D 523 avgt 15 745.238 ± 0.069 ns/op >> convertI2L 523 avgt 15 260.935 ± 0.169 ns/op >> convertL2F 523 avgt 15 501.870 ± 0.359 ns/op >> convertL2I 523 avgt 15 257.508 ± 0.174 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 76.687 ± 0.530 ns/op >> convertD2I 523 avgt 15 545.408 ± 4.657 ns/op >> convertF2D 523 avgt 15 273.935 ± 0.099 ns/op >> convertF2L 523 avgt 15 540.534 ± 3.032 ns/op >> convertI2D 523 avgt 15 745.234 ± 0.053 ns/op >> convertI2L 523 avgt 15 260.865 ± 0.104 ns/op >> convertL2F 523 avgt 15 63.834 ± 4.777 ns/op >> convertL2I 523 avgt 15 48.183 ± 0.990 ns/op > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits: > > - Update to the latest JDK and fix the function name > > Change-Id: Ie1907f86e2df7051aa2ddb7e5b05a371e887d1bc > - Merge branch 'master' into fg8283091 > > Change-Id: I3ef746178c07004cc34c22081a3044fb40e87702 > - Add assertion line for opcode() and withdraw some common code as a function > > Change-Id: I7b5dbe60fec6979de454f347d074e6fc01126dfe > - Merge branch 'master' into fg8283091 > > Change-Id: I42bec08da55e86fb1f049bb691138f3fcf6dbed6 > - Implement an interface for auto-vectorization to consult supported match rules > > Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701 > - Merge branch 'master' into fg8283091 > > Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd > - Merge branch 'master' into fg8283091 > > Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f > - Merge branch 'master' into fg8283091 > > Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 > - Add micro-benchmark cases > > Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8 > - Merge branch 'master' into fg8283091 > > Change-Id: I674581135fd0844accc65520574fcef161eededa > - ...
and 1 more: https://git.openjdk.org/jdk/compare/230726ea...0d731bb2 Testing results are good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/7806 From sviswanathan at openjdk.java.net Fri Jun 10 00:09:00 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 10 Jun 2022 00:09:00 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long In-Reply-To: References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Wed, 8 Jun 2022 09:39:04 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch implements intrinsics for `Integer/Long::compareUnsigned` using the same approach as the JVM does for long and floating-point comparisons. This allows efficient and reliable usage of unsigned comparison in Java, which is a basic operation and is important for range checks such as discussed in #8620 . >> >> Thank you very much. > > I have added a benchmark for the intrinsic. The result is as follows, thanks a lot: > > Before After > Benchmark (size) Mode Cnt Score Error Score Error Units > Integers.compareUnsigned 500 avgt 15 0.527 ? 0.002 0.498 ? 0.011 us/op > Longs.compareUnsigned 500 avgt 15 0.677 ? 0.014 0.561 ? 0.006 us/op @merykitty Could you please also add the micro benchmark where compareUnsigned result is stored directly in an integer and show the performance of that? ------------- PR: https://git.openjdk.org/jdk/pull/9068 From kvn at openjdk.java.net Fri Jun 10 01:27:03 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 10 Jun 2022 01:27:03 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v6] In-Reply-To: References: Message-ID: On Thu, 9 Jun 2022 20:21:52 GMT, Sandhya Viswanathan wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> Review commit resolution > > Thanks a lot Vladimir! 
@sviswa7 testing results are good. You can push. ------------- PR: https://git.openjdk.org/jdk/pull/9032 From kvn at openjdk.java.net Fri Jun 10 01:36:05 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 10 Jun 2022 01:36:05 GMT Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls In-Reply-To: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> Message-ID: On Fri, 20 May 2022 16:27:51 GMT, Evgeny Astigeevich wrote: > ## Problem > Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter. > > Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides the address of the stub and the address of its owner. `relocInfo` has an offset which is an offset from the previously known relocatable address. The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset. > > Each Java call has: > - A relocation for a call site. > - A relocation for a stub to the interpreter. > - A stub to the interpreter. > - If far jumps are used (arm64 case): > - A trampoline relocation. > - A trampoline. > > We cannot avoid creating relocations. They are needed to support patching call sites. > With shared stubs there will be multiple relocations having the same stub address but different owners' addresses. 
> If we try to generate relocations as we go there will be a case which requires negative offsets: > > reloc1 ---> 0x0: stub1 > reloc2 ---> 0x4: stub2 (reloc2.addr = reloc1.addr + reloc2.offset = 0x0 + 4) > reloc3 ---> 0x0: stub1 (reloc3.addr = reloc2.addr + reloc3.offset = 0x4 - 4) > > > `CodeSection` does not support negative offsets. It [assumes](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L195) that the addresses relocations point at grow upward. > Negative offsets reduce the offset range by half. This can increase the number of filler records, the empty `relocInfo` records used to reduce offset values. Also negative offsets are only needed for `static_stub_type`, but the other 13 types don't need them. > > ## Solution > In this PR creation of stubs is done in two stages. First we collect requests for creating shared stubs: a callee `ciMethod*` and an offset of a call in `CodeBuffer` (see [src/hotspot/share/asm/codeBuffer.hpp](https://github.com/openjdk/jdk/pull/8816/files#diff-deb8ab083311ba60c0016dc34d6518579bbee4683c81e8d348982bac897fe8ae)). Then we have the finalisation phase (see [src/hotspot/share/ci/ciEnv.cpp](https://github.com/openjdk/jdk/pull/8816/files#diff-7c032de54e85754d39e080fd24d49b7469543b163f54229eb0631c6b1bf26450)), where `CodeBuffer::finalize_stubs()` creates shared stubs in `CodeBuffer`: a stub and multiple relocations sharing it. The first relocation will have a positive offset. The rest will have zero offsets. This approach does not need negative offsets. As creation of relocations and stubs is platform dependent, `CodeBuffer::finalize_stubs()` calls `CodeBuffer::pd_finalize_stubs()` where platforms should put their code. > > This PR provides implementations for x86, x86_64 and aarch64.
[src/hotspot/share/asm/codeBuffer.inline.hpp](https://github.com/openjdk/jdk/pull/8816/files#diff-c268e3719578f2980edaa27c0eacbe9f620124310108eb65d0f765212c7042eb) provides the `emit_shared_stubs_to_interp` template which x86, x86_64 and aarch64 platforms use. Other platforms can use it too. Platforms supporting shared stubs to the interpreter must have `CodeBuffer::supports_shared_stubs()` returning `true`. > > ## Results > **Results from [Renaissance 0.14.0](https://github.com/renaissance-benchmarks/renaissance/releases/tag/v0.14.0)** > Note: 'Nmethods with shared stubs' is the total number of nmethods counted during benchmark's run. 'Final # of nmethods' is a number of nmethods in CodeCache when JVM exited. > - AArch64 > > +------------------+-------------+----------------------------+---------------------+ > | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | > +------------------+-------------+----------------------------+---------------------+ > | dotty | 820544 | 4592 | 18872 | > | dec-tree | 405280 | 2580 | 22335 | > | naive-bayes | 392384 | 2586 | 21184 | > | log-regression | 362208 | 2450 | 20325 | > | als | 306048 | 2226 | 18161 | > | finagle-chirper | 262304 | 2087 | 12675 | > | movie-lens | 250112 | 1937 | 13617 | > | gauss-mix | 173792 | 1262 | 10304 | > | finagle-http | 164320 | 1392 | 11269 | > | page-rank | 155424 | 1175 | 10330 | > | chi-square | 140384 | 1028 | 9480 | > | akka-uct | 115136 | 541 | 3941 | > | reactors | 43264 | 335 | 2503 | > | scala-stm-bench7 | 42656 | 326 | 3310 | > | philosophers | 36576 | 256 | 2902 | > | scala-doku | 35008 | 231 | 2695 | > | rx-scrabble | 32416 | 273 | 2789 | > | future-genetic | 29408 | 260 | 2339 | > | scrabble | 27968 | 225 | 2477 | > | par-mnemonics | 19584 | 168 | 1689 | > | fj-kmeans | 19296 | 156 | 1647 | > | scala-kmeans | 18080 | 140 | 1629 | > | mnemonics | 17408 | 143 | 1512 | > +------------------+-------------+----------------------------+---------------------+ > > - 
X86_64 > > +------------------+-------------+----------------------------+---------------------+ > | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | > +------------------+-------------+----------------------------+---------------------+ > | dotty | 337065 | 4403 | 19135 | > | dec-tree | 183045 | 2559 | 22071 | > | naive-bayes | 176460 | 2450 | 19782 | > | log-regression | 162555 | 2410 | 20648 | > | als | 121275 | 1980 | 17179 | > | movie-lens | 111915 | 1842 | 13020 | > | finagle-chirper | 106350 | 1947 | 12726 | > | gauss-mix | 81975 | 1251 | 10474 | > | finagle-http | 80895 | 1523 | 12294 | > | page-rank | 68940 | 1146 | 10124 | > | chi-square | 62130 | 974 | 9315 | > | akka-uct | 50220 | 555 | 4263 | > | reactors | 23385 | 371 | 2544 | > | philosophers | 17625 | 259 | 2865 | > | scala-stm-bench7 | 17235 | 295 | 3230 | > | scala-doku | 15600 | 214 | 2698 | > | rx-scrabble | 14190 | 262 | 2770 | > | future-genetic | 13155 | 253 | 2318 | > | scrabble | 12300 | 217 | 2352 | > | fj-kmeans | 8985 | 157 | 1616 | > | par-mnemonics | 8535 | 155 | 1684 | > | scala-kmeans | 8250 | 138 | 1624 | > | mnemonics | 7485 | 134 | 1522 | > +------------------+-------------+----------------------------+---------------------+ > > > **Testing: fastdebug and release builds for x86, x86_64 and aarch64** > - `tier1`...`tier4`: Passed > - `hotspot/jtreg/compiler/sharedstubs`: Passed GHA testing is not clean. I looked through changes and they seem logically correct. Need more testing. I will wait when GHA is clean. 
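A toy model of the offset argument in the quoted description, assuming relocation targets are delta-encoded against the previous target: emitting relocations as calls are seen can require a negative delta, while grouping all relocations of one shared stub together (the two-stage approach) yields one positive delta followed by zeros. This is a sketch in plain Java, not HotSpot code; the class, the method names, the stub size of 4 and the start address 0x10 are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SharedStubOffsetsSketch {
    // Delta-encode relocation target addresses relative to the previous target.
    static List<Integer> deltas(int base, List<Integer> targets) {
        List<Integer> out = new ArrayList<>();
        int prev = base;
        for (int addr : targets) {
            out.add(addr - prev);
            prev = addr;
        }
        return out;
    }

    // Stage 2 of the toy model: assign one stub per distinct callee, then emit
    // all relocations that share a stub back to back.
    static List<Integer> finalizeStubs(List<String> calleePerCall, int firstStubAddr, int stubSize) {
        Map<String, Integer> stubAddr = new LinkedHashMap<>();
        for (String callee : calleePerCall) {
            if (!stubAddr.containsKey(callee)) {
                stubAddr.put(callee, firstStubAddr + stubAddr.size() * stubSize);
            }
        }
        List<Integer> targets = new ArrayList<>();
        for (Map.Entry<String, Integer> stub : stubAddr.entrySet()) {
            for (String callee : calleePerCall) {
                if (callee.equals(stub.getKey())) {
                    targets.add(stub.getValue());
                }
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // Emitting as we go: calls to m1, m2, m1 point at stubs 0x10, 0x14, 0x10.
        System.out.println(deltas(0, List.of(0x10, 0x14, 0x10))); // [16, 4, -4]
        // Collect first, then finalize: relocations for the same stub are adjacent.
        System.out.println(deltas(0, finalizeStubs(List.of("m1", "m2", "m1"), 0x10, 4))); // [16, 0, 4]
    }
}
```

The second output contains no negative delta, which is the property the description relies on to avoid changing `CodeSection`.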
------------- PR: https://git.openjdk.org/jdk/pull/8816 From jbhateja at openjdk.java.net Fri Jun 10 03:26:01 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 10 Jun 2022 03:26:01 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v5] In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 16:28:54 GMT, Sandhya Viswanathan wrote: >> src/hotspot/cpu/x86/x86.ad line 7388: >> >>> 7386: case T_BYTE: >>> 7387: __ evpmovsqd($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc); >>> 7388: __ evpmovdb($dst$$XMMRegister, $dst$$XMMRegister, vlen_enc); >> >> Sub-word handling can be extended for AVX2 using packing instruction sequence similar to VectorStoreMask for quadword lanes. > > D2X in general needs AVX 512 due to evcvttpd2qq. Thanks @sviswa7, for AVX we can use VCVTTPD2DQ to cast double precision lanes to integer and subsequently to sub-word lanes. For casting to long we do not have a direct instruction. ------------- PR: https://git.openjdk.org/jdk/pull/9032 From ksakata at openjdk.java.net Fri Jun 10 06:09:33 2022 From: ksakata at openjdk.java.net (Koichi Sakata) Date: Fri, 10 Jun 2022 06:09:33 GMT Subject: RFR: 8283612: IGV: Remove Graal module Message-ID: This pull request removes Graal module in Ideal Graph Visualizer. I think you might know that Graal JIT compiler was removed from OpenJDK in version 17. So IGV doesn't need to support Graal's dump files any more. It is noted that GraalVM has its own version of IGV (https://www.graalvm.org/22.1/tools/igv/). It seems that there are no test cases related to Graal module. I've built IGV, run it and opened the graphs. Those were all successful.
------------- Commit messages: - Remove Graal module Changes: https://git.openjdk.org/jdk/pull/9119/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9119&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8283612 Stats: 423 lines in 16 files changed: 0 ins; 423 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9119.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9119/head:pull/9119 PR: https://git.openjdk.org/jdk/pull/9119 From rcastanedalo at openjdk.java.net Fri Jun 10 06:50:15 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 10 Jun 2022 06:50:15 GMT Subject: RFR: 8283612: IGV: Remove Graal module In-Reply-To: References: Message-ID: On Fri, 10 Jun 2022 06:01:48 GMT, Koichi Sakata wrote: > This pull request removes Graal module in Ideal Graph Visualizer. I think you might know that Graal JIT compiler was removed from OpenJDK in version 17. So IGV doesn't need to support Graal's dump files any more. It is noted that GraalVM has its own version of IGV (https://www.graalvm.org/22.1/tools/igv/). > > It seems that there are no test cases related to Graal module. I've built IGV, run it and opened the graphs. Those were all successful. Looks good! Thank you for this and your previous contributions to IGV, Koichi. ------------- Marked as reviewed by rcastanedalo (Committer). PR: https://git.openjdk.org/jdk/pull/9119 From chagedorn at openjdk.java.net Fri Jun 10 07:21:06 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Fri, 10 Jun 2022 07:21:06 GMT Subject: RFR: 8283612: IGV: Remove Graal module In-Reply-To: References: Message-ID: On Fri, 10 Jun 2022 06:01:48 GMT, Koichi Sakata wrote: > This pull request removes Graal module in Ideal Graph Visualizer. I think you might know that Graal JIT compiler was removed from OpenJDK in version 17. So IGV doesn't need to support Graal's dump files any more. 
It is noted that GraalVM has its own version of IGV (https://www.graalvm.org/22.1/tools/igv/). > > It seems that there are no test cases related to Graal module. I've built IGV, run it and opened the graphs. Those were all successful. Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9119 From jbhateja at openjdk.java.net Fri Jun 10 07:42:07 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 10 Jun 2022 07:42:07 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v6] In-Reply-To: References: Message-ID: On Wed, 8 Jun 2022 16:26:58 GMT, Sandhya Viswanathan wrote: >> Currently the C2 JIT only supports float -> int and double -> long conversion for x86. >> This PR adds the support for following conversions in the c2 JIT: >> float -> long, short, byte >> double -> int, short, byte >> >> The performance gain is as follows. >> Before the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 32367.971 ? 6161.118 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 25825.251 ? 5417.104 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59641.958 ? 17307.177 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 29641.505 ? 12023.015 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 16271.224 ? 1523.083 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59199.994 ? 14357.959 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 17169.197 ? 1738.273 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 14934.139 ? 2329.253 ops/ms >> >> After the patch: >> Benchmark Mode Cnt Score Error Units >> VectorFPtoIntCastOperations.microDouble2Byte thrpt 3 115436.659 ? 21282.364 ops/ms >> VectorFPtoIntCastOperations.microDouble2Int thrpt 3 87194.395 ? 9443.106 ops/ms >> VectorFPtoIntCastOperations.microDouble2Long thrpt 3 59652.356 ? 
7240.721 ops/ms >> VectorFPtoIntCastOperations.microDouble2Short thrpt 3 110570.719 ? 10401.620 ops/ms >> VectorFPtoIntCastOperations.microFloat2Byte thrpt 3 110028.539 ? 11113.137 ops/ms >> VectorFPtoIntCastOperations.microFloat2Int thrpt 3 59469.193 ? 18272.495 ops/ms >> VectorFPtoIntCastOperations.microFloat2Long thrpt 3 59897.101 ? 7249.268 ops/ms >> VectorFPtoIntCastOperations.microFloat2Short thrpt 3 86167.554 ? 8253.232 ops/ms >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Review commit resolution test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java line 79: > 77: float_arr[i] = ran.nextFloat(); > 78: double_arr[i] = ran.nextDouble(); > 79: } Can you kindly also add special floating point values NaN, +/-Inf, +/-0.0 to input array to cover your special handling code changes. ------------- PR: https://git.openjdk.org/jdk/pull/9032 From ksakata at openjdk.java.net Fri Jun 10 08:01:06 2022 From: ksakata at openjdk.java.net (Koichi Sakata) Date: Fri, 10 Jun 2022 08:01:06 GMT Subject: RFR: 8283612: IGV: Remove Graal module In-Reply-To: References: Message-ID: On Fri, 10 Jun 2022 06:01:48 GMT, Koichi Sakata wrote: > This pull request removes Graal module in Ideal Graph Visualizer. I think you might know that Graal JIT compiler was removed from OpenJDK in version 17. So IGV doesn't need to support Graal's dump files any more. It is noted that GraalVM has its own version of IGV (https://www.graalvm.org/22.1/tools/igv/). > > It seems that there are no test cases related to Graal module. I've built IGV, run it and opened the graphs. Those were all successful. Thank you for the reviews, Roberto and Christian! 
------------- PR: https://git.openjdk.org/jdk/pull/9119 From thartmann at openjdk.java.net Fri Jun 10 08:17:06 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Fri, 10 Jun 2022 08:17:06 GMT Subject: RFR: 8286197: C2: Optimize MemorySegment shape in int loop [v3] In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 14:42:39 GMT, Roland Westrelin wrote: >> This is another small enhancement for a code shape that showed up in a >> MemorySegment micro benchmark. The shape to optimize is the one from test1: >> >> >> for (int i = 0; i < size; i++) { >> long j = i * UNSAFE.ARRAY_INT_INDEX_SCALE; >> >> j = Objects.checkIndex(j, size * 4); >> >> if (((base + j) & 3) != 0) { >> throw new RuntimeException(); >> } >> >> v += UNSAFE.getInt(base + j); >> } >> >> >> In that code shape, the loop iv is first scaled, result is then casted >> to long, range checked and finally address of memory location is >> computed. >> >> The alignment check is transformed so the loop body has no check In >> order to eliminate the range check, that loop is transformed into: >> >> >> for (int i1 = ..) { >> for (int i2 = ..) { >> long j = (i1 + i2) * UNSAFE.ARRAY_INT_INDEX_SCALE; >> >> j = Objects.checkIndex(j, size * 4); >> >> v += UNSAFE.getInt(base + j); >> } >> } >> >> >> The address shape is (AddP base (CastLL (ConvI2L (LShiftI (AddI ... >> >> In this case, the type of the ConvI2L is [min_jint, max_jint] and type >> of CastLL is [0, max_jint] (the CastLL has a narrower type). >> >> I propose transforming (CastLL (ConvI2L into (ConvI2L (CastII in that >> case. The convI2L and CastII types can be set to [0, max_jint]. The >> new address shape is then: >> >> (AddP base (ConvI2L (CastII (LShiftI (AddI ... >> >> which optimize well. >> >> (LShiftI (AddI ... >> is transformed into >> (AddI (LShiftI ... >> because one of the AddI input is loop invariant (i2) and we have: >> >> (AddP base (ConvI2L (CastII (AddI (LShiftI ... 
>> >> Then because the ConvI2L and CastII types are [0, max_jint], the AddI >> is pushed through the ConvI2L and CastII: >> >> (AddP base (AddL (ConvI2L (CastII (LShiftI ... >> >> base and one of the inputs of the AddL are loop invariant so this is >> transformed into: >> >> (AddP (AddP ...) (ConvI2L (CastII (LShiftI ... >> >> The (AddP ...) is loop invariant so computed before entry. The >> (ConvI2L ...) only depends on the loop iv. >> >> The resulting address is a shift + an add. The address before >> transformation requires 2 adds + a shift. Also after unrolling, the >> address of the second access in the loop is cheaper to compute as it >> can be derived from the address of the first access. >> >> For all of this to work: >> 1) I added a CastLL::Ideal transformation: >> (CastLL (ConvI2L into (ConvI2L (CastII >> >> 2) I also had to prevent split-if from transforming (LShiftI (Phi for the >> iv Phi of a counted loop. >> >> >> test2 and test3 test 1) and 2) separately. > > Roland Westrelin has updated the pull request incrementally with three additional commits since the last revision: > > - Update test/hotspot/jtreg/compiler/c2/irTests/TestConvI2LCastLongLoop.java > > Co-authored-by: Tobias Hartmann > - Update src/hotspot/share/opto/castnode.hpp > > Co-authored-by: Tobias Hartmann > - Update src/hotspot/share/opto/castnode.cpp > > Co-authored-by: Tobias Hartmann Marked as reviewed by thartmann (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/8555 From roland at openjdk.java.net Fri Jun 10 08:20:39 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 10 Jun 2022 08:20:39 GMT Subject: Integrated: 8286197: C2: Optimize MemorySegment shape in int loop In-Reply-To: References: Message-ID: On Thu, 5 May 2022 14:57:11 GMT, Roland Westrelin wrote: > This is another small enhancement for a code shape that showed up in a > MemorySegment micro benchmark.
The shape to optimize is the one from test1: > > > for (int i = 0; i < size; i++) { > long j = i * UNSAFE.ARRAY_INT_INDEX_SCALE; > > j = Objects.checkIndex(j, size * 4); > > if (((base + j) & 3) != 0) { > throw new RuntimeException(); > } > > v += UNSAFE.getInt(base + j); > } > > > In that code shape, the loop iv is first scaled, result is then casted > to long, range checked and finally address of memory location is > computed. > > The alignment check is transformed so the loop body has no check In > order to eliminate the range check, that loop is transformed into: > > > for (int i1 = ..) { > for (int i2 = ..) { > long j = (i1 + i2) * UNSAFE.ARRAY_INT_INDEX_SCALE; > > j = Objects.checkIndex(j, size * 4); > > v += UNSAFE.getInt(base + j); > } > } > > > The address shape is (AddP base (CastLL (ConvI2L (LShiftI (AddI ... > > In this case, the type of the ConvI2L is [min_jint, max_jint] and type > of CastLL is [0, max_jint] (the CastLL has a narrower type). > > I propose transforming (CastLL (ConvI2L into (ConvI2L (CastII in that > case. The convI2L and CastII types can be set to [0, max_jint]. The > new address shape is then: > > (AddP base (ConvI2L (CastII (LShiftI (AddI ... > > which optimize well. > > (LShiftI (AddI ... > is transformed into > (AddI (LShiftI ... > because one of the AddI input is loop invariant (i2) and we have: > > (AddP base (ConvI2L (CastII (AddI (LShiftI ... > > Then because the ConvI2L and CastII types are [0, max_jint], the AddI > is pushed through the ConvI2L and CastII: > > (AddP base (AddL (ConvI2L (CastII (LShiftI ... > > base and one of the inputs of the AddL are loop invariant so this > transformed into: > > (AddP (AddP ...) (ConvI2L (CastII (LShiftI ... > > The (AddP ...) is loop invariant so computed before entry. The > (ConvI2L ...) only depends on the loop iv. > > The resulting address is a shift + an add. The address before > transformation requires 2 adds + a shift. 
Also after unrolling, the > address of the second access in the loop is cheaper to compute as it > can be derived from the address of the first access. > > For all of this to work: > 1) I added a CastLL::Ideal transformation: > (CastLL (ConvI2L into (ConvI2L (CastII > > 2) I also had to prevent split-if from transforming (LShiftI (Phi for the > iv Phi of a counted loop. > > > test2 and test3 test 1) and 2) separately. This pull request has now been integrated. Changeset: dae4c493 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/dae4c493e48b6bb942cf6f629f1ff8839e32e54a Stats: 167 lines in 5 files changed: 167 ins; 0 del; 0 mod 8286197: C2: Optimize MemorySegment shape in int loop Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/8555 From duke at openjdk.java.net Fri Jun 10 11:25:11 2022 From: duke at openjdk.java.net (Swati Sharma) Date: Fri, 10 Jun 2022 11:25:11 GMT Subject: RFR: 8287525: Extend IR annotation with new options to test specific target feature. [v5] In-Reply-To: References: Message-ID: > Hi All, > > Currently test invocations are guarded by @requires vm.cpu.feature tags which are specified as part of the test tag specifications. This results in generating multiple test cases if some test points in a test file need to be guarded by specific features while others should still be executed in the absence of the target feature. > > This is especially important for IR-check-based validation since C2 IR node creation may heavily rely on the existence of a specific target feature. Also, the test harness executes test points only if all the constraints specified in the tag specifications are met, thus imposing an OR semantics between @requires tag based CPU features becomes tricky.
> > The patch extends the existing @IR annotation with the following two new options: > > - applyIfTargetFeatureAnd: > Accepts a list of feature pairs where each pair is composed of a target feature string followed by a true/false value, where a true value requires the existence of the target feature and vice-versa. IR verification checks are enforced only if all the specified feature constraints are met. > - applyIfTargetFeatureOr: Accepts similar arguments as the above option, but IR verification checks are enforced only when at least one of the specified feature constraints is met. > > Example usage: > @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureOr = {"avx512bw", "true", "avx512f", "true"}) > @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureAnd = {"avx512bw", "true", "avx512f", "true"}) > > Please review and share your feedback. > > Thanks, > Swati Swati Sharma has updated the pull request incrementally with one additional commit since the last revision: 8287525: Resolved review comments. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/8999/files - new: https://git.openjdk.org/jdk/pull/8999/files/5f02f608..f7dcf317 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8999&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8999&range=03-04 Stats: 18 lines in 1 file changed: 1 ins; 0 del; 17 mod Patch: https://git.openjdk.org/jdk/pull/8999.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8999/head:pull/8999 PR: https://git.openjdk.org/jdk/pull/8999 From duke at openjdk.java.net Fri Jun 10 11:39:08 2022 From: duke at openjdk.java.net (Evgeny Astigeevich) Date: Fri, 10 Jun 2022 11:39:08 GMT Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls In-Reply-To: References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> Message-ID: On Fri, 10 Jun 2022 01:33:56 GMT, Vladimir Kozlov wrote: > GHA testing is not clean. > > I looked through changes and they seem logically correct.
Need more testing. I will wait until GHA is clean. Thank you, Vladimir. ------------- PR: https://git.openjdk.org/jdk/pull/8816 From sviswanathan at openjdk.java.net Fri Jun 10 17:05:31 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 10 Jun 2022 17:05:31 GMT Subject: Integrated: 8287835: Add support for additional float/double to integral conversion for x86 In-Reply-To: References: Message-ID: On Sat, 4 Jun 2022 22:13:32 GMT, Sandhya Viswanathan wrote: > Currently the C2 JIT only supports float -> int and double -> long conversion for x86. > This PR adds support for the following conversions in the C2 JIT: > float -> long, short, byte > double -> int, short, byte > > The performance gain is as follows.
> Before the patch:
> Benchmark                                       Mode  Cnt       Score       Error   Units
> VectorFPtoIntCastOperations.microDouble2Byte   thrpt    3   32367.971 ±  6161.118  ops/ms
> VectorFPtoIntCastOperations.microDouble2Int    thrpt    3   25825.251 ±  5417.104  ops/ms
> VectorFPtoIntCastOperations.microDouble2Long   thrpt    3   59641.958 ± 17307.177  ops/ms
> VectorFPtoIntCastOperations.microDouble2Short  thrpt    3   29641.505 ± 12023.015  ops/ms
> VectorFPtoIntCastOperations.microFloat2Byte    thrpt    3   16271.224 ±  1523.083  ops/ms
> VectorFPtoIntCastOperations.microFloat2Int     thrpt    3   59199.994 ± 14357.959  ops/ms
> VectorFPtoIntCastOperations.microFloat2Long    thrpt    3   17169.197 ±  1738.273  ops/ms
> VectorFPtoIntCastOperations.microFloat2Short   thrpt    3   14934.139 ±  2329.253  ops/ms
>
> After the patch:
> Benchmark                                       Mode  Cnt       Score       Error   Units
> VectorFPtoIntCastOperations.microDouble2Byte   thrpt    3  115436.659 ± 21282.364  ops/ms
> VectorFPtoIntCastOperations.microDouble2Int    thrpt    3   87194.395 ±  9443.106  ops/ms
> VectorFPtoIntCastOperations.microDouble2Long   thrpt    3   59652.356 ±  7240.721  ops/ms
> VectorFPtoIntCastOperations.microDouble2Short  thrpt    3  110570.719 ± 10401.620  ops/ms
> VectorFPtoIntCastOperations.microFloat2Byte    thrpt    3  110028.539 ± 11113.137  ops/ms
> VectorFPtoIntCastOperations.microFloat2Int     thrpt    3   59469.193 ± 18272.495  ops/ms
> VectorFPtoIntCastOperations.microFloat2Long    thrpt    3   59897.101 ±  7249.268  ops/ms
> VectorFPtoIntCastOperations.microFloat2Short   thrpt    3   86167.554 ±  8253.232  ops/ms
>
> Please review. > > Best Regards, > Sandhya This pull request has now been integrated. Changeset: 2cc40afa Author: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/2cc40afa075b1cf749db98d5a6c6cb1c548ba85d Stats: 474 lines in 7 files changed: 461 ins; 0 del; 13 mod 8287835: Add support for additional float/double to integral conversion for x86 Reviewed-by: kvn, jbhateja ------------- PR: https://git.openjdk.org/jdk/pull/9032 From sviswanathan at openjdk.java.net Fri Jun 10 17:25:08 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 10 Jun 2022 17:25:08 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v5] In-Reply-To: References: Message-ID: On Fri, 10 Jun 2022 03:22:19 GMT, Jatin Bhateja wrote: >> D2X in general needs AVX 512 due to evcvttpd2qq. > > Thanks @sviswa7 , for AVX we can use VCVTTPD2DQ to cast double-precision lanes to integer and subsequently to subword lanes. For casting to long we do not have a direct instruction. Thanks @jatin-bhateja. I have updated the RFE (https://bugs.openjdk.org/browse/JDK-8288043) to include this.
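For reference, the scalar conversions discussed in this thread follow Java's narrowing rules for floating-point values: NaN becomes 0, out-of-range values saturate for int and long, and conversions to short or byte go through int first. The sketch below is only a plain-Java illustration of those rules; the class and method names are hypothetical and are not taken from the PR or its tests.

```java
import java.util.Arrays;

// Hypothetical scalar reference loops for the conversions named in the PR.
// They only illustrate Java's narrowing semantics, not the generated code.
final class FpNarrowingDemo {

    // float -> long: NaN maps to 0, +/-Infinity saturates to Long.MAX/MIN_VALUE.
    static long[] floatToLong(float[] in) {
        long[] out = new long[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (long) in[i];
        }
        return out;
    }

    // double -> short: converts to int first (saturating), then keeps the low 16 bits.
    static short[] doubleToShort(double[] in) {
        short[] out = new short[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (short) in[i];
        }
        return out;
    }

    // double -> byte: converts to int first (saturating), then keeps the low 8 bits.
    static byte[] doubleToByte(double[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (byte) in[i];
        }
        return out;
    }

    public static void main(String[] args) {
        float[] f = {Float.NaN, Float.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY, -0.0f, 1.9f};
        double[] d = {Double.NaN, Double.POSITIVE_INFINITY, -0.0, 300.7, -1.5};
        System.out.println(Arrays.toString(floatToLong(f)));
        System.out.println(Arrays.toString(doubleToShort(d)));
        System.out.println(Arrays.toString(doubleToByte(d)));
    }
}
```

A vectorized implementation has to reproduce these results lane by lane, which is why the special values (NaN, infinities, signed zero) matter for functional tests.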
------------- PR: https://git.openjdk.org/jdk/pull/9032 From sviswanathan at openjdk.java.net Fri Jun 10 17:35:13 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 10 Jun 2022 17:35:13 GMT Subject: RFR: 8287835: Add support for additional float/double to integral conversion for x86 [v6] In-Reply-To: References: Message-ID: On Fri, 10 Jun 2022 07:37:59 GMT, Jatin Bhateja wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> Review commit resolution > > test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java line 79: > >> 77: float_arr[i] = ran.nextFloat(); >> 78: double_arr[i] = ran.nextDouble(); >> 79: } > > Can you kindly also add special floating point values NaN, +/-Inf, +/-0.0 to input array to cover your special handling code changes. @jatin-bhateja The test is only checking the IR node generation for x86. The rest of the actual functionality test is already covered under the following including the special cases: compiler/codegen/TestByteDoubleVect.java compiler/codegen/TestByteFloatVect.java compiler/codegen/TestShortFloatVect.java compiler/codegen/TestShortDoubleVect.java compiler/codegen/TestLongFloatVect.java compiler/codegen/TestIntDoubleVect.java compiler/codegen/TestIntFloatVect.java The general idea of this PR was to complement x86 FP to integral conversion along with https://git.openjdk.org/jdk/pull/7806 from Fei Gao. ------------- PR: https://git.openjdk.org/jdk/pull/9032 From xgong at openjdk.java.net Mon Jun 13 01:53:32 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Mon, 13 Jun 2022 01:53:32 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v2] In-Reply-To: References: Message-ID: > VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. 
For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. > > For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. > > Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. > > Here is an example for vector load and add reduction inside a loop: > > ptrue p0.s, vl8 ; mask generation > ld1w {z16.s}, p0/z, [x14] ; load vector > > ptrue p0.s, vl8 ; mask generation > uaddv d17, p0, z16.s ; add reduction > smov x14, v17.s[0] > > As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. > > Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. 
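The idea of a partial vector operation controlled by a loop-invariant mask can be modelled in plain Java. This is only a sketch under stated assumptions: it does not use the Vector API or any C2 node, the class and method names are hypothetical, and arrays stand in for a 16-lane (512-bit) int register of which only the low 8 lanes (256 bits) are active.

```java
// Hypothetical software model of a partial vector load and add reduction.
final class PartialVectorMaskModel {

    // Models "ptrue p0.s, vl8": the first activeLanes lanes are set, the rest are clear.
    static boolean[] maskGen(int physicalLanes, int activeLanes) {
        boolean[] mask = new boolean[physicalLanes];
        for (int lane = 0; lane < activeLanes; lane++) {
            mask[lane] = true;
        }
        return mask;
    }

    // Models a zeroing masked load: inactive lanes are set to 0 and memory is not read for them.
    static int[] maskedLoad(int[] mem, int offset, boolean[] mask) {
        int[] reg = new int[mask.length];
        for (int lane = 0; lane < mask.length; lane++) {
            reg[lane] = mask[lane] ? mem[offset + lane] : 0;
        }
        return reg;
    }

    // Models a masked add reduction: only active lanes contribute.
    static int addReduce(int[] reg, boolean[] mask) {
        int sum = 0;
        for (int lane = 0; lane < reg.length; lane++) {
            if (mask[lane]) {
                sum += reg[lane];
            }
        }
        return sum;
    }

    // The mask depends only on the vector shape, so it is built once before the loop
    // and shared by the load and the reduction instead of being regenerated per op.
    static int sum(int[] data, int physicalLanes, int activeLanes) {
        boolean[] mask = maskGen(physicalLanes, activeLanes);
        int total = 0;
        for (int i = 0; i + activeLanes <= data.length; i += activeLanes) {
            int[] reg = maskedLoad(data, i, mask);
            total += addReduce(reg, mask);
        }
        return total;
    }
}
```

In this model the single `maskGen` call before the loop corresponds to the mask node that the patch makes visible to the optimizer, so that duplicates can be commoned and the computation hoisted.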
> > Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: > > Benchmark size Gain > Byte256Vector.ADDLanes 1024 0.999 > Byte256Vector.ANDLanes 1024 1.065 > Byte256Vector.MAXLanes 1024 1.064 > Byte256Vector.MINLanes 1024 1.062 > Byte256Vector.ORLanes 1024 1.072 > Byte256Vector.XORLanes 1024 1.041 > Short256Vector.ADDLanes 1024 1.017 > Short256Vector.ANDLanes 1024 1.044 > Short256Vector.MAXLanes 1024 1.049 > Short256Vector.MINLanes 1024 1.049 > Short256Vector.ORLanes 1024 1.089 > Short256Vector.XORLanes 1024 1.047 > Int256Vector.ADDLanes 1024 1.045 > Int256Vector.ANDLanes 1024 1.078 > Int256Vector.MAXLanes 1024 1.123 > Int256Vector.MINLanes 1024 1.129 > Int256Vector.ORLanes 1024 1.078 > Int256Vector.XORLanes 1024 1.072 > Long256Vector.ADDLanes 1024 1.059 > Long256Vector.ANDLanes 1024 1.101 > Long256Vector.MAXLanes 1024 1.079 > Long256Vector.MINLanes 1024 1.099 > Long256Vector.ORLanes 1024 1.098 > Long256Vector.XORLanes 1024 1.110 > Float256Vector.ADDLanes 1024 1.033 > Float256Vector.MAXLanes 1024 1.156 > Float256Vector.MINLanes 1024 1.151 > Double256Vector.ADDLanes 1024 1.062 > Double256Vector.MAXLanes 1024 1.145 > Double256Vector.MINLanes 1024 1.140 > > This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. 
So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: > > sxtw x14, w14 > whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Revert transformation from MaskAll to VectorMaskGen, address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9037/files - new: https://git.openjdk.org/jdk/pull/9037/files/008a2a16..c71db592 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=9037&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=9037&range=00-01 Stats: 35 lines in 5 files changed: 5 ins; 16 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/9037.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9037/head:pull/9037 PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.java.net Mon Jun 13 01:53:32 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Mon, 13 Jun 2022 01:53:32 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v2] In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 19:58:37 GMT, Vladimir Kozlov wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Revert transformation from MaskAll to VectorMaskGen, address review comments > > Changes are significant. I suggest waiting for JDK 20 (next week). Hi @vnkozlov , all your review comments are resolved. Could you please take another look at this PR? Thanks so much!
------------- PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.java.net Mon Jun 13 01:53:32 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Mon, 13 Jun 2022 01:53:32 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v2] In-Reply-To: References: Message-ID: On Thu, 9 Jun 2022 01:23:49 GMT, Xiaohong Gong wrote: >> I think changes in #8877 influences the max vector length in superword? And since `MaskAll` is used for VectorAPI, the `MaxVectorSize` is still the right reference? @jatin-bhateja, could you please help to check whether this has any influence on x86 avx-512 system? Thanks so much! >> >>> And I don't see in(2)->Opcode() == Op_VectorMaskGen check. >> >> Yes, the `Op_VectorMaskGen` is not generated for `MaskAll` when its input is a constant. We directly transform the `MaskAll` to `VectorMaskGen` here, since they two have the same meanings. Thanks! > >>> And I don't see in(2)->Opcode() == Op_VectorMaskGen check. > >>Yes, the Op_VectorMaskGen is not generated for MaskAll when its input is a constant. We directly transform the MaskAll to VectorMaskGen here, since they two have the same meanings. Thanks! > > I'm sorry that my comment in line-1819 is not right which misunderstood you. I will change this later. Thanks! I prefer to not transform `MaskAll` to `VectorMaskGen` now, since there are the match rules using `MaskAll m1` both in sve and avx-512. Doing the transformation may influence those rules. ------------- PR: https://git.openjdk.org/jdk/pull/9037 From chagedorn at openjdk.java.net Mon Jun 13 07:44:07 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Mon, 13 Jun 2022 07:44:07 GMT Subject: RFR: 8287525: Extend IR annotation with new options to test specific target feature. 
[v5] In-Reply-To: References: Message-ID: On Fri, 10 Jun 2022 11:25:11 GMT, Swati Sharma wrote: >> Hi All, >> >> Currently test invocations are guarded by @requires vm.cpu.feature tags which are specified as the part of test tag specifications. This results into generating multiple test cases if some test points in a test file needs to be guarded by a specific features while others should still be executed in absence of missing target feature. >> >> This is specially important for IR checks based validation since C2 IR nodes creation may heavily rely on existence of specific target feature. Also, test harness executes test points only if all the constraints specified in tag specifications are met, thus imposing an OR semantics b/w @requires tag based CPU features becomes tricky. >> >> Patch extends existing @IR annotation with following two new options:- >> >> - applyIfTargetFeatureAnd: >> Accepts a list of feature pairs where each pair is composed of target feature string followed by a true/false value where a true value necessities existence of target feature and vice-versa. IR verifications checks are enforced only if all the specified feature constraints are met. >> - applyIfTargetFeatureOr: Accepts similar arguments as above option but IR verifications checks are enforced only when at least one of the specified feature constraints are met. >> >> Example usage: >> @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureOr = {"avx512bw", "true", "avx512f", "true"}) >> @IR(counts = {"AddVI", "> 0"}, applyIfTargetFeatureAnd = {"avx512bw", "true", "avx512f", "true"}) >> >> Please review and share your feedback. >> >> Thanks, >> Swati > > Swati Sharma has updated the pull request incrementally with one additional commit since the last revision: > > 8287525: Resolved review comments. That looks good to me. Thanks for doing all the updates! I'll submit some testing. ------------- Marked as reviewed by chagedorn (Reviewer). 
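The AND/OR semantics of the two options quoted above can be sketched as a small evaluator over (feature, "true"/"false") pairs. This is an illustration only, assuming the set of available CPU features is already known; the class and method names are hypothetical and this is not the IR framework's implementation.

```java
import java.util.Set;

// Hypothetical evaluator for feature-pair constraints such as
// {"avx512bw", "true", "avx512f", "true"}.
final class TargetFeatureConstraints {

    // A pair holds when the feature's presence equals the requested boolean.
    static boolean pairHolds(String feature, String expected, Set<String> cpuFeatures) {
        return cpuFeatures.contains(feature) == Boolean.parseBoolean(expected);
    }

    // AND semantics: every pair must hold for the IR rule to be applied.
    static boolean applyIfAnd(String[] pairs, Set<String> cpuFeatures) {
        checkPairs(pairs);
        for (int i = 0; i < pairs.length; i += 2) {
            if (!pairHolds(pairs[i], pairs[i + 1], cpuFeatures)) {
                return false;
            }
        }
        return true;
    }

    // OR semantics: at least one pair must hold for the IR rule to be applied.
    static boolean applyIfOr(String[] pairs, Set<String> cpuFeatures) {
        checkPairs(pairs);
        for (int i = 0; i < pairs.length; i += 2) {
            if (pairHolds(pairs[i], pairs[i + 1], cpuFeatures)) {
                return true;
            }
        }
        return false;
    }

    private static void checkPairs(String[] pairs) {
        if (pairs.length == 0 || pairs.length % 2 != 0) {
            throw new IllegalArgumentException("expected a non-empty list of feature/value pairs");
        }
    }
}
```

On a machine that reports `avx512f` but not `avx512bw`, the example pairs would enable the rule under the OR option and skip it under the AND option.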
PR: https://git.openjdk.org/jdk/pull/8999 From ngasson at openjdk.java.net Mon Jun 13 08:24:06 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Mon, 13 Jun 2022 08:24:06 GMT Subject: RFR: 8287926: AArch64: intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v2] In-Reply-To: References: Message-ID: On Thu, 9 Jun 2022 14:53:42 GMT, Andrew Haley wrote: >> That's all. > > Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: > > Fix opto assembly in integer mod patterns. Marked as reviewed by ngasson (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9104 From ksakata at openjdk.java.net Mon Jun 13 08:27:07 2022 From: ksakata at openjdk.java.net (Koichi Sakata) Date: Mon, 13 Jun 2022 08:27:07 GMT Subject: Integrated: 8283612: IGV: Remove Graal module In-Reply-To: References: Message-ID: On Fri, 10 Jun 2022 06:01:48 GMT, Koichi Sakata wrote: > This pull request removes Graal module in Ideal Graph Visualizer. I think you might know that Graal JIT compiler was removed from OpenJDK in version 17. So IGV doesn't need to support Graal's dump files any more. It is noted that GraalVM has its own version of IGV (https://www.graalvm.org/22.1/tools/igv/). > > It seems that there are no test cases related to Graal module. I've built IGV, run it and opened the graphs. Those were all successful. This pull request has now been integrated. 
Changeset: ac28be72 Author: Koichi Sakata Committer: Roberto Castañeda Lozano URL: https://git.openjdk.org/jdk/commit/ac28be721feb2d14120132f6b289ca436acf0406 Stats: 423 lines in 16 files changed: 0 ins; 423 del; 0 mod 8283612: IGV: Remove Graal module Reviewed-by: rcastanedalo, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/9119 From ngasson at openjdk.java.net Mon Jun 13 08:27:07 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Mon, 13 Jun 2022 08:27:07 GMT Subject: RFR: 8287028: AArch64: [vectorapi] Backend implementation of VectorMask.fromLong with SVE2 [v2] In-Reply-To: References: <9f4FuUVXKxeO6tC6so96ydn3nss81T7s0KvV03XlnCc=.75152f52-5b9f-4a84-bd36-0547899fa061@github.com> Message-ID: On Thu, 2 Jun 2022 08:24:56 GMT, Eric Liu wrote: >> This patch implements AArch64 codegen for VectorLongToMask using the >> SVE2 BitPerm feature. With this patch, the final code (generated on an >> SVE vector reg size of 512-bit QEMU emulator) is shown as below: >> >> mov z17.b, #0 >> mov v17.d[0], x13 >> sunpklo z17.h, z17.b >> sunpklo z17.s, z17.h >> sunpklo z17.d, z17.s >> mov z16.b, #1 >> bdep z17.d, z17.d, z16.d >> cmpne p0.b, p7/z, z17.b, #0 > > Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - Merge jdk:master > > Change-Id: I7cea9b028f60c447f7cc24a00d38f59e0f07ecd3 > - AArch64: [vectorapi] Backend implementation of VectorMask.fromLong with SVE2 > > This patch implements AArch64 codegen for VectorLongToMask using the > SVE2 BitPerm feature. With this patch, the final code (generated on an > SVE vector reg size of 512-bit QEMU emulator) is shown as below: > > mov z17.b, #0 > mov v17.d[0], x13 > sunpklo z17.h, z17.b > sunpklo z17.s, z17.h > sunpklo z17.d, z17.s > mov z16.b, #1 > bdep z17.d, z17.d, z16.d > cmpne p0.b, p7/z, z17.b, #0 > > Change-Id: I9135fce39c8a08c72b757c78b258f5d968baa7ff Marked as reviewed by ngasson (Reviewer).
------------- PR: https://git.openjdk.org/jdk/pull/8789 From adinn at openjdk.java.net Mon Jun 13 08:50:08 2022 From: adinn at openjdk.java.net (Andrew Dinn) Date: Mon, 13 Jun 2022 08:50:08 GMT Subject: RFR: 8287926: AArch64: intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v2] In-Reply-To: References: Message-ID: On Thu, 9 Jun 2022 14:53:42 GMT, Andrew Haley wrote: >> That's all. > > Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: > > Fix opto assembly in integer mod patterns. Still good Marked as reviewed by adinn (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9104 From ksakata at openjdk.java.net Mon Jun 13 08:59:05 2022 From: ksakata at openjdk.java.net (Koichi Sakata) Date: Mon, 13 Jun 2022 08:59:05 GMT Subject: RFR: 8283612: IGV: Remove Graal module In-Reply-To: References: Message-ID: On Fri, 10 Jun 2022 06:01:48 GMT, Koichi Sakata wrote: > This pull request removes Graal module in Ideal Graph Visualizer. I think you might know that Graal JIT compiler was removed from OpenJDK in version 17. So IGV doesn't need to support Graal's dump files any more. It is noted that GraalVM has its own version of IGV (https://www.graalvm.org/22.1/tools/igv/). > > It seems that there are no test cases related to Graal module. I've built IGV, run it and opened the graphs. Those were all successful. Thank you for sponsoring, Roberto! 
------------- PR: https://git.openjdk.org/jdk/pull/9119 From chagedorn at openjdk.java.net Mon Jun 13 09:00:09 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Mon, 13 Jun 2022 09:00:09 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump In-Reply-To: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: On Thu, 2 Jun 2022 09:16:28 GMT, Emanuel Peter wrote: > **Goal** > Refactor `Node::find`, allow not just searching for `node->_idx`, but also matching for `node->Name()` and even `node->dump()`. > > **Proposal** > Refactor `Node::find` into `visit_nodes`, which visits all nodes and calls a `callback` on them. This callback can be used to filter by `idx` (`find_node_by_idx`, `Node::find`, `find_node` etc.). It can also be used to match node names (`find_node_by_name`) and even node dump (`find_node_by_dump`). > > Thus, I present these additional functions: > `Node* find_node_by_name(const char* name)`: find all nodes matching the `name` pattern. > `Node* find_node_by_dump(const char* pattern)`: find all nodes matching the `pattern`. > The nodes are sorted by node idx, and then dumped. > > Patterns can contain `*` characters to match any characters (eg. `Con*L` matches both `ConL` and `ConvI2L`) > > **Usecase** > Find all `CastII` nodes. Find all `Loop` nodes. Use `find_node_by_name`. > > Find all all `CastII` nodes that depend on a rangecheck. Use `find_node_by_dump("CastII*range check dependency")`. > Find all `Bool` nodes that perform a `[ne]` check. Use `find_node_by_dump("Bool*[ne]")`. > Find all `Phi` nodes that are `tripcount`. Use `find_node_by_dump("Phi*tripcount")`. > > Find all `Load` nodes that are associated with line 301 in some file. Use `find_node_by_dump("Load*line 301")`. > > You can probably find more usecases yourself ;) Otherwise, looks good! 
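The `*` patterns described in the quoted proposal can be sketched in a few lines. The sketch assumes an unanchored match, where the literal pieces between `*` characters must appear in order somewhere in the text; the examples such as `Load*line 301` searching a whole dump line suggest this, but the exact semantics of the HotSpot code are not shown here. The class and method names are hypothetical.

```java
// Hypothetical sketch of "*" wildcard matching over a node name or dump line.
final class StarPatternSketch {

    // Returns true if all literal parts of the pattern occur in the text, in order.
    static boolean matches(String text, String pattern) {
        int from = 0;
        for (String part : pattern.split("\\*")) {
            if (part.isEmpty()) {
                continue; // leading or repeated '*' contributes nothing to match
            }
            int at = text.indexOf(part, from);
            if (at < 0) {
                return false;
            }
            from = at + part.length(); // the next part must start after this one
        }
        return true;
    }
}
```

With this sketch, `matches("ConvI2L", "Con*L")` and `matches("ConL", "Con*L")` both hold, while `matches("CastII", "Con*L")` does not.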
These are useful new `find` methods to have! src/hotspot/share/opto/node.cpp line 1623: > 1621: return _worklist[i]; > 1622: } > 1623: size_t size() { You should add some new lines to better separate the methods. src/hotspot/share/opto/node.cpp line 1627: > 1625: } > 1626: private: > 1627: uint _index = 0; Is not used anymore and can be removed. src/hotspot/share/opto/node.cpp line 1703: > 1701: char buf[N]; // copy parts of pattern into this > 1702: const char* s = str; > 1703: const char* r = &pattern[0]; // cast array to char* You can directly use `pattern` which is the same as `&pattern[0]` Suggestion: const char* r = pattern; Maybe you could also use a more descriptive name for `s` and `r`. Maybe `str_index` and `pattern_index`, respectively? src/hotspot/share/opto/node.cpp line 1712: > 1710: strncpy(buf, r, r_part_len); > 1711: buf[r_part_len] = '\0'; // end of string > 1712: r_part = &buf[0]; // cast array to char* Suggestion: r_part = buf; ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/8988 From rcastanedalo at openjdk.java.net Mon Jun 13 09:39:09 2022 From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano) Date: Mon, 13 Jun 2022 09:39:09 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump In-Reply-To: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: On Thu, 2 Jun 2022 09:16:28 GMT, Emanuel Peter wrote: > **Goal** > Refactor `Node::find`, allow not just searching for `node->_idx`, but also matching for `node->Name()` and even `node->dump()`. > > **Proposal** > Refactor `Node::find` into `visit_nodes`, which visits all nodes and calls a `callback` on them. This callback can be used to filter by `idx` (`find_node_by_idx`, `Node::find`, `find_node` etc.).
It can also be used to match node names (`find_node_by_name`) and even node dump (`find_node_by_dump`). > > Thus, I present these additional functions: > `Node* find_node_by_name(const char* name)`: find all nodes matching the `name` pattern. > `Node* find_node_by_dump(const char* pattern)`: find all nodes matching the `pattern`. > The nodes are sorted by node idx, and then dumped. > > Patterns can contain `*` characters to match any characters (eg. `Con*L` matches both `ConL` and `ConvI2L`) > > **Usecase** > Find all `CastII` nodes. Find all `Loop` nodes. Use `find_node_by_name`. > > Find all all `CastII` nodes that depend on a rangecheck. Use `find_node_by_dump("CastII*range check dependency")`. > Find all `Bool` nodes that perform a `[ne]` check. Use `find_node_by_dump("Bool*[ne]")`. > Find all `Phi` nodes that are `tripcount`. Use `find_node_by_dump("Phi*tripcount")`. > > Find all `Load` nodes that are associated with line 301 in some file. Use `find_node_by_dump("Load*line 301")`. > > You can probably find more usecases yourself ;) Thanks for adding this functionality, I tested it and works well! src/hotspot/share/opto/node.cpp line 1692: > 1690: } > 1691: return nullptr; // no i was a match > 1692: } Suggesting: move this code to `src/hotspot/share/utilities/stringUtils.cpp`. ------------- Marked as reviewed by rcastanedalo (Committer). PR: https://git.openjdk.org/jdk/pull/8988 From epeter at openjdk.java.net Mon Jun 13 10:52:04 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Mon, 13 Jun 2022 10:52:04 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump In-Reply-To: References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: On Mon, 13 Jun 2022 08:18:28 GMT, Christian Hagedorn wrote: >> **Goal** >> Refactor `Node::find`, allow not just searching for `node->_idx`, but also matching for `node->Name()` and even `node->dump()`. 
>> >> **Proposal** >> Refactor `Node::find` into `visit_nodes`, which visits all nodes and calls a `callback` on them. This callback can be used to filter by `idx` (`find_node_by_idx`, `Node::find`, `find_node` etc.). It can also be used to match node names (`find_node_by_name`) and even node dump (`find_node_by_dump`). >> >> Thus, I present these additional functions: >> `Node* find_node_by_name(const char* name)`: find all nodes matching the `name` pattern. >> `Node* find_node_by_dump(const char* pattern)`: find all nodes matching the `pattern`. >> The nodes are sorted by node idx, and then dumped. >> >> Patterns can contain `*` characters to match any characters (eg. `Con*L` matches both `ConL` and `ConvI2L`) >> >> **Usecase** >> Find all `CastII` nodes. Find all `Loop` nodes. Use `find_node_by_name`. >> >> Find all all `CastII` nodes that depend on a rangecheck. Use `find_node_by_dump("CastII*range check dependency")`. >> Find all `Bool` nodes that perform a `[ne]` check. Use `find_node_by_dump("Bool*[ne]")`. >> Find all `Phi` nodes that are `tripcount`. Use `find_node_by_dump("Phi*tripcount")`. >> >> Find all `Load` nodes that are associated with line 301 in some file. Use `find_node_by_dump("Load*line 301")`. >> >> You can probably find more usecases yourself ;) > > src/hotspot/share/opto/node.cpp line 1623: > >> 1621: return _worklist[i]; >> 1622: } >> 1623: size_t size() { > > You should add some new lines to better separate the methods. done, thanks > src/hotspot/share/opto/node.cpp line 1627: > >> 1625: } >> 1626: private: >> 1627: uint _index = 0; > > Is not used anymore and can be removed. 
removed, thanks ------------- PR: https://git.openjdk.org/jdk/pull/8988 From epeter at openjdk.java.net Mon Jun 13 10:59:00 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Mon, 13 Jun 2022 10:59:00 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump In-Reply-To: References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: On Mon, 13 Jun 2022 08:40:26 GMT, Christian Hagedorn wrote: >> **Goal** >> Refactor `Node::find`, allow not just searching for `node->_idx`, but also matching for `node->Name()` and even `node->dump()`. >> >> **Proposal** >> Refactor `Node::find` into `visit_nodes`, which visits all nodes and calls a `callback` on them. This callback can be used to filter by `idx` (`find_node_by_idx`, `Node::find`, `find_node` etc.). It can also be used to match node names (`find_node_by_name`) and even node dump (`find_node_by_dump`). >> >> Thus, I present these additional functions: >> `Node* find_node_by_name(const char* name)`: find all nodes matching the `name` pattern. >> `Node* find_node_by_dump(const char* pattern)`: find all nodes matching the `pattern`. >> The nodes are sorted by node idx, and then dumped. >> >> Patterns can contain `*` characters to match any characters (eg. `Con*L` matches both `ConL` and `ConvI2L`) >> >> **Usecase** >> Find all `CastII` nodes. Find all `Loop` nodes. Use `find_node_by_name`. >> >> Find all all `CastII` nodes that depend on a rangecheck. Use `find_node_by_dump("CastII*range check dependency")`. >> Find all `Bool` nodes that perform a `[ne]` check. Use `find_node_by_dump("Bool*[ne]")`. >> Find all `Phi` nodes that are `tripcount`. Use `find_node_by_dump("Phi*tripcount")`. >> >> Find all `Load` nodes that are associated with line 301 in some file. Use `find_node_by_dump("Load*line 301")`. 
>> >> You can probably find more usecases yourself ;) > > src/hotspot/share/opto/node.cpp line 1703: > >> 1701: char buf[N]; // copy parts of pattern into this >> 1702: const char* s = str; >> 1703: const char* r = &pattern[0]; // cast array to char* > > You can directly use `pattern` which is the same as `&pattern[0]` > Suggestion: > > const char* r = pattern; > > Maybe you could also use a more descriptive name for `s` and `r`. Maybe `str_index` and `pattern_index`, respectively? renamed the variables even more comprehensively, thanks for the hint! ------------- PR: https://git.openjdk.org/jdk/pull/8988 From epeter at openjdk.java.net Mon Jun 13 11:19:53 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Mon, 13 Jun 2022 11:19:53 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump In-Reply-To: References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: <_-_KbqTMsoDRyWgT12XJsK3opi_3D4SiuHraE2vKJQQ=.b428fcdd-f2fb-494b-9034-a676edcbce37@github.com> On Mon, 13 Jun 2022 09:33:31 GMT, Roberto Castañeda Lozano wrote: >> **Goal** >> Refactor `Node::find`, allow not just searching for `node->_idx`, but also matching for `node->Name()` and even `node->dump()`. >> >> **Proposal** >> Refactor `Node::find` into `visit_nodes`, which visits all nodes and calls a `callback` on them. This callback can be used to filter by `idx` (`find_node_by_idx`, `Node::find`, `find_node` etc.). It can also be used to match node names (`find_node_by_name`) and even node dump (`find_node_by_dump`). >> >> Thus, I present these additional functions: >> `Node* find_node_by_name(const char* name)`: find all nodes matching the `name` pattern. >> `Node* find_node_by_dump(const char* pattern)`: find all nodes matching the `pattern`. >> The nodes are sorted by node idx, and then dumped. >> >> Patterns can contain `*` characters to match any characters (eg.
`Con*L` matches both `ConL` and `ConvI2L`) >> >> **Usecase** >> Find all `CastII` nodes. Find all `Loop` nodes. Use `find_node_by_name`. >> >> Find all all `CastII` nodes that depend on a rangecheck. Use `find_node_by_dump("CastII*range check dependency")`. >> Find all `Bool` nodes that perform a `[ne]` check. Use `find_node_by_dump("Bool*[ne]")`. >> Find all `Phi` nodes that are `tripcount`. Use `find_node_by_dump("Phi*tripcount")`. >> >> Find all `Load` nodes that are associated with line 301 in some file. Use `find_node_by_dump("Load*line 301")`. >> >> You can probably find more usecases yourself ;) > > src/hotspot/share/opto/node.cpp line 1692: > >> 1690: } >> 1691: return nullptr; // no i was a match >> 1692: } > > Suggesting: move this code to `src/hotspot/share/utilities/stringUtils.cpp`. moving the two string functions to that file, thanks for the suggestion! ------------- PR: https://git.openjdk.org/jdk/pull/8988 From epeter at openjdk.java.net Mon Jun 13 11:29:16 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Mon, 13 Jun 2022 11:29:16 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump [v2] In-Reply-To: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: > **Goal** > Refactor `Node::find`, allow not just searching for `node->_idx`, but also matching for `node->Name()` and even `node->dump()`. > > **Proposal** > Refactor `Node::find` into `visit_nodes`, which visits all nodes and calls a `callback` on them. This callback can be used to filter by `idx` (`find_node_by_idx`, `Node::find`, `find_node` etc.). It can also be used to match node names (`find_node_by_name`) and even node dump (`find_node_by_dump`). > > Thus, I present these additional functions: > `Node* find_node_by_name(const char* name)`: find all nodes matching the `name` pattern. 
> `Node* find_node_by_dump(const char* pattern)`: find all nodes matching the `pattern`. > The nodes are sorted by node idx, and then dumped. > > Patterns can contain `*` characters to match any characters (eg. `Con*L` matches both `ConL` and `ConvI2L`) > > **Usecase** > Find all `CastII` nodes. Find all `Loop` nodes. Use `find_node_by_name`. > > Find all all `CastII` nodes that depend on a rangecheck. Use `find_node_by_dump("CastII*range check dependency")`. > Find all `Bool` nodes that perform a `[ne]` check. Use `find_node_by_dump("Bool*[ne]")`. > Find all `Phi` nodes that are `tripcount`. Use `find_node_by_dump("Phi*tripcount")`. > > Find all `Load` nodes that are associated with line 301 in some file. Use `find_node_by_dump("Load*line 301")`. > > You can probably find more usecases yourself ;) Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: changes responding to review by Christian and Roberto ------------- Changes: - all: https://git.openjdk.org/jdk/pull/8988/files - new: https://git.openjdk.org/jdk/pull/8988/files/44c1ee26..4a6530f6 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8988&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8988&range=00-01 Stats: 127 lines in 3 files changed: 67 ins; 57 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/8988.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8988/head:pull/8988 PR: https://git.openjdk.org/jdk/pull/8988 From epeter at openjdk.java.net Mon Jun 13 11:29:17 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Mon, 13 Jun 2022 11:29:17 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump [v2] In-Reply-To: References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: On Mon, 13 Jun 2022 08:37:59 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the 
last revision: >> >> changes responding to review by Christian and Roberto > > src/hotspot/share/opto/node.cpp line 1712: > >> 1710: strncpy(buf, r, r_part_len); >> 1711: buf[r_part_len] = '\0'; // end of string >> 1712: r_part = &buf[0]; // cast array to char* > > Suggestion: > > r_part = buf; checking if I can do that, remember I had some issues with a platform. ------------- PR: https://git.openjdk.org/jdk/pull/8988 From epeter at openjdk.java.net Mon Jun 13 11:42:49 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Mon, 13 Jun 2022 11:42:49 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump [v3] In-Reply-To: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: > **Goal** > Refactor `Node::find`, allow not just searching for `node->_idx`, but also matching for `node->Name()` and even `node->dump()`. > > **Proposal** > Refactor `Node::find` into `visit_nodes`, which visits all nodes and calls a `callback` on them. This callback can be used to filter by `idx` (`find_node_by_idx`, `Node::find`, `find_node` etc.). It can also be used to match node names (`find_node_by_name`) and even node dump (`find_node_by_dump`). > > Thus, I present these additional functions: > `Node* find_node_by_name(const char* name)`: find all nodes matching the `name` pattern. > `Node* find_node_by_dump(const char* pattern)`: find all nodes matching the `pattern`. > The nodes are sorted by node idx, and then dumped. > > Patterns can contain `*` characters to match any characters (eg. `Con*L` matches both `ConL` and `ConvI2L`) > > **Usecase** > Find all `CastII` nodes. Find all `Loop` nodes. Use `find_node_by_name`. > > Find all all `CastII` nodes that depend on a rangecheck. Use `find_node_by_dump("CastII*range check dependency")`. > Find all `Bool` nodes that perform a `[ne]` check. 
Use `find_node_by_dump("Bool*[ne]")`. > Find all `Phi` nodes that are `tripcount`. Use `find_node_by_dump("Phi*tripcount")`. > > Find all `Load` nodes that are associated with line 301 in some file. Use `find_node_by_dump("Load*line 301")`. > > You can probably find more usecases yourself ;) Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: fix header issues ------------- Changes: - all: https://git.openjdk.org/jdk/pull/8988/files - new: https://git.openjdk.org/jdk/pull/8988/files/4a6530f6..425515f3 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8988&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8988&range=01-02 Stats: 3 lines in 2 files changed: 2 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/8988.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8988/head:pull/8988 PR: https://git.openjdk.org/jdk/pull/8988 From epeter at openjdk.java.net Mon Jun 13 11:49:01 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Mon, 13 Jun 2022 11:49:01 GMT Subject: Integrated: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 13:04:55 GMT, Emanuel Peter wrote: > **What this gives you for the debugger** > - BFS traversal (inputs / outputs) > - node filtering by category > - shortest path between nodes > - all paths between nodes > - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional) > - and more > > **Some usecases** > - more readable `dump` > - follow only nodes of some categories (only control, only data, etc) > - find which control nodes depend on data node (visit data nodes, include control in boundary) > - how two nodes relate (shortest / all paths, following input/output nodes, or both) > - find loops (control / memory / data: call all paths with node as start and target) > > **Description** > I implemented 
VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`). To find all paths between two nodes, include the letter `A` in the options string. > > `void Node::dump_bfs(const int max_distance, Node* target, char const* options)` > > To get familiar with the many options, run this to get help: > `find_node(0)->dump_bfs(0,0,"h")` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > PS: I do plan to refactor the `dump` code in `node.cpp` to use my new infrastructure. I will also remove `Node::related` and `dump_related,` since it has not been properly extended and maintained. But that refactoring would risk messing with tools that depend on `dump`, which I would like to avoid for now, and do that in a second step. > > **Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. > 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`. > 3. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 4. Separate visit / boundary filters by node type: traverse graph visiting only some node types (eg. data). 
On the boundary, also display but do not traverse nodes allowed by boundary filter (eg. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for visit filter, and `DCMXO` for boundary filter. > 5. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors. > 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. Use `@` in options string. > 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string. > 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string! 
> > Example (BFS inputs): > > (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") > dist dump > --------------------------------------------- > 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > > > Example (BFS control inputs): > > (rr) p find_node(163)->dump_bfs(5,0,"c+") > dist dump > --------------------------------------------- > 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 4 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 0 163 OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 > > We see the control flow of a strip mined loop. 
> > > Experiment (BFS only data, but display all nodes on boundary) > > (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") > dist dump > --------------------------------------------- > 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) > 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] > > We see the dependent output nodes of the data-phi 102, we see that a SafePoint and the Return depend on it. Here colors are really helpful, as it makes it easy to separate the data-nodes (blue) from the boundary-nodes (other colors). 
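The visit/boundary split used in the query above can be mimicked with a plain BFS. The following Java sketch is only an illustration under stated assumptions: the class `BoundaryBfsSketch`, its `Category` enum and the hand-built edge tables are hypothetical stand-ins, and only the node numbers and output edges are taken from the dump above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class BoundaryBfsSketch {
    enum Category { CONTROL, DATA }

    // Output edges and categories of a few nodes, copied by hand from the dump.
    static final Map<Integer, List<Integer>> OUTPUTS = Map.of(
        102, List.of(133, 132),
        133, List.of(136),
        132, List.of(133),
        136, List.of(153, 167, 102),
        153, List.of(154),
        167, List.of(163),
        154, List.<Integer>of(),
        163, List.<Integer>of());
    static final Map<Integer, Category> CATEGORY = Map.of(
        102, Category.DATA, 133, Category.DATA, 132, Category.DATA,
        136, Category.DATA, 153, Category.DATA,
        167, Category.CONTROL, 154, Category.CONTROL, 163, Category.CONTROL);

    // BFS over output edges. Nodes whose category is in `visit` are traversed
    // further; nodes only allowed by `boundary` are reported but not traversed.
    static Map<Integer, Integer> bfsOutputs(int start, int maxDistance,
                                            Set<Category> visit, Set<Category> boundary) {
        Map<Integer, Integer> distance = new LinkedHashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        distance.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            int current = queue.remove();
            int d = distance.get(current);
            if (d == maxDistance) {
                continue; // do not expand past the distance limit
            }
            for (int out : OUTPUTS.get(current)) {
                if (distance.containsKey(out)) {
                    continue; // already reported
                }
                Category c = CATEGORY.get(out);
                if (visit.contains(c)) {
                    distance.put(out, d + 1);
                    queue.add(out);
                } else if (boundary.contains(c)) {
                    distance.put(out, d + 1); // shown, but its outputs are not followed
                }
            }
        }
        return distance;
    }

    public static void main(String[] args) {
        // Visit data nodes only, show control and data nodes on the boundary.
        System.out.println(bfsOutputs(102, 10, Set.of(Category.DATA),
                                      Set.of(Category.CONTROL, Category.DATA)));
        // prints {102=0, 133=1, 132=1, 136=2, 153=3, 167=3, 154=4}
    }
}
```

Node 163 never shows up because `167 SafePoint` is a control node: it is displayed as part of the boundary but its outputs are not followed.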
> > Example with Mach nodes: > > (rr) p find_node(280)->dump_bfs(2,0,"cdmxo+@B") > dist [block head idom depth] old dump > --------------------------------------------- > 2 B6 379 377 4 o118 38 sarI_rReg_CL === _ 39 40 [[ 41 36 31 31 71 75 66 82 86 103 116 148 152 161 119 119 184 186 170 281 268 ]] !jvms: String::length @ bci:9 (line 1487) ByteVector::putUTF8 @ bci:1 (line 285) > 2 B52 441 277 23 o738 283 incI_rReg === _ 285 [[ 284 285 281 ]] #1/0x00000001 !jvms: ByteVector::putUTF8 @ bci:131 (line 300) > 2 B50 277 439 22 o756 282 IfTrue === 273 [[ 441 ]] #1 !jvms: ByteVector::putUTF8 @ bci:100 (line 302) > 1 B52 441 277 23 o737 281 compI_rReg === _ 283 38 [[ 280 ]] > 1 B52 441 277 23 _ 441 Region === 441 282 [[ 441 280 290 ]] > 0 B52 441 277 23 o757 280 jmpLoopEnd === 441 281 [[ 279 347 ]] P=0.500000, C=21462.000000 !jvms: ByteVector::putUTF8 @ bci:79 (line 300) > > And the query on the old nodes: > > (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#") > dist dump > --------------------------------------------- > 2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]] > 2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]] > 2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000 > 1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]] > 1 o740 Bool === _ o739 [[ o741 ]] [lt] > 1 o179 IfTrue === o178 [[ o741 ]] #1 > 0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000 > > > **Exploring loop body** > When I find myself in a loop, I try to locate the loop head and end, and map out at least one path between them. > `loop_end->dump_bfs(20, loop_head, "c+")` > This provides us with a shortest control path, provided that path has a distance of at most 20.
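The shortest-path query described above behaves like a BFS that stops at the target and walks back over parent pointers. Below is a minimal Java sketch of that idea, not the code behind `dump_bfs`: the class `ShortestPathSketch`, the method `shortestPath` and the small control-input table are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

final class ShortestPathSketch {
    // Control inputs of a small hand-built loop (node idx -> input node idx).
    static final Map<Integer, List<Integer>> INPUTS = Map.of(
        741, List.of(179),
        179, List.of(178),
        178, List.of(149),
        149, List.of(148),
        148, List.of(746),
        746, List.of(746, 745, 190),
        190, List.of(741));

    // Returns one shortest path from start to target along input edges,
    // or an empty list if the target is farther away than maxDistance.
    static List<Integer> shortestPath(Map<Integer, List<Integer>> inputs,
                                      int start, int target, int maxDistance) {
        Map<Integer, Integer> distance = new HashMap<>();
        Map<Integer, Integer> parent = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        distance.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            int current = queue.remove();
            if (current == target) {
                // BFS dequeues the target first via a shortest path: walk back to start.
                LinkedList<Integer> path = new LinkedList<>();
                for (Integer n = target; n != null; n = parent.get(n)) {
                    path.addFirst(n);
                }
                return path;
            }
            int d = distance.get(current);
            if (d == maxDistance) {
                continue; // respect the distance limit
            }
            for (int in : inputs.getOrDefault(current, List.of())) {
                if (distance.putIfAbsent(in, d + 1) == null) {
                    parent.put(in, current);
                    queue.add(in);
                }
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        System.out.println(shortestPath(INPUTS, 741, 746, 20));
        // prints [741, 179, 178, 149, 148, 746]
    }
}
```

With a limit of 4 instead of 20 the sketch returns an empty list, because the loop head is 5 edges away from the loop end.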
> > Example (shortest path over control nodes): > > (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+") > dist dump > --------------------------------------------- > 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > > Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lay on a path between the start and target node, with at most the specified path length. 
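One way to picture the all-paths query is as two BFS runs: one from the start node along input edges, and one from the target node along the reversed edges. A node lies on some start-to-target path of at most the given length exactly when the two shortest distances add up to at most that length. The Java sketch below follows that idea; it is an assumption about how such a query could be computed, not the HotSpot code, and `AllPathsSketch`, `allPaths` and the edge table (copied by hand from the `StringLatin1::replace` loop) are illustrative names and data.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AllPathsSketch {
    // Input edges (node idx -> input node idx) of a hand-built loop body.
    static final Map<Integer, List<Integer>> INPUTS = Map.ofEntries(
        Map.entry(741, List.of(179, 740)), Map.entry(179, List.of(178)),
        Map.entry(740, List.of(739)),      Map.entry(178, List.of(149, 177)),
        Map.entry(739, List.of(186, 79)),  Map.entry(149, List.of(148)),
        Map.entry(177, List.of(176)),      Map.entry(186, List.of(141, 51)),
        Map.entry(148, List.of(746, 147)), Map.entry(176, List.of(166, 169)),
        Map.entry(141, List.of(746, 36, 186)),
        Map.entry(746, List.of(746, 745, 190)),
        Map.entry(147, List.of(146)),      Map.entry(166, List.of(149, 7, 164)),
        Map.entry(146, List.of(141, 79)),  Map.entry(190, List.of(741)));

    // Plain BFS: shortest distance from `from` to every reachable node.
    static Map<Integer, Integer> bfs(Map<Integer, List<Integer>> edges, int from) {
        Map<Integer, Integer> distance = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        distance.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            int current = queue.remove();
            for (int next : edges.getOrDefault(current, List.of())) {
                if (distance.putIfAbsent(next, distance.get(current) + 1) == null) {
                    queue.add(next);
                }
            }
        }
        return distance;
    }

    // Flips every edge, so a BFS from the target yields distances *to* the target.
    static Map<Integer, List<Integer>> reverse(Map<Integer, List<Integer>> edges) {
        Map<Integer, List<Integer>> reversed = new HashMap<>();
        edges.forEach((node, targets) -> {
            for (int t : targets) {
                reversed.computeIfAbsent(t, k -> new ArrayList<>()).add(node);
            }
        });
        return reversed;
    }

    // node idx -> summed distance (start..node + node..target), limited to maxDistance.
    static Map<Integer, Integer> allPaths(Map<Integer, List<Integer>> inputs,
                                          int start, int target, int maxDistance) {
        Map<Integer, Integer> fromStart = bfs(inputs, start);
        Map<Integer, Integer> toTarget = bfs(reverse(inputs), target);
        Map<Integer, Integer> result = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : fromStart.entrySet()) {
            Integer rest = toTarget.get(e.getKey());
            if (rest != null && e.getValue() + rest <= maxDistance) {
                result.put(e.getKey(), e.getValue() + rest);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(allPaths(INPUTS, 741, 746, 8).size()); // prints 15
        System.out.println(allPaths(INPUTS, 741, 746, 5).size()); // prints 10
    }
}
```

Nodes such as 79 or 51 are reachable from the start but have no path to the target in this table, so they are dropped, and `190 IfTrue` is dropped because its summed distance of 12 exceeds the limit.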
> > Example (all paths between two nodes): > > (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") > dist apd dump > --------------------------------------------- > 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) > 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) > 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) > 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) > 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) > 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all paths distance `apd` is the length of the shortest path from the current node to the start plus the length of the shortest path to the target node. Thus, we can easily compute the distance to the target node with `apd - d`. > > An alternative way to detect loops quickly is to run an all paths query from a node to itself: > > Example (loop detection with all paths): > > (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A") > dist apd dump > --------------------------------------------- > 6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303) > 5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > We get the loop control, plus the loop-back `190 IfTrue`. > > Example (loop detection with all paths for phi): > > (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A") > dist apd dump > --------------------------------------------- > 1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],...
!jvms: StringLatin1::replace @ bci:13 (line 303) > 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > > > **Color examples** > Colors are especially useful to see chains between nodes (options character `#`). > The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. > Tip: it can be worth configuring the colors of your terminal to be more appealing. > > Example (find control dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) > We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. > > Example (find memory dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) > > Example (loop detection): > ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) > We find the control and some data loop paths. This pull request has now been integrated. Changeset: 33ed0365 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/33ed0365c3ed182a9d063e1701fe69bfb72dfa2e Stats: 746 lines in 4 files changed: 714 ins; 0 del; 32 mod 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering Reviewed-by: kvn, chagedorn, thartmann, rcastanedalo ------------- PR: https://git.openjdk.org/jdk/pull/8468 From shade at openjdk.java.net Mon Jun 13 14:37:54 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 13 Jun 2022 14:37:54 GMT Subject: RFR: 8288303: C1: Miscompilation due to broken Class.getModifiers intrinsic Message-ID: Looks like another instance where complicated control flow in a C1 LIR intrinsic confuses the C1 regalloc into miscompiling.
Reliably reproduces on selected JFR tests in selected configurations, and I was unable to reproduce it in a smaller test. Additional testing: - [x] JFR reproducers now pass - [ ] Linux x86_64 fastdebug `tier1` - [ ] Linux x86_32 fastdebug `tier1` - [ ] Linux x86_64 fastdebug `tier2` - [ ] Linux x86_32 fastdebug `tier2` ------------- Commit messages: - Fix - Fully branchless version - Do stuff without any branches - Another fix - Fix Changes: https://git.openjdk.org/jdk19/pull/8/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk19&pr=8&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288303 Stats: 23 lines in 1 file changed: 12 ins; 5 del; 6 mod Patch: https://git.openjdk.org/jdk19/pull/8.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/8/head:pull/8 PR: https://git.openjdk.org/jdk19/pull/8 From aph at openjdk.java.net Mon Jun 13 14:38:31 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Mon, 13 Jun 2022 14:38:31 GMT Subject: Integrated: 8287926: AArch64: intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long In-Reply-To: References: Message-ID: On Thu, 9 Jun 2022 10:47:42 GMT, Andrew Haley wrote: > That's all. This pull request has now been integrated.
Changeset: 0207d761 Author: Andrew Haley URL: https://git.openjdk.org/jdk/commit/0207d761f45c85dbcdc509bbba9e73bbe5d19329 Stats: 66 lines in 1 file changed: 64 ins; 0 del; 2 mod 8287926: AArch64: intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long Reviewed-by: adinn, ngasson ------------- PR: https://git.openjdk.org/jdk/pull/9104 From iveresov at openjdk.java.net Mon Jun 13 14:41:11 2022 From: iveresov at openjdk.java.net (Igor Veresov) Date: Mon, 13 Jun 2022 14:41:11 GMT Subject: RFR: 8288303: C1: Miscompilation due to broken Class.getModifiers intrinsic In-Reply-To: References: Message-ID: On Mon, 13 Jun 2022 13:21:17 GMT, Aleksey Shipilev wrote: > Looks like another instance where complicated control flow in a C1 LIR intrinsic confuses the C1 regalloc into miscompiling. Reliably reproduces on selected JFR tests in selected configurations, and I was unable to reproduce it in a smaller test. > > Additional testing: > - [x] JFR reproducers now pass > - [ ] Linux x86_64 fastdebug `tier1` > - [ ] Linux x86_32 fastdebug `tier1` > - [ ] Linux x86_64 fastdebug `tier2` > - [ ] Linux x86_32 fastdebug `tier2` Marked as reviewed by iveresov (Reviewer). ------------- PR: https://git.openjdk.org/jdk19/pull/8 From kvn at openjdk.java.net Mon Jun 13 23:31:03 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 13 Jun 2022 23:31:03 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v12] In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 20:42:22 GMT, Xin Liu wrote: >> I found that some phi nodes are useful only because their uses are uncommon_trap nodes. In the worst case, this hinders boxing object elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows the c2 parser to collect liveness based on the next bci for unstable_if traps. In most cases, the next bci is closer to the exits, so the set of live locals shrinks. It helps to reduce the number of inputs of unstable_if traps.
>> >> This is not a REDO of Optimization of Box nodes in uncommon_trap (JDK-8261137). The two patches are orthogonal. I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of the credit goes to the original author. I found that scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in the original test even if it can't be removed by the later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. >> >> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got a 9x speedup on x86_64. It's almost the same as JDK-8261137. >> >> Before: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op >> >> After: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op >> >> Testing >> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. > > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > minor change for code style. Update looks good. You need a second review. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/8545 From fgao at openjdk.java.net Tue Jun 14 01:49:34 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Tue, 14 Jun 2022 01:49:34 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v9] In-Reply-To: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: <7lCOZoReMvWJnID_7hsmiVFqy1Xt05x5hmSaoLykzV0=.0bb39ca4-a1d7-4856-bb75-ea92ec8f5ea0@github.com> > After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size.
We can also support conversions between different data sizes like: > int <-> double > float <-> long > int <-> long > float <-> double > > A typical test case: > > int[] a; > double[] b; > for (int i = start; i < limit; i++) { > b[i] = (double) a[i]; > } > > Our expected OptoAssembly code for one iteration is like below: > > add R12, R2, R11, LShiftL #2 > vector_load V16,[R12, #16] > vectorcast_i2d V16, V16 # convert I to D vector > add R11, R1, R11, LShiftL #3 # ptr > add R13, R11, #16 # ptr > vector_store [R13], V16 > > To enable the vectorization, the patch solves the following problems in the SLP. > > There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, as we did before the patch in SuperWord::combine_packs(), a 128-bit vector will hold 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, the sequence becomes invalid. As a result, we should look through the whole def-use chain and then pick the minimum of these element counts, as the function SuperWord::max_vector_size_in_ud_chain() does in superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate a valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with the new type. > > After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. > > In SLP, we calculate a kind of alignment as a position trace for each scalar node in the whole vector. In this case, the alignments for the 2 LoadI nodes are 0 and 4, while the alignments for the 2 ConvI2D nodes are 0 and 8.
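The lane-count and alignment arithmetic above can be written down in a few lines. This Java sketch only restates that arithmetic; it is not the SuperWord code, and the class `LaneCountSketch` with its methods `lanes` and `laneIndex` are hypothetical names introduced for illustration.

```java
final class LaneCountSketch {
    // Elements per vector for a whole def-use chain: the widest element type
    // in the chain limits how many scalars can be packed together.
    static int lanes(int vectorBits, int... elementBits) {
        int widest = 0;
        for (int bits : elementBits) {
            widest = Math.max(widest, bits);
        }
        return vectorBits / widest;
    }

    // Position of a scalar inside its pack, independent of the element size.
    static int laneIndex(int alignmentBytes, int elementBytes) {
        return alignmentBytes / elementBytes;
    }

    public static void main(String[] args) {
        // LoadI (32 bit) -> ConvI2D (64 bit) -> StoreD (64 bit) in a 128-bit vector.
        System.out.println(lanes(128, 32, 64, 64)); // prints 2
        // Alignment 4 of a LoadI and alignment 8 of a ConvI2D name the same lane.
        System.out.println(laneIndex(4, 4));        // prints 1
        System.out.println(laneIndex(8, 8));        // prints 1
    }
}
```

Dividing the byte alignment by the element size is one way to compare positions across nodes with different data sizes; the patch itself is described as transforming the target alignment taken from the use node.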
Sometimes, an alignment of 4 for LoadI and an alignment of 8 for ConvI2D mean the same thing: both mark that this node is the second node in the whole vector, and the difference between 4 and 8 is only due to their own data sizes. In this situation, we should try to remove the impact caused by different data sizes in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining whether it is possible to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data sizes by transforming the target alignment from the use node. The reasoning is that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well. > > Similarly, when determining if the vectorization is profitable, type conversion between different data sizes takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use(). > > After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes. > > Here is the test data (-XX:+UseSuperWord) on NEON: > > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > convertD2F 523 avgt 15 216.431 ± 0.131 ns/op > convertD2I 523 avgt 15 220.522 ± 0.311 ns/op > convertF2D 523 avgt 15 217.034 ± 0.292 ns/op > convertF2L 523 avgt 15 231.634 ± 1.881 ns/op > convertI2D 523 avgt 15 229.538 ± 0.095 ns/op > convertI2L 523 avgt 15 214.822 ± 0.131 ns/op > convertL2F 523 avgt 15 230.188 ± 0.217 ns/op > convertL2I 523 avgt 15 162.234 ± 0.235 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > convertD2F 523 avgt 15 124.352 ± 1.079 ns/op > convertD2I 523 avgt 15 557.388 ± 8.166 ns/op > convertF2D 523 avgt 15 118.082 ± 4.026 ns/op > convertF2L 523 avgt 15 225.810 ±
11.180 ns/op > convertI2D 523 avgt 15 166.247 ± 0.120 ns/op > convertI2L 523 avgt 15 119.699 ± 2.925 ns/op > convertL2F 523 avgt 15 220.847 ± 0.053 ns/op > convertL2I 523 avgt 15 122.339 ± 2.738 ns/op > > perf data on X86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > convertD2F 523 avgt 15 279.466 ± 0.069 ns/op > convertD2I 523 avgt 15 551.009 ± 7.459 ns/op > convertF2D 523 avgt 15 276.066 ± 0.117 ns/op > convertF2L 523 avgt 15 545.108 ± 5.697 ns/op > convertI2D 523 avgt 15 745.303 ± 0.185 ns/op > convertI2L 523 avgt 15 260.878 ± 0.044 ns/op > convertL2F 523 avgt 15 502.016 ± 0.172 ns/op > convertL2I 523 avgt 15 261.654 ± 3.326 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > convertD2F 523 avgt 15 106.975 ± 0.045 ns/op > convertD2I 523 avgt 15 546.866 ± 9.287 ns/op > convertF2D 523 avgt 15 82.414 ± 0.340 ns/op > convertF2L 523 avgt 15 542.235 ± 2.785 ns/op > convertI2D 523 avgt 15 92.966 ± 1.400 ns/op > convertI2L 523 avgt 15 79.960 ± 0.528 ns/op > convertL2F 523 avgt 15 504.712 ± 4.794 ns/op > convertL2I 523 avgt 15 129.753 ± 0.094 ns/op > > perf data on AVX512: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > convertD2F 523 avgt 15 282.984 ± 4.022 ns/op > convertD2I 523 avgt 15 543.080 ± 3.873 ns/op > convertF2D 523 avgt 15 273.950 ± 0.131 ns/op > convertF2L 523 avgt 15 539.568 ± 2.747 ns/op > convertI2D 523 avgt 15 745.238 ± 0.069 ns/op > convertI2L 523 avgt 15 260.935 ± 0.169 ns/op > convertL2F 523 avgt 15 501.870 ± 0.359 ns/op > convertL2I 523 avgt 15 257.508 ± 0.174 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > convertD2F 523 avgt 15 76.687 ± 0.530 ns/op > convertD2I 523 avgt 15 545.408 ± 4.657 ns/op > convertF2D 523 avgt 15 273.935 ± 0.099 ns/op > convertF2L 523 avgt 15 540.534 ± 3.032 ns/op > convertI2D 523 avgt 15 745.234 ± 0.053 ns/op > convertI2L 523 avgt 15 260.865 ± 0.104 ns/op > convertL2F 523 avgt 15 63.834 ±
4.777 ns/op > convertL2I 523 avgt 15 48.183 ? 0.990 ns/op Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: - Add an IR framework testcase Change-Id: Ifbcc8d233aa27dfe93acef548c7e42721d86376e - Merge branch 'master' into fg8283091 Change-Id: I9525ae9310c3c493da29490d034cbb8f223e7f80 - Update to the latest JDK and fix the function name Change-Id: Ie1907f86e2df7051aa2ddb7e5b05a371e887d1bc - Merge branch 'master' into fg8283091 Change-Id: I3ef746178c07004cc34c22081a3044fb40e87702 - Add assertion line for opcode() and withdraw some common code as a function Change-Id: I7b5dbe60fec6979de454f347d074e6fc01126dfe - Merge branch 'master' into fg8283091 Change-Id: I42bec08da55e86fb1f049bb691138f3fcf6dbed6 - Implement an interface for auto-vectorization to consult supported match rules Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701 - Merge branch 'master' into fg8283091 Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd - Merge branch 'master' into fg8283091 Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f - Merge branch 'master' into fg8283091 Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 - ... 
and 3 more: https://git.openjdk.org/jdk/compare/f1143b1b...49e6f56e ------------- Changes: https://git.openjdk.org/jdk/pull/7806/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7806&range=08 Stats: 1379 lines in 23 files changed: 1320 ins; 13 del; 46 mod Patch: https://git.openjdk.org/jdk/pull/7806.diff Fetch: git fetch https://git.openjdk.org/jdk pull/7806/head:pull/7806 PR: https://git.openjdk.org/jdk/pull/7806 From fgao at openjdk.java.net Tue Jun 14 01:49:34 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Tue, 14 Jun 2022 01:49:34 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v8] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: <_px7mnvzqUBnQGTJ8h4edvONokBWMS4gsvulhodjZ0w=.bd63c8ac-0a8b-4c86-92ab-fa094e732208@github.com> On Thu, 9 Jun 2022 18:01:05 GMT, Vladimir Kozlov wrote: > Please consider adding IR framework test to make sure expected vector nodes are generated. Added an IR framework testcase and updated to the latest JDK. Thanks for your review @vnkozlov . ------------- PR: https://git.openjdk.org/jdk/pull/7806 From eliu at openjdk.java.net Tue Jun 14 03:42:49 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Tue, 14 Jun 2022 03:42:49 GMT Subject: Integrated: 8287028: AArch64: [vectorapi] Backend implementation of VectorMask.fromLong with SVE2 In-Reply-To: <9f4FuUVXKxeO6tC6so96ydn3nss81T7s0KvV03XlnCc=.75152f52-5b9f-4a84-bd36-0547899fa061@github.com> References: <9f4FuUVXKxeO6tC6so96ydn3nss81T7s0KvV03XlnCc=.75152f52-5b9f-4a84-bd36-0547899fa061@github.com> Message-ID: On Thu, 19 May 2022 14:08:05 GMT, Eric Liu wrote: > This patch implements AArch64 codegen for VectorLongToMask using the > SVE2 BitPerm feature. 
With this patch, the final code (generated on an > SVE vector reg size of 512-bit QEMU emulator) is shown as below: > > mov z17.b, #0 > mov v17.d[0], x13 > sunpklo z17.h, z17.b > sunpklo z17.s, z17.h > sunpklo z17.d, z17.s > mov z16.b, #1 > bdep z17.d, z17.d, z16.d > cmpne p0.b, p7/z, z17.b, #0 This pull request has now been integrated. Changeset: 86c9241c Author: Eric Liu Committer: Xiaohong Gong URL: https://git.openjdk.org/jdk/commit/86c9241cce50dfdaf1dcd2c218ecc8e5f5af3918 Stats: 133 lines in 8 files changed: 101 ins; 0 del; 32 mod 8287028: AArch64: [vectorapi] Backend implementation of VectorMask.fromLong with SVE2 Reviewed-by: xgong, ngasson ------------- PR: https://git.openjdk.org/jdk/pull/8789 From duke at openjdk.java.net Tue Jun 14 06:16:58 2022 From: duke at openjdk.java.net (Swati Sharma) Date: Tue, 14 Jun 2022 06:16:58 GMT Subject: Integrated: 8287525: Extend IR annotation with new options to test specific target feature. In-Reply-To: References: Message-ID: <6lwhUeYHCkLC7qH_VojeUAEoht3MYXD4G1uNiyNfKA4=.c13fd0c2-d110-48ef-b540-fcf5620b456c@github.com> On Thu, 2 Jun 2022 17:17:21 GMT, Swati Sharma wrote: > Hi All, > > Currently test invocations are guarded by @requires vm.cpu.feature tags which are specified as the part of test tag specifications. This results into generating multiple test cases if some test points in a test file needs to be guarded by a specific features while others should still be executed in absence of missing target feature. > > This is specially important for IR checks based validation since C2 IR nodes creation may heavily rely on existence of specific target feature. Also, test harness executes test points only if all the constraints specified in tag specifications are met, thus imposing an OR semantics b/w @requires tag based CPU features becomes tricky. 
>
> Patch extends the existing @IR annotation with the following two new options:
>
> - applyIfCPUFeatureAnd:
>   Accepts a list of feature pairs where each pair is composed of a target feature string followed by a true/false value, where a true value necessitates the existence of the target feature and vice versa. IR verification checks are enforced only if all the specified feature constraints are met.
> - applyIfCPUFeatureOr: Accepts similar arguments as the above option, but IR verification checks are enforced only when at least one of the specified feature constraints is met.
>
> Example usage:
> @IR(counts = {IRNode.ADD_VI, "> 0"}, applyIfCPUFeatureOr = {"avx512bw", "true", "avx512f", "true"})
> @IR(counts = {IRNode.ADD_VI, "> 0"}, applyIfCPUFeatureAnd = {"avx512bw", "true", "avx512f", "true"})
>
> Please review and share your feedback.
>
> Thanks,
> Swati

This pull request has now been integrated.

Changeset: 03dca565
Author:    Swati Sharma
Committer: Jatin Bhateja
URL:       https://git.openjdk.org/jdk/commit/03dca565cfcb3fb65a69ac6c59f062f1eeef87ac
Stats:     250 lines in 5 files changed: 237 ins; 0 del; 13 mod

8287525: Extend IR annotation with new options to test specific target feature.

Co-authored-by: Jatin Bhateja
Reviewed-by: chagedorn, kvn

-------------

PR: https://git.openjdk.org/jdk/pull/8999

From xgong at openjdk.java.net Tue Jun 14 06:19:54 2022
From: xgong at openjdk.java.net (Xiaohong Gong)
Date: Tue, 14 Jun 2022 06:19:54 GMT
Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v2]
In-Reply-To:
References:
Message-ID:

On Mon, 13 Jun 2022 18:11:49 GMT, Vladimir Kozlov wrote:

>> We have other vector nodes like `VectorMaskCmp`, `MaskAll` and `VectorLoadMask` that also need to append the mask here. Actually most masked vector nodes accept the mask input except for the load/store/gather/scatter.
And in future, we may extend this to other normal vector nodes whose vector length is full-size while not partial, since SVE always needs a predicate for most instructions. So the default patch will be used for most vector nodes. > > Add comment about that (default is used for most vector nodes) to avoid confusion in a future. OK, I will add the comment later. Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/9037 From thartmann at openjdk.java.net Tue Jun 14 06:28:41 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 14 Jun 2022 06:28:41 GMT Subject: RFR: 8288360: CI: ciInstanceKlass::implementor() is not consistent for well-known classes In-Reply-To: References: Message-ID: <6lRORqYtgy_b2ITrWsVd0RwTd3UwDl6SrOq0M0vNDUM=.a7238f96-0ab7-4b2f-b5eb-bdcc10b10e3a@github.com> On Mon, 13 Jun 2022 21:18:31 GMT, Vladimir Ivanov wrote: > ciInstanceKlass::implementor() doesn't cache the result for well-known interfaces (`is_shared() == true`). Due to concurrent class loading, compilers can observe a change in reported unique implementor (in the worst case: from having no implementors to having one, then to having many) thus introducing paradoxical situations during a compilation. > > What makes it very hard/impossible to trigger the bug is there's only a single well-known interface (`java.util.Iterable`) present as of now, which gets multiple implementors loaded early during startup. > > Testing: hs-tier1 - hs-tier2 Looks good to me. Since the bug is a P2 and we are in RDP 1, we either need to re-target the fix to JDK 19 or explicitly defer to JDK 20 (see https://openjdk.org/jeps/3). ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9147

From thartmann at openjdk.java.net Tue Jun 14 06:38:57 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Tue, 14 Jun 2022 06:38:57 GMT
Subject: RFR: 8284404: Too aggressive sweeping with Loom
In-Reply-To:
References:
Message-ID: <97FCXaW1O6h6Tp7nqNPprVP-NIA9cvAN6KhuvCGNrrA=.580d2914-b336-404e-a98b-22aa08723366@github.com>

On Tue, 24 May 2022 15:08:28 GMT, Erik Österlund wrote:

>>> @vnkozlov Is your concern that a user explicitly overrides the default to a value that ends up not being good? If so, I'm not sure why we would be in the business of preventing the user from shooting itself in the foot and guessing what the user really wanted here. Maybe I missed something.
>>
>> Yes, it was my concern which was unjustifiable because I missed that this code is guarded by `FLAG_IS_DEFAULT(SweeperThreshold)`. So you simply set `SweeperThreshold` to 5% (default is 0.5) which is fine.
>
>> > @vnkozlov Is your concern that a user explicitly overrides the default to a value that ends up not being good? If so, I'm not sure why we would be in the business of preventing the user from shooting itself in the foot and guessing what the user really wanted here. Maybe I missed something.
>>
>>
>>
>> Yes, it was my concern which was unjustifiable because I missed that this code is guarded by `FLAG_IS_DEFAULT(SweeperThreshold)`. So you simply set `SweeperThreshold` to 5% (default is 0.5) which is fine.
>
> Okay great - thanks for the review!

@fisk This PR needs to be re-submitted for the JDK 19 repository but given it was triaged as P4, it might be too late.
------------- PR: https://git.openjdk.org/jdk/pull/8673 From dlong at openjdk.java.net Tue Jun 14 06:57:42 2022 From: dlong at openjdk.java.net (Dean Long) Date: Tue, 14 Jun 2022 06:57:42 GMT Subject: RFR: 8288303: C1: Miscompilation due to broken Class.getModifiers intrinsic [v2] In-Reply-To: References: Message-ID: On Mon, 13 Jun 2022 17:14:19 GMT, Aleksey Shipilev wrote: >> Looks like another instance when compilicated control flow in C1 LIR intrinsic confuses the C1 regalloc into miscompiling. Reliably reproduces on selected JFR tests in selected configurations, and I was unable to reproduce it in smaller test. >> >> Additional testing: >> - [x] JFR reproducers now pass >> - [x] Linux x86_64 fastdebug `tier1` >> - [x] Linux x86_32 fastdebug `tier1` >> - [x] Linux x86_64 fastdebug `tier2` >> - [x] Linux x86_32 fastdebug `tier2` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Add test The change looks OK. However, it does seem useful for an intrinsic to be able to use "local" labels that won't confuse the register allocator. That seems better than using a less-efficient cmove or having to write the intrinsic at the LIRAssembler level. ------------- Marked as reviewed by dlong (Reviewer). PR: https://git.openjdk.org/jdk19/pull/8 From roland at openjdk.java.net Tue Jun 14 07:02:49 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 14 Jun 2022 07:02:49 GMT Subject: RFR: 8288022: c2: Transform (CastLL (AddL into (AddL (CastLL when possible In-Reply-To: References: Message-ID: <6Um8YN4T6bSQpHawKb8Rc1dA_URc2OBWRoCnee7pOyU=.d2e167a1-c66e-4085-95fb-b90a1ad6a273@github.com> On Mon, 13 Jun 2022 08:26:47 GMT, Roland Westrelin wrote: > This implements a transformation that already exists for CastII and > ConvI2L and helps code generation. The tricky part is that: > > (CastII (AddI into (AddI (CastII > > is performed by first computing the bounds of the type of the AddI. 
To > protect against overflow, jlong variables are used. With CastLL/AddL > nodes there's no larger integer type to promote the bounds to. As a > consequence the logic in the patch explicitly tests for overflow. That > logic is shared by the int and long cases. The previous logic for the > int cases that promotes values to long is used as verification. > > This patch also widens the type of CastLL nodes after loop opts the > way it's done for CastII/ConvI2L to allow commoning of nodes. > > This was observed to help with Memory Segment micro benchmarks. Commenting because RFR email didn't go out ------------- PR: https://git.openjdk.org/jdk/pull/9139 From thartmann at openjdk.java.net Tue Jun 14 07:07:46 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 14 Jun 2022 07:07:46 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump [v3] In-Reply-To: References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: On Mon, 13 Jun 2022 11:42:49 GMT, Emanuel Peter wrote: >> **Goal** >> Refactor `Node::find`, allow not just searching for `node->_idx`, but also matching for `node->Name()` and even `node->dump()`. >> >> **Proposal** >> Refactor `Node::find` into `visit_nodes`, which visits all nodes and calls a `callback` on them. This callback can be used to filter by `idx` (`find_node_by_idx`, `Node::find`, `find_node` etc.). It can also be used to match node names (`find_node_by_name`) and even node dump (`find_node_by_dump`). >> >> Thus, I present these additional functions: >> `Node* find_node_by_name(const char* name)`: find all nodes matching the `name` pattern. >> `Node* find_node_by_dump(const char* pattern)`: find all nodes matching the `pattern`. >> The nodes are sorted by node idx, and then dumped. >> >> Patterns can contain `*` characters to match any characters (eg. `Con*L` matches both `ConL` and `ConvI2L`) >> >> **Usecase** >> Find all `CastII` nodes. 
Find all `Loop` nodes. Use `find_node_by_name`. >> >> Find all all `CastII` nodes that depend on a rangecheck. Use `find_node_by_dump("CastII*range check dependency")`. >> Find all `Bool` nodes that perform a `[ne]` check. Use `find_node_by_dump("Bool*[ne]")`. >> Find all `Phi` nodes that are `tripcount`. Use `find_node_by_dump("Phi*tripcount")`. >> >> Find all `Load` nodes that are associated with line 301 in some file. Use `find_node_by_dump("Load*line 301")`. >> >> You can probably find more usecases yourself ;) > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > fix header issues Very nice! src/hotspot/share/opto/node.cpp line 1789: > 1787: // not found or if the node to be found is not a control node (search will not find it). > 1788: Node* Node::find(const int idx, bool only_ctrl) { > 1789: ResourceMark rm; Do we need a `ResourceMark` in the new find methods as well? `UniqueMixedNodeList::_worklist` is a `ResourceObj`. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/8988 From epeter at openjdk.java.net Tue Jun 14 07:19:50 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Tue, 14 Jun 2022 07:19:50 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump [v3] In-Reply-To: References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: On Mon, 13 Jun 2022 11:24:50 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/node.cpp line 1712: >> >>> 1710: strncpy(buf, r, r_part_len); >>> 1711: buf[r_part_len] = '\0'; // end of string >>> 1712: r_part = &buf[0]; // cast array to char* >> >> Suggestion: >> >> r_part = buf; > > checking if I can do that, remember I had some issues with a platform. Tests passed fine, so I will do as you suggested. Thanks for the feedback. 
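
For readers of this thread, the `*` patterns described for `find_node_by_name` and `find_node_by_dump` can be illustrated with a minimal Java sketch. It is not the HotSpot implementation (which is C++ in `node.cpp`); the class and method names are hypothetical, and it assumes that each part between `*` characters must occur in order as a case-insensitive substring, which is inferred from the examples and commit notes in this thread.

```java
import java.util.Locale;

// Hypothetical sketch of wildcard matching in the style described for
// find_node_by_name / find_node_by_dump. Assumption: a pattern matches when
// its '*'-separated parts appear in order, ignoring case.
public class NodePatternSketch {

    static boolean matches(String text, String pattern) {
        String haystack = text.toLowerCase(Locale.ROOT);
        int from = 0; // next search position in the haystack
        for (String part : pattern.toLowerCase(Locale.ROOT).split("\\*")) {
            if (part.isEmpty()) {
                continue; // consecutive or leading '*' contribute nothing
            }
            int at = haystack.indexOf(part, from);
            if (at < 0) {
                return false; // this part does not occur after the previous one
            }
            from = at + part.length();
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("ConL", "Con*L"));    // true
        System.out.println(matches("ConvI2L", "Con*L")); // true
        System.out.println(matches("AddI", "Con*L"));    // false
    }
}
```

Under this assumption the pattern is unanchored, so it would also accept names that merely contain the parts in order; whether the real matcher behaves the same way is not stated in the thread.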
-------------

PR: https://git.openjdk.org/jdk/pull/8988

From rcastanedalo at openjdk.java.net Tue Jun 14 07:51:06 2022
From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano)
Date: Tue, 14 Jun 2022 07:51:06 GMT
Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump [v3]
In-Reply-To:
References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com>
Message-ID:

On Mon, 13 Jun 2022 11:42:49 GMT, Emanuel Peter wrote:

>> **Goal**
>> Refactor `Node::find`, allow not just searching for `node->_idx`, but also matching for `node->Name()` and even `node->dump()`.
>>
>> **Proposal**
>> Refactor `Node::find` into `visit_nodes`, which visits all nodes and calls a `callback` on them. This callback can be used to filter by `idx` (`find_node_by_idx`, `Node::find`, `find_node` etc.). It can also be used to match node names (`find_node_by_name`) and even node dump (`find_node_by_dump`).
>>
>> Thus, I present these additional functions:
>> `Node* find_node_by_name(const char* name)`: find all nodes matching the `name` pattern.
>> `Node* find_node_by_dump(const char* pattern)`: find all nodes matching the `pattern`.
>> The nodes are sorted by node idx, and then dumped.
>>
>> Patterns can contain `*` characters to match any characters (e.g. `Con*L` matches both `ConL` and `ConvI2L`)
>>
>> **Usecase**
>> Find all `CastII` nodes. Find all `Loop` nodes. Use `find_node_by_name`.
>>
>> Find all `CastII` nodes that depend on a rangecheck. Use `find_node_by_dump("CastII*range check dependency")`.
>> Find all `Bool` nodes that perform a `[ne]` check. Use `find_node_by_dump("Bool*[ne]")`.
>> Find all `Phi` nodes that are `tripcount`. Use `find_node_by_dump("Phi*tripcount")`.
>>
>> Find all `Load` nodes that are associated with line 301 in some file. Use `find_node_by_dump("Load*line 301")`.
>> >> You can probably find more usecases yourself ;) > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > fix header issues Thanks for addressing my suggestion, Emanuel! Please merge the latest changes from master and re-test: after the integration of JDK-8283775, `node_idx_cmp` is already defined in `node.cpp`, leading to a build failure when this patch is applied. ------------- Changes requested by rcastanedalo (Committer). PR: https://git.openjdk.org/jdk/pull/8988 From epeter at openjdk.java.net Tue Jun 14 07:59:41 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Tue, 14 Jun 2022 07:59:41 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump [v4] In-Reply-To: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: > **Goal** > Refactor `Node::find`, allow not just searching for `node->_idx`, but also matching for `node->Name()` and even `node->dump()`. > > **Proposal** > Refactor `Node::find` into `visit_nodes`, which visits all nodes and calls a `callback` on them. This callback can be used to filter by `idx` (`find_node_by_idx`, `Node::find`, `find_node` etc.). It can also be used to match node names (`find_node_by_name`) and even node dump (`find_node_by_dump`). > > Thus, I present these additional functions: > `Node* find_node_by_name(const char* name)`: find all nodes matching the `name` pattern. > `Node* find_node_by_dump(const char* pattern)`: find all nodes matching the `pattern`. > The nodes are sorted by node idx, and then dumped. > > Patterns can contain `*` characters to match any characters (eg. `Con*L` matches both `ConL` and `ConvI2L`) > > **Usecase** > Find all `CastII` nodes. Find all `Loop` nodes. Use `find_node_by_name`. > > Find all all `CastII` nodes that depend on a rangecheck. 
Use `find_node_by_dump("CastII*range check dependency")`. > Find all `Bool` nodes that perform a `[ne]` check. Use `find_node_by_dump("Bool*[ne]")`. > Find all `Phi` nodes that are `tripcount`. Use `find_node_by_dump("Phi*tripcount")`. > > Find all `Load` nodes that are associated with line 301 in some file. Use `find_node_by_dump("Load*line 301")`. > > You can probably find more usecases yourself ;) Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: adding resource marks, moving Unique_Mixed_Node_List to node.hpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/8988/files - new: https://git.openjdk.org/jdk/pull/8988/files/425515f3..52dd9917 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8988&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8988&range=02-03 Stats: 64 lines in 2 files changed: 33 ins; 30 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/8988.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8988/head:pull/8988 PR: https://git.openjdk.org/jdk/pull/8988 From epeter at openjdk.java.net Tue Jun 14 08:16:36 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Tue, 14 Jun 2022 08:16:36 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump [v5] In-Reply-To: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: > **Goal** > Refactor `Node::find`, allow not just searching for `node->_idx`, but also matching for `node->Name()` and even `node->dump()`. > > **Proposal** > Refactor `Node::find` into `visit_nodes`, which visits all nodes and calls a `callback` on them. This callback can be used to filter by `idx` (`find_node_by_idx`, `Node::find`, `find_node` etc.). 
It can also be used to match node names (`find_node_by_name`) and even node dump (`find_node_by_dump`). > > Thus, I present these additional functions: > `Node* find_node_by_name(const char* name)`: find all nodes matching the `name` pattern. > `Node* find_node_by_dump(const char* pattern)`: find all nodes matching the `pattern`. > The nodes are sorted by node idx, and then dumped. > > Patterns can contain `*` characters to match any characters (eg. `Con*L` matches both `ConL` and `ConvI2L`) > > **Usecase** > Find all `CastII` nodes. Find all `Loop` nodes. Use `find_node_by_name`. > > Find all all `CastII` nodes that depend on a rangecheck. Use `find_node_by_dump("CastII*range check dependency")`. > Find all `Bool` nodes that perform a `[ne]` check. Use `find_node_by_dump("Bool*[ne]")`. > Find all `Phi` nodes that are `tripcount`. Use `find_node_by_dump("Phi*tripcount")`. > > Find all `Load` nodes that are associated with line 301 in some file. Use `find_node_by_dump("Load*line 301")`. > > You can probably find more usecases yourself ;) Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 14 additional commits since the last revision: - remove duplicate node_idx_cmp after merge - Merge branch 'master' into JDK-8287647 - adding resource marks, moving Unique_Mixed_Node_List to node.hpp - fix header issues - changes responding to review by Christian and Roberto - style fixes, and implemented case insensitive strstr - make matching case insensitive - missing include - ensure null termination - guard against long pattern, and fix array/pointer issues - ... 
and 4 more: https://git.openjdk.org/jdk/compare/b1999884...653a7cc3 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/8988/files - new: https://git.openjdk.org/jdk/pull/8988/files/52dd9917..653a7cc3 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8988&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8988&range=03-04 Stats: 40004 lines in 1097 files changed: 30462 ins; 5792 del; 3750 mod Patch: https://git.openjdk.org/jdk/pull/8988.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8988/head:pull/8988 PR: https://git.openjdk.org/jdk/pull/8988 From epeter at openjdk.java.net Tue Jun 14 08:16:37 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Tue, 14 Jun 2022 08:16:37 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump [v3] In-Reply-To: References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: On Tue, 14 Jun 2022 06:44:02 GMT, Tobias Hartmann wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> fix header issues > > src/hotspot/share/opto/node.cpp line 1789: > >> 1787: // not found or if the node to be found is not a control node (search will not find it). >> 1788: Node* Node::find(const int idx, bool only_ctrl) { >> 1789: ResourceMark rm; > > Do we need a `ResourceMark` in the new find methods as well? `UniqueMixedNodeList::_worklist` is a `ResourceObj`. thanks for the suggestion, did that. ------------- PR: https://git.openjdk.org/jdk/pull/8988 From xgong at openjdk.java.net Tue Jun 14 08:59:38 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Tue, 14 Jun 2022 08:59:38 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3] In-Reply-To: References: Message-ID: > VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. 
We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. > > For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. > > Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. > > Here is an example for vector load and add reduction inside a loop: > > ptrue p0.s, vl8 ; mask generation > ld1w {z16.s}, p0/z, [x14] ; load vector > > ptrue p0.s, vl8 ; mask generation > uaddv d17, p0, z16.s ; add reduction > smov x14, v17.s[0] > > As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. > > Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. 
>
> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system:
>
>   Benchmark                  size   Gain
>   Byte256Vector.ADDLanes     1024   0.999
>   Byte256Vector.ANDLanes     1024   1.065
>   Byte256Vector.MAXLanes     1024   1.064
>   Byte256Vector.MINLanes     1024   1.062
>   Byte256Vector.ORLanes      1024   1.072
>   Byte256Vector.XORLanes     1024   1.041
>   Short256Vector.ADDLanes    1024   1.017
>   Short256Vector.ANDLanes    1024   1.044
>   Short256Vector.MAXLanes    1024   1.049
>   Short256Vector.MINLanes    1024   1.049
>   Short256Vector.ORLanes     1024   1.089
>   Short256Vector.XORLanes    1024   1.047
>   Int256Vector.ADDLanes      1024   1.045
>   Int256Vector.ANDLanes      1024   1.078
>   Int256Vector.MAXLanes      1024   1.123
>   Int256Vector.MINLanes      1024   1.129
>   Int256Vector.ORLanes       1024   1.078
>   Int256Vector.XORLanes      1024   1.072
>   Long256Vector.ADDLanes     1024   1.059
>   Long256Vector.ANDLanes     1024   1.101
>   Long256Vector.MAXLanes     1024   1.079
>   Long256Vector.MINLanes     1024   1.099
>   Long256Vector.ORLanes      1024   1.098
>   Long256Vector.XORLanes     1024   1.110
>   Float256Vector.ADDLanes    1024   1.033
>   Float256Vector.MAXLanes    1024   1.156
>   Float256Vector.MINLanes    1024   1.151
>   Double256Vector.ADDLanes   1024   1.062
>   Double256Vector.MAXLanes   1024   1.145
>   Double256Vector.MINLanes   1024   1.140
>
> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below:
>
>   sxtw    x14, w14
>   whilelo p0.s, xzr, x14  =>  whilelo p0.s, wzr, w14

Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains four commits: - Address review comments, revert changes for gatherL/scatterL rules - Merge branch 'jdk:master' into JDK-8286941 - Revert transformation from MaskAll to VectorMaskGen, address review comments - 8286941: Add mask IR for partial vector operations for ARM SVE ------------- Changes: https://git.openjdk.org/jdk/pull/9037/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=9037&range=02 Stats: 2030 lines in 19 files changed: 784 ins; 826 del; 420 mod Patch: https://git.openjdk.org/jdk/pull/9037.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9037/head:pull/9037 PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.java.net Tue Jun 14 08:59:39 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Tue, 14 Jun 2022 08:59:39 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v2] In-Reply-To: References: Message-ID: <4aopunrT25G234SOCYLVv8SKFbYveRq-Z57Uh0iaguc=.8e6b9357-fd90-46bd-881b-3b1384a5ade2@github.com> On Mon, 13 Jun 2022 18:16:42 GMT, Vladimir Kozlov wrote: > Thank you for addressing my comments. You need review from Jatin and someone familiar with aarch64 code (I did not look on it). Sure, thanks a lot for the review! ------------- PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.java.net Tue Jun 14 08:59:39 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Tue, 14 Jun 2022 08:59:39 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v2] In-Reply-To: References: Message-ID: On Mon, 13 Jun 2022 01:53:32 GMT, Xiaohong Gong wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. 
For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. >> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. >> >> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. 
>> >> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: >> >> Benchmark size Gain >> Byte256Vector.ADDLanes 1024 0.999 >> Byte256Vector.ANDLanes 1024 1.065 >> Byte256Vector.MAXLanes 1024 1.064 >> Byte256Vector.MINLanes 1024 1.062 >> Byte256Vector.ORLanes 1024 1.072 >> Byte256Vector.XORLanes 1024 1.041 >> Short256Vector.ADDLanes 1024 1.017 >> Short256Vector.ANDLanes 1024 1.044 >> Short256Vector.MAXLanes 1024 1.049 >> Short256Vector.MINLanes 1024 1.049 >> Short256Vector.ORLanes 1024 1.089 >> Short256Vector.XORLanes 1024 1.047 >> Int256Vector.ADDLanes 1024 1.045 >> Int256Vector.ANDLanes 1024 1.078 >> Int256Vector.MAXLanes 1024 1.123 >> Int256Vector.MINLanes 1024 1.129 >> Int256Vector.ORLanes 1024 1.078 >> Int256Vector.XORLanes 1024 1.072 >> Long256Vector.ADDLanes 1024 1.059 >> Long256Vector.ANDLanes 1024 1.101 >> Long256Vector.MAXLanes 1024 1.079 >> Long256Vector.MINLanes 1024 1.099 >> Long256Vector.ORLanes 1024 1.098 >> Long256Vector.XORLanes 1024 1.110 >> Float256Vector.ADDLanes 1024 1.033 >> Float256Vector.MAXLanes 1024 1.156 >> Float256Vector.MINLanes 1024 1.151 >> Double256Vector.ADDLanes 1024 1.062 >> Double256Vector.MAXLanes 1024 1.145 >> Double256Vector.MINLanes 1024 1.140 >> >> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: >> >> sxtw x14, w14 >> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Revert transformation from MaskAll to VectorMaskGen, address review comments Hi @jatin-bhateja, could you please help to take a look at this PR, especially the changes to the `LoadVectorMasked`? Any feedback is welcome, thanks! 
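
As an aside for readers, the idea of a partial vector operation controlled by a mask, as described in the quoted PR text, can be modeled in plain Java. This is only a sketch under stated assumptions: it uses arrays instead of the Vector API or SVE registers, the class and method names are hypothetical, and the "garbage" value in the upper lanes is invented for illustration.

```java
// Hypothetical scalar model of a partial vector operation on 512-bit hardware.
// Assumption: a 256-bit int species uses only the low 8 of 16 lanes, and the
// upper lanes may hold unrelated data that must not affect the result.
public class PartialVectorMaskSketch {

    static final int HW_LANES = 512 / 32; // 16 int lanes per hardware register

    // Models mask generation (compare `ptrue p0.s, vl8`): low lanes active.
    static boolean[] maskGen(int activeLanes) {
        boolean[] mask = new boolean[HW_LANES];
        for (int i = 0; i < activeLanes && i < HW_LANES; i++) {
            mask[i] = true;
        }
        return mask;
    }

    // Models a predicated add reduction: inactive lanes are ignored.
    static int maskedAddReduce(int[] reg, boolean[] mask) {
        int sum = 0;
        for (int i = 0; i < HW_LANES; i++) {
            if (mask[i]) {
                sum += reg[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int speciesLanes = 256 / 32; // 8 lanes for the 256-bit species
        int[] reg = new int[HW_LANES];
        for (int i = 0; i < HW_LANES; i++) {
            reg[i] = (i < speciesLanes) ? i + 1 : 1000; // upper lanes: stale data
        }
        // The mask depends only on the species, so it can be built once and
        // reused by every partial operation, which is the point of the patch.
        boolean[] mask = maskGen(speciesLanes);
        System.out.println(maskedAddReduce(reg, mask));               // 36
        System.out.println(maskedAddReduce(reg, maskGen(HW_LANES)));  // 8036
    }
}
```

The second call shows what an unmasked full-width reduction would return, which is why the partial operation needs the mask for correctness.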
------------- PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.java.net Tue Jun 14 08:59:39 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Tue, 14 Jun 2022 08:59:39 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3] In-Reply-To: References: Message-ID: On Tue, 14 Jun 2022 06:16:54 GMT, Xiaohong Gong wrote: >> Add comment about that (default is used for most vector nodes) to avoid confusion in a future. > > OK, I will add the comment later. Thanks! Added the comment as suggested. Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/9037 From shade at openjdk.java.net Tue Jun 14 09:49:51 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 14 Jun 2022 09:49:51 GMT Subject: RFR: 8288303: C1: Miscompilation due to broken Class.getModifiers intrinsic [v2] In-Reply-To: References: Message-ID: On Tue, 14 Jun 2022 06:54:37 GMT, Dean Long wrote: > The change looks OK. However, it does seem useful for an intrinsic to be able to use "local" labels that won't confuse the register allocator. That seems better than using a less-efficient cmove or having to write the intrinsic at the LIRAssembler level. Yeah, this is not the first time this happens. So I submitted [JDK-8288317](https://bugs.openjdk.org/browse/JDK-8288317) yesterday hoping for better diagnostics, at least. Maybe local labels would be nice to have, if they are implementable. ------------- PR: https://git.openjdk.org/jdk19/pull/8 From shade at openjdk.java.net Tue Jun 14 14:39:12 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 14 Jun 2022 14:39:12 GMT Subject: RFR: 8288303: C1: Miscompilation due to broken Class.getModifiers intrinsic [v2] In-Reply-To: References: Message-ID: On Mon, 13 Jun 2022 17:14:19 GMT, Aleksey Shipilev wrote: >> Looks like another instance when compilicated control flow in C1 LIR intrinsic confuses the C1 regalloc into miscompiling. 
>> Reliably reproduces on selected JFR tests in selected configurations, and I was unable to reproduce it in a smaller test.
>>
>> Additional testing:
>>  - [x] JFR reproducers now pass
>>  - [x] Linux x86_64 fastdebug `tier1`
>>  - [x] Linux x86_32 fastdebug `tier1`
>>  - [x] Linux x86_64 fastdebug `tier2`
>>  - [x] Linux x86_32 fastdebug `tier2`
>
> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision:
>
>   Add test

Thanks for reviews!

-------------

PR: https://git.openjdk.org/jdk19/pull/8

From shade at openjdk.java.net Tue Jun 14 14:39:14 2022
From: shade at openjdk.java.net (Aleksey Shipilev)
Date: Tue, 14 Jun 2022 14:39:14 GMT
Subject: Integrated: 8288303: C1: Miscompilation due to broken Class.getModifiers intrinsic
In-Reply-To:
References:
Message-ID:

On Mon, 13 Jun 2022 13:21:17 GMT, Aleksey Shipilev wrote:

> Looks like another instance when complicated control flow in C1 LIR intrinsic confuses the C1 regalloc into miscompiling. Reliably reproduces on selected JFR tests in selected configurations, and I was unable to reproduce it in a smaller test.
>
> Additional testing:
>  - [x] JFR reproducers now pass
>  - [x] Linux x86_64 fastdebug `tier1`
>  - [x] Linux x86_32 fastdebug `tier1`
>  - [x] Linux x86_64 fastdebug `tier2`
>  - [x] Linux x86_32 fastdebug `tier2`

This pull request has now been integrated.
Changeset: 8cd87e73 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk19/commit/8cd87e731bcaff2d7838995c68056742d577ad3b Stats: 141 lines in 2 files changed: 130 ins; 5 del; 6 mod 8288303: C1: Miscompilation due to broken Class.getModifiers intrinsic Reviewed-by: iveresov, dlong ------------- PR: https://git.openjdk.org/jdk19/pull/8 From vlivanov at openjdk.java.net Tue Jun 14 17:56:55 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Tue, 14 Jun 2022 17:56:55 GMT Subject: RFR: 8288360: CI: ciInstanceKlass::implementor() is not consistent for well-known classes Message-ID: ciInstanceKlass::implementor() doesn't cache the result for well-known interfaces (is_shared() == true). Due to concurrent class loading, compilers can observe a change in reported unique implementor (in the worst case: from having no implementors to having one, then to having many) thus introducing paradoxical situations during a compilation. What makes it very hard/impossible to trigger the bug is there's only a single well-known interface (java.util.Iterable) present as of now, which gets multiple implementors loaded early during startup. 
Testing: hs-tier1 - hs-tier2 ------------- Commit messages: - 8288360: CI: ciInstanceKlass::implementor() is not consistent for well-known classes Changes: https://git.openjdk.org/jdk19/pull/15/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=15&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288360 Stats: 7 lines in 1 file changed: 2 ins; 2 del; 3 mod Patch: https://git.openjdk.org/jdk19/pull/15.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/15/head:pull/15 PR: https://git.openjdk.org/jdk19/pull/15 From vlivanov at openjdk.java.net Tue Jun 14 17:57:31 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Tue, 14 Jun 2022 17:57:31 GMT Subject: RFR: 8288360: CI: ciInstanceKlass::implementor() is not consistent for well-known classes In-Reply-To: References: Message-ID: On Mon, 13 Jun 2022 21:18:31 GMT, Vladimir Ivanov wrote: > ciInstanceKlass::implementor() doesn't cache the result for well-known interfaces (`is_shared() == true`). Due to concurrent class loading, compilers can observe a change in reported unique implementor (in the worst case: from having no implementors to having one, then to having many) thus introducing paradoxical situations during a compilation. > > What makes it very hard/impossible to trigger the bug is there's only a single well-known interface (`java.util.Iterable`) present as of now, which gets multiple implementors loaded early during startup. > > Testing: hs-tier1 - hs-tier2 Thanks for the review, Tobias. I've retargeted the fix to 19 (openjdk/jdk19#15). 
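To make the consistency problem concrete, here is a minimal Java sketch of the difference between re-reading a value that concurrent class loading may change and taking one snapshot that stays stable for the rest of a compilation. It is an assumption-laden illustration, not the actual patch: `CachedImplementorSketch`, `liveLookup`, `implementorUncached` and `implementor` are hypothetical names, and a `Supplier<String>` stands in for querying the VM.

```java
import java.util.function.Supplier;

final class CachedImplementorSketch {
    // Hypothetical stand-in for asking the VM about the current unique implementor.
    private final Supplier<String> liveLookup;
    private String cached;
    private boolean resolved;

    CachedImplementorSketch(Supplier<String> liveLookup) {
        this.liveLookup = liveLookup;
    }

    // Uncached query: two calls can disagree if a class is loaded in between.
    String implementorUncached() {
        return liveLookup.get();
    }

    // Cached query: the first answer is remembered, so every later call in the
    // same compilation sees the same value, even if the live state has changed.
    String implementor() {
        if (!resolved) {
            cached = liveLookup.get();
            resolved = true;
        }
        return cached;
    }
}
```

The cached answer may be stale with respect to the running VM, but it cannot contradict an answer given earlier in the same compilation, which is the property the description above is concerned with.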
------------- PR: https://git.openjdk.org/jdk/pull/9147 From vlivanov at openjdk.java.net Tue Jun 14 17:57:34 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Tue, 14 Jun 2022 17:57:34 GMT Subject: Withdrawn: 8288360: CI: ciInstanceKlass::implementor() is not consistent for well-known classes In-Reply-To: References: Message-ID: On Mon, 13 Jun 2022 21:18:31 GMT, Vladimir Ivanov wrote: > ciInstanceKlass::implementor() doesn't cache the result for well-known interfaces (`is_shared() == true`). Due to concurrent class loading, compilers can observe a change in reported unique implementor (in the worst case: from having no implementors to having one, then to having many) thus introducing paradoxical situations during a compilation. > > What makes it very hard/impossible to trigger the bug is there's only a single well-known interface (`java.util.Iterable`) present as of now, which gets multiple implementors loaded early during startup. > > Testing: hs-tier1 - hs-tier2 This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/9147 From thartmann at openjdk.java.net Tue Jun 14 18:16:03 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 14 Jun 2022 18:16:03 GMT Subject: RFR: 8288360: CI: ciInstanceKlass::implementor() is not consistent for well-known classes In-Reply-To: References: Message-ID: On Tue, 14 Jun 2022 17:38:25 GMT, Vladimir Ivanov wrote: > ciInstanceKlass::implementor() doesn't cache the result for well-known interfaces (is_shared() == true). Due to concurrent class loading, compilers can observe a change in reported unique implementor (in the worst case: from having no implementors to having one, then to having many) thus introducing paradoxical situations during a compilation. 
> > What makes it very hard/impossible to trigger the bug is there's only a single well-known interface (java.util.Iterable) present as of now, which gets multiple implementors loaded early during startup. > > Testing: hs-tier1 - hs-tier2 Looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk19/pull/15 From kvn at openjdk.java.net Tue Jun 14 22:29:47 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 14 Jun 2022 22:29:47 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v9] In-Reply-To: <7lCOZoReMvWJnID_7hsmiVFqy1Xt05x5hmSaoLykzV0=.0bb39ca4-a1d7-4856-bb75-ea92ec8f5ea0@github.com> References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <7lCOZoReMvWJnID_7hsmiVFqy1Xt05x5hmSaoLykzV0=.0bb39ca4-a1d7-4856-bb75-ea92ec8f5ea0@github.com> Message-ID: On Tue, 14 Jun 2022 01:49:34 GMT, Fei Gao wrote: >> After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like: >> int <-> double >> float <-> long >> int <-> long >> float <-> double >> >> A typical test case: >> >> int[] a; >> double[] b; >> for (int i = start; i < limit; i++) { >> b[i] = (double) a[i]; >> } >> >> Our expected OptoAssembly code for one iteration is like below: >> >> add R12, R2, R11, LShiftL #2 >> vector_load V16,[R12, #16] >> vectorcast_i2d V16, V16 # convert I to D vector >> add R11, R1, R11, LShiftL #3 # ptr >> add R13, R11, #16 # ptr >> vector_store [R13], V16 >> >> To enable the vectorization, the patch solves the following problems in the SLP. >> >> There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? 
>> If we decide it separately for each operation node, as we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain and then pick the minimum of these element counts, as the function SuperWord::max_vector_size_in_ud_chain() does in superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate a valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with the new type.
>>
>> After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation.
>>
>> In SLP, we calculate a kind of alignment as a position trace for each scalar node in the whole vector. In this case, the alignments for the 2 LoadI nodes are 0, 4 while the alignments for the 2 ConvI2D nodes are 0, 8. Sometimes, 4 for LoadI and 8 for ConvI2D work the same: both mark that this node is the second node in the whole vector, and the difference between 4 and 8 is only due to their own data sizes. In this situation, we should try to remove the impact caused by different data sizes in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining whether a pair of def nodes can be packed in the function SuperWord::follow_use_defs(), we remove the side effect of different data sizes by transforming the target alignment from the use node. We believe that, assuming the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well.
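The arithmetic in the paragraphs above can be checked with a small Java sketch. It is only a model under stated assumptions: `SlpPackSketch`, `lanes`, `packSize`, `laneIndex` and `defAlignment` are hypothetical helper names and do not correspond to functions in superword.cpp. It shows how the pack size falls out of the widest element in the def-use chain, and how an alignment expressed in bytes is translated between element sizes.

```java
final class SlpPackSketch {
    // Number of elements of the given size that fit into one vector.
    static int lanes(int vectorBytes, int elemBytes) {
        return vectorBytes / elemBytes;
    }

    // Pack size for a def-use chain: the smallest lane count over all element
    // sizes that occur in the chain (Integer.MAX_VALUE for an empty chain).
    static int packSize(int vectorBytes, int... elemBytesInChain) {
        int min = Integer.MAX_VALUE;
        for (int elemBytes : elemBytesInChain) {
            min = Math.min(min, lanes(vectorBytes, elemBytes));
        }
        return min;
    }

    // Position of a node inside its pack, independent of its element size.
    static int laneIndex(int alignmentBytes, int elemBytes) {
        return alignmentBytes / elemBytes;
    }

    // Translate the alignment of a use node into the alignment expected of its def node.
    static int defAlignment(int useAlignment, int useElemBytes, int defElemBytes) {
        return laneIndex(useAlignment, useElemBytes) * defElemBytes;
    }
}
```

With a 16-byte vector and the chain LoadI (4 bytes), ConvI2D (8 bytes), StoreD (8 bytes), `packSize(16, 4, 8, 8)` yields 2, matching the "2 LoadI, 2 ConvI2D and 2 StoreD" packing described above. Likewise `defAlignment(16, 8, 4)` and `defAlignment(24, 8, 4)` give 8 and 12, the LoadI alignments quoted for the 512-bit case.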
>>
>> Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use().
>>
>> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.
>>
>> Here is the test data (-XX:+UseSuperWord) on NEON:
>>
>> Before the patch:
>> Benchmark    (length)  Mode  Cnt    Score    Error  Units
>> convertD2F        523  avgt   15  216.431 ±  0.131  ns/op
>> convertD2I        523  avgt   15  220.522 ±  0.311  ns/op
>> convertF2D        523  avgt   15  217.034 ±  0.292  ns/op
>> convertF2L        523  avgt   15  231.634 ±  1.881  ns/op
>> convertI2D        523  avgt   15  229.538 ±  0.095  ns/op
>> convertI2L        523  avgt   15  214.822 ±  0.131  ns/op
>> convertL2F        523  avgt   15  230.188 ±  0.217  ns/op
>> convertL2I        523  avgt   15  162.234 ±  0.235  ns/op
>>
>> After the patch:
>> Benchmark    (length)  Mode  Cnt    Score    Error  Units
>> convertD2F        523  avgt   15  124.352 ±  1.079  ns/op
>> convertD2I        523  avgt   15  557.388 ±  8.166  ns/op
>> convertF2D        523  avgt   15  118.082 ±  4.026  ns/op
>> convertF2L        523  avgt   15  225.810 ± 11.180  ns/op
>> convertI2D        523  avgt   15  166.247 ±  0.120  ns/op
>> convertI2L        523  avgt   15  119.699 ±  2.925  ns/op
>> convertL2F        523  avgt   15  220.847 ±  0.053  ns/op
>> convertL2I        523  avgt   15  122.339 ±  2.738  ns/op
>>
>> perf data on X86:
>> Before the patch:
>> Benchmark    (length)  Mode  Cnt    Score    Error  Units
>> convertD2F        523  avgt   15  279.466 ±  0.069  ns/op
>> convertD2I        523  avgt   15  551.009 ±  7.459  ns/op
>> convertF2D        523  avgt   15  276.066 ±  0.117  ns/op
>> convertF2L        523  avgt   15  545.108 ±  5.697  ns/op
>> convertI2D        523  avgt   15  745.303 ±  0.185  ns/op
>> convertI2L        523  avgt   15  260.878 ±  0.044  ns/op
>> convertL2F        523  avgt   15  502.016 ±  0.172  ns/op
>> convertL2I        523  avgt   15  261.654 ±  3.326  ns/op
>>
>> After the patch:
>> Benchmark    (length)  Mode  Cnt    Score    Error  Units
>> convertD2F        523  avgt   15  106.975 ±  0.045  ns/op
>> convertD2I        523  avgt   15  546.866 ±  9.287  ns/op
>> convertF2D        523  avgt   15   82.414 ±  0.340  ns/op
>> convertF2L        523  avgt   15  542.235 ±  2.785  ns/op
>> convertI2D        523  avgt   15   92.966 ±  1.400  ns/op
>> convertI2L        523  avgt   15   79.960 ±  0.528  ns/op
>> convertL2F        523  avgt   15  504.712 ±  4.794  ns/op
>> convertL2I        523  avgt   15  129.753 ±  0.094  ns/op
>>
>> perf data on AVX512:
>> Before the patch:
>> Benchmark    (length)  Mode  Cnt    Score    Error  Units
>> convertD2F        523  avgt   15  282.984 ±  4.022  ns/op
>> convertD2I        523  avgt   15  543.080 ±  3.873  ns/op
>> convertF2D        523  avgt   15  273.950 ±  0.131  ns/op
>> convertF2L        523  avgt   15  539.568 ±  2.747  ns/op
>> convertI2D        523  avgt   15  745.238 ±  0.069  ns/op
>> convertI2L        523  avgt   15  260.935 ±  0.169  ns/op
>> convertL2F        523  avgt   15  501.870 ±  0.359  ns/op
>> convertL2I        523  avgt   15  257.508 ±  0.174  ns/op
>>
>> After the patch:
>> Benchmark    (length)  Mode  Cnt    Score    Error  Units
>> convertD2F        523  avgt   15   76.687 ±  0.530  ns/op
>> convertD2I        523  avgt   15  545.408 ±  4.657  ns/op
>> convertF2D        523  avgt   15  273.935 ±  0.099  ns/op
>> convertF2L        523  avgt   15  540.534 ±  3.032  ns/op
>> convertI2D        523  avgt   15  745.234 ±  0.053  ns/op
>> convertI2L        523  avgt   15  260.865 ±  0.104  ns/op
>> convertL2F        523  avgt   15   63.834 ±  4.777  ns/op
>> convertL2I        523  avgt   15   48.183 ±  0.990  ns/op
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 13 commits: > > - Add an IR framework testcase > > Change-Id: Ifbcc8d233aa27dfe93acef548c7e42721d86376e > - Merge branch 'master' into fg8283091 > > Change-Id: I9525ae9310c3c493da29490d034cbb8f223e7f80 > - Update to the latest JDK and fix the function name > > Change-Id: Ie1907f86e2df7051aa2ddb7e5b05a371e887d1bc > - Merge branch 'master' into fg8283091 > > Change-Id: I3ef746178c07004cc34c22081a3044fb40e87702 > - Add assertion line for opcode() and withdraw some common code as a function > > Change-Id: I7b5dbe60fec6979de454f347d074e6fc01126dfe > - Merge branch 'master' into fg8283091 > > Change-Id: I42bec08da55e86fb1f049bb691138f3fcf6dbed6 > - Implement an interface for auto-vectorization to consult supported match rules > > Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701 > - Merge branch 'master' into fg8283091 > > Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd > - Merge branch 'master' into fg8283091 > > Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f > - Merge branch 'master' into fg8283091 > > Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 > - ... and 3 more: https://git.openjdk.org/jdk/compare/f1143b1b...49e6f56e Thank you for test. @sviswa7 can you review latest changes to have 2 reviews? ------------- PR: https://git.openjdk.org/jdk/pull/7806 From kvn at openjdk.java.net Tue Jun 14 22:32:37 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 14 Jun 2022 22:32:37 GMT Subject: RFR: 8288360: CI: ciInstanceKlass::implementor() is not consistent for well-known classes In-Reply-To: References: Message-ID: On Tue, 14 Jun 2022 17:38:25 GMT, Vladimir Ivanov wrote: > ciInstanceKlass::implementor() doesn't cache the result for well-known interfaces (is_shared() == true). Due to concurrent class loading, compilers can observe a change in reported unique implementor (in the worst case: from having no implementors to having one, then to having many) thus introducing paradoxical situations during a compilation. 
> > What makes it very hard/impossible to trigger the bug is there's only a single well-known interface (java.util.Iterable) present as of now, which gets multiple implementors loaded early during startup. > > Testing: hs-tier1 - hs-tier2 Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk19/pull/15 From vlivanov at openjdk.java.net Tue Jun 14 22:40:42 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Tue, 14 Jun 2022 22:40:42 GMT Subject: RFR: 8288360: CI: ciInstanceKlass::implementor() is not consistent for well-known classes In-Reply-To: References: Message-ID: On Tue, 14 Jun 2022 17:38:25 GMT, Vladimir Ivanov wrote: > ciInstanceKlass::implementor() doesn't cache the result for well-known interfaces (is_shared() == true). Due to concurrent class loading, compilers can observe a change in reported unique implementor (in the worst case: from having no implementors to having one, then to having many) thus introducing paradoxical situations during a compilation. > > What makes it very hard/impossible to trigger the bug is there's only a single well-known interface (java.util.Iterable) present as of now, which gets multiple implementors loaded early during startup. > > Testing: hs-tier1 - hs-tier2 Thanks for the reviews, Tobias and Vladimir. ------------- PR: https://git.openjdk.org/jdk19/pull/15 From vlivanov at openjdk.java.net Tue Jun 14 22:40:43 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Tue, 14 Jun 2022 22:40:43 GMT Subject: Integrated: 8288360: CI: ciInstanceKlass::implementor() is not consistent for well-known classes In-Reply-To: References: Message-ID: On Tue, 14 Jun 2022 17:38:25 GMT, Vladimir Ivanov wrote: > ciInstanceKlass::implementor() doesn't cache the result for well-known interfaces (is_shared() == true). 
Due to concurrent class loading, compilers can observe a change in reported unique implementor (in the worst case: from having no implementors to having one, then to having many) thus introducing paradoxical situations during a compilation. > > What makes it very hard/impossible to trigger the bug is there's only a single well-known interface (java.util.Iterable) present as of now, which gets multiple implementors loaded early during startup. > > Testing: hs-tier1 - hs-tier2 This pull request has now been integrated. Changeset: 50f99c32 Author: Vladimir Ivanov URL: https://git.openjdk.org/jdk19/commit/50f99c3208fc9f479cc109fb6e73d262e27026a2 Stats: 7 lines in 1 file changed: 2 ins; 2 del; 3 mod 8288360: CI: ciInstanceKlass::implementor() is not consistent for well-known classes Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk19/pull/15 From haosun at openjdk.java.net Wed Jun 15 04:17:00 2022 From: haosun at openjdk.java.net (Hao Sun) Date: Wed, 15 Jun 2022 04:17:00 GMT Subject: RFR: 8288300: AArch64: Remove the assertion in fmovs/fmovd(FloatRegister, FloatRegister) Message-ID: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com> The assertion, i.e. src and dst must be different registers, was introduced years ago. But I don't think it's needed. This limitation was added in [1]. Frankly speaking, I don't know the reason. But I guess the assertion is probably used for debugging, raising one warning of fmovs/fmovd usage in the scenario of moving element at index zero from one **vector** register, to one float-point scalar register. If the "src" vector register and the "dst" float-point scalar register are the same one, it introduces a side-effect, i.e. the higher bits are cleared to zeros[2]. If so, I argue that 1) the assembler should align with the ISA. 
2) compiler developers should be aware of the side-effect when they want to use fmovs/fmovd, and they should guarantee "dst != src" if they like to higher bits untouched, e.g., [3]. Hence, I think we can remove this unnecessary assertion. [1] http://hg.openjdk.java.net/aarch64-port/jdk8/hotspot/rev/9baee4e65ac5 [2] https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/FMOV--register---Floating-point-Move-register-without-conversion-?lang=en [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L4899 ------------- Commit messages: - 8288300: AArch64: Remove the assertion in fmovs/fmovd(FloatRegister, FloatRegister) Changes: https://git.openjdk.org/jdk/pull/9163/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9163&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288300 Stats: 22 lines in 1 file changed: 0 ins; 14 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/9163.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9163/head:pull/9163 PR: https://git.openjdk.org/jdk/pull/9163 From thartmann at openjdk.java.net Wed Jun 15 05:58:43 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 15 Jun 2022 05:58:43 GMT Subject: RFR: JDK-8287349: AArch64: Merge LDR instructions to improve C1 OSR performance [v2] In-Reply-To: References: Message-ID: On Tue, 31 May 2022 07:47:24 GMT, Zhuojun Miao wrote: >> Since MacroAssembler added merge_ldst, we can use different >> destination registers for contiguous-memory LDR instructions to improve performance. > > Zhuojun Miao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8287349 > - use ldp explicitly > - JDK-8287349: Merge LDR instructions to improve C1 OSR performance Submitted testing, all passed. 
------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/8933 From zmiao at openjdk.java.net Wed Jun 15 06:01:06 2022 From: zmiao at openjdk.java.net (Zhuojun Miao) Date: Wed, 15 Jun 2022 06:01:06 GMT Subject: Integrated: JDK-8287349: AArch64: Merge LDR instructions to improve C1 OSR performance In-Reply-To: References: Message-ID: On Sat, 28 May 2022 06:28:57 GMT, Zhuojun Miao wrote: > Since MacroAssembler added merge_ldst, we can use different > destination registers for contiguous-memory LDR instructions to improve performance. This pull request has now been integrated. Changeset: 08400f18 Author: Zhuojun Miao Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/08400f18badb23ea3d00282e8b71e76844398a67 Stats: 3 lines in 1 file changed: 0 ins; 1 del; 2 mod 8287349: AArch64: Merge LDR instructions to improve C1 OSR performance Reviewed-by: aph, ngasson, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/8933 From aph at openjdk.java.net Wed Jun 15 06:24:41 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Wed, 15 Jun 2022 06:24:41 GMT Subject: RFR: 8288300: AArch64: Remove the assertion in fmovs/fmovd(FloatRegister, FloatRegister) In-Reply-To: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com> References: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com> Message-ID: On Wed, 15 Jun 2022 04:07:47 GMT, Hao Sun wrote: > The assertion, i.e. src and dst must be different registers, was > introduced years ago. But I don't think it's needed. > > This limitation was added in [1]. Frankly speaking, I don't know the > reason. But I guess the assertion is probably used for debugging, > raising one warning of fmovs/fmovd usage in the scenario of moving > element at index zero from one **vector** register, to one float-point > scalar register. 
If the "src" vector register and the "dst" float-point > scalar register are the same one, it introduces a side-effect, i.e. the > higher bits are cleared to zeros[2]. > > If so, I argue that > 1) the assembler should align with the ISA. > 2) compiler developers should be aware of the side-effect when they want > to use fmovs/fmovd, and they should guarantee "dst != src" if they like > to higher bits untouched, e.g., [3]. > > Hence, I think we can remove this unnecessary assertion. > > [1] http://hg.openjdk.java.net/aarch64-port/jdk8/hotspot/rev/9baee4e65ac5 > [2] https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/FMOV--register---Floating-point-Move-register-without-conversion-?lang=en > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L4899 src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 1935: > 1933: INSN(fmovs, 0b000, 0b00, 0b000000); > 1934: INSN(fabss, 0b000, 0b00, 0b000001); > 1935: INSN(fnegs, 0b000, 0b00, 0b000010); There's unnecessary whitespace here. 
Suggestion: INSN(fmovs, 0b000, 0b00, 0b000000); INSN(fabss, 0b000, 0b00, 0b000001); INSN(fnegs, 0b000, 0b00, 0b000010); src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 1941: > 1939: INSN(fmovd, 0b000, 0b01, 0b000000); > 1940: INSN(fabsd, 0b000, 0b01, 0b000001); > 1941: INSN(fnegd, 0b000, 0b01, 0b000010); Suggestion: INSN(fmovd, 0b000, 0b01, 0b000000); INSN(fabsd, 0b000, 0b01, 0b000001); INSN(fnegd, 0b000, 0b01, 0b000010); ------------- PR: https://git.openjdk.org/jdk/pull/9163 From aph at openjdk.java.net Wed Jun 15 06:28:38 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Wed, 15 Jun 2022 06:28:38 GMT Subject: RFR: 8288300: AArch64: Remove the assertion in fmovs/fmovd(FloatRegister, FloatRegister) In-Reply-To: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com> References: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com> Message-ID: <9VCCVjAjqefkVQ20-Yj7Sy9f-HTlGOmn6ho1zUZ6xdw=.c08d3dfb-cd88-47bb-88ea-4f1dc158091f@github.com> On Wed, 15 Jun 2022 04:07:47 GMT, Hao Sun wrote: > The assertion, i.e. src and dst must be different registers, was > introduced years ago. But I don't think it's needed. > > This limitation was added in [1]. Frankly speaking, I don't know the > reason. But I guess the assertion is probably used for debugging, > raising one warning of fmovs/fmovd usage in the scenario of moving > element at index zero from one **vector** register, to one float-point > scalar register. If the "src" vector register and the "dst" float-point > scalar register are the same one, it introduces a side-effect, i.e. the > higher bits are cleared to zeros[2]. > > If so, I argue that > 1) the assembler should align with the ISA. > 2) compiler developers should be aware of the side-effect when they want > to use fmovs/fmovd, and they should guarantee "dst != src" if they like > to higher bits untouched, e.g., [3]. > > Hence, I think we can remove this unnecessary assertion. 
> > [1] http://hg.openjdk.java.net/aarch64-port/jdk8/hotspot/rev/9baee4e65ac5 > [2] https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/FMOV--register---Floating-point-Move-register-without-conversion-?lang=en > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L4899 I don't remember, but I suspect that such moves are probably a sign that there is a bug elsewhere in the port, perhaps a misnamed register or a buggy compiler. I think the assertions are no more than a warning to the developer that something may be wrong elsewhere. I am happy for the assertions to be removed. As you say, there is no such architectural restriction that requires them. ------------- PR: https://git.openjdk.org/jdk/pull/9163 From haosun at openjdk.java.net Wed Jun 15 06:41:39 2022 From: haosun at openjdk.java.net (Hao Sun) Date: Wed, 15 Jun 2022 06:41:39 GMT Subject: RFR: 8288300: AArch64: Remove the assertion in fmovs/fmovd(FloatRegister, FloatRegister) In-Reply-To: References: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com> Message-ID: On Wed, 15 Jun 2022 06:20:12 GMT, Andrew Haley wrote: >> The assertion, i.e. src and dst must be different registers, was >> introduced years ago. But I don't think it's needed. >> >> This limitation was added in [1]. Frankly speaking, I don't know the >> reason. But I guess the assertion is probably used for debugging, >> raising one warning of fmovs/fmovd usage in the scenario of moving >> element at index zero from one **vector** register, to one float-point >> scalar register. If the "src" vector register and the "dst" float-point >> scalar register are the same one, it introduces a side-effect, i.e. the >> higher bits are cleared to zeros[2]. >> >> If so, I argue that >> 1) the assembler should align with the ISA. 
>> 2) compiler developers should be aware of the side-effect when they want >> to use fmovs/fmovd, and they should guarantee "dst != src" if they like >> to higher bits untouched, e.g., [3]. >> >> Hence, I think we can remove this unnecessary assertion. >> >> [1] http://hg.openjdk.java.net/aarch64-port/jdk8/hotspot/rev/9baee4e65ac5 >> [2] https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/FMOV--register---Floating-point-Move-register-without-conversion-?lang=en >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L4899 > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 1935: > >> 1933: INSN(fmovs, 0b000, 0b00, 0b000000); >> 1934: INSN(fabss, 0b000, 0b00, 0b000001); >> 1935: INSN(fnegs, 0b000, 0b00, 0b000010); > > There's unnecessary whitespace here. > Suggestion: > > INSN(fmovs, 0b000, 0b00, 0b000000); > INSN(fabss, 0b000, 0b00, 0b000001); > INSN(fnegs, 0b000, 0b00, 0b000010); Thanks for your review. It's a style issue. I added one whitespace for `fmov/fabs/fneg` and the later `fcvt` so as to align the arguments to those of `fsqrt`. I saw such a style in several other sites in this header. If you don't like it, I can remove the added whitespace. But I guess you may also want me to remove the whitespace I added for the `fcvt` below. 
-------------

PR: https://git.openjdk.org/jdk/pull/9163

From aph at openjdk.java.net Wed Jun 15 06:56:45 2022
From: aph at openjdk.java.net (Andrew Haley)
Date: Wed, 15 Jun 2022 06:56:45 GMT
Subject: RFR: 8288300: AArch64: Remove the assertion in fmovs/fmovd(FloatRegister, FloatRegister)
In-Reply-To:
References: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com>
Message-ID:

On Wed, 15 Jun 2022 06:37:48 GMT, Hao Sun wrote:

>> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 1935:
>>
>>> 1933: INSN(fmovs, 0b000, 0b00, 0b000000);
>>> 1934: INSN(fabss, 0b000, 0b00, 0b000001);
>>> 1935: INSN(fnegs, 0b000, 0b00, 0b000010);
>>
>> There's unnecessary whitespace here.
>> Suggestion:
>>
>> INSN(fmovs, 0b000, 0b00, 0b000000);
>> INSN(fabss, 0b000, 0b00, 0b000001);
>> INSN(fnegs, 0b000, 0b00, 0b000010);
>
> Thanks for your review.
> It's a style issue. I added one whitespace for `fmov/fabs/fneg` and the later `fcvt` so as to align the arguments to those of `fsqrt`. I saw such a style in several other sites in this header.
>
> If you don't like it, I can remove the added whitespace. But I guess you may also want me to remove the whitespace I added for the `fcvt` below.

I see. You're right, we're not consistent about this in the file. There are advantages and disadvantages. On the one hand it's a little easier to read, but on the other it's more work to maintain and tends to get eroded over time.

Whitespace-only changes as part of a larger patch can be hard to review, because they can hide the substantive changes. Please take out the added whitespace in this patch, and submit just the substantive changes.

I will think about what to do to make this file consistent.
------------- PR: https://git.openjdk.org/jdk/pull/9163 From chagedorn at openjdk.java.net Wed Jun 15 07:46:43 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Wed, 15 Jun 2022 07:46:43 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump [v5] In-Reply-To: References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: On Tue, 14 Jun 2022 08:16:36 GMT, Emanuel Peter wrote: >> **Goal** >> Refactor `Node::find`, allow not just searching for `node->_idx`, but also matching for `node->Name()` and even `node->dump()`. >> >> **Proposal** >> Refactor `Node::find` into `visit_nodes`, which visits all nodes and calls a `callback` on them. This callback can be used to filter by `idx` (`find_node_by_idx`, `Node::find`, `find_node` etc.). It can also be used to match node names (`find_node_by_name`) and even node dump (`find_node_by_dump`). >> >> Thus, I present these additional functions: >> `Node* find_node_by_name(const char* name)`: find all nodes matching the `name` pattern. >> `Node* find_node_by_dump(const char* pattern)`: find all nodes matching the `pattern`. >> The nodes are sorted by node idx, and then dumped. >> >> Patterns can contain `*` characters to match any characters (eg. `Con*L` matches both `ConL` and `ConvI2L`) >> >> **Usecase** >> Find all `CastII` nodes. Find all `Loop` nodes. Use `find_node_by_name`. >> >> Find all all `CastII` nodes that depend on a rangecheck. Use `find_node_by_dump("CastII*range check dependency")`. >> Find all `Bool` nodes that perform a `[ne]` check. Use `find_node_by_dump("Bool*[ne]")`. >> Find all `Phi` nodes that are `tripcount`. Use `find_node_by_dump("Phi*tripcount")`. >> >> Find all `Load` nodes that are associated with line 301 in some file. Use `find_node_by_dump("Load*line 301")`. 
>> >> You can probably find more usecases yourself ;) > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 14 additional commits since the last revision: > > - remove duplicate node_idx_cmp after merge > - Merge branch 'master' into JDK-8287647 > - adding resource marks, moving Unique_Mixed_Node_List to node.hpp > - fix header issues > - changes responding to review by Christian and Roberto > - style fixes, and implemented case insensitive strstr > - make matching case insensitive > - missing include > - ensure null termination > - guard against long pattern, and fix array/pointer issues > - ... and 4 more: https://git.openjdk.org/jdk/compare/877b4f33...653a7cc3 Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/8988 From haosun at openjdk.java.net Wed Jun 15 07:56:38 2022 From: haosun at openjdk.java.net (Hao Sun) Date: Wed, 15 Jun 2022 07:56:38 GMT Subject: RFR: 8288300: AArch64: Remove the assertion in fmovs/fmovd(FloatRegister, FloatRegister) [v2] In-Reply-To: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com> References: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com> Message-ID: <7PKCXLBWJ484BiZD-tfXxoi-nfxFz_ZzfbwyJp7LsSc=.4f41f14f-0e5e-411a-b7bf-91d093fd10ea@github.com> > The assertion, i.e. src and dst must be different registers, was > introduced years ago. But I don't think it's needed. > > This limitation was added in [1]. Frankly speaking, I don't know the > reason. But I guess the assertion is probably used for debugging, > raising one warning of fmovs/fmovd usage in the scenario of moving > element at index zero from one **vector** register, to one float-point > scalar register. 
If the "src" vector register and the "dst" float-point > scalar register are the same one, it introduces a side-effect, i.e. the > higher bits are cleared to zeros[2]. > > If so, I argue that > 1) the assembler should align with the ISA. > 2) compiler developers should be aware of the side-effect when they want > to use fmovs/fmovd, and they should guarantee "dst != src" if they like > to higher bits untouched, e.g., [3]. > > Hence, I think we can remove this unnecessary assertion. > > [1] http://hg.openjdk.java.net/aarch64-port/jdk8/hotspot/rev/9baee4e65ac5 > [2] https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/FMOV--register---Floating-point-Move-register-without-conversion-?lang=en > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L4899 Hao Sun has updated the pull request incrementally with one additional commit since the last revision: Remove unnecessary whitespace As suggested by aph, remove the unnecessary whitespace and keep the substantive changes only. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9163/files - new: https://git.openjdk.org/jdk/pull/9163/files/58361d8b..4949c37e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9163&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9163&range=00-01 Stats: 8 lines in 1 file changed: 0 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/9163.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9163/head:pull/9163 PR: https://git.openjdk.org/jdk/pull/9163 From haosun at openjdk.java.net Wed Jun 15 07:56:39 2022 From: haosun at openjdk.java.net (Hao Sun) Date: Wed, 15 Jun 2022 07:56:39 GMT Subject: RFR: 8288300: AArch64: Remove the assertion in fmovs/fmovd(FloatRegister, FloatRegister) [v2] In-Reply-To: References: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com> Message-ID: On Wed, 15 Jun 2022 06:53:25 GMT, Andrew Haley wrote: >> Thanks for your review. 
>> It's a style issue. I added one whitespace for `fmov/fabs/fneg` and the later `fcvt` so as to align the arguments to those of `fsqrt`. I saw such a style in several other sites in this header. >> >> If you don't like it, I can remove the added whitespace. But I guess you may also want me to remove the whitespace I added for the `fcvt` below. > > I see. You're right, we're not consistent about this in the file. > > There are advantages and disadvantages. On the one hand it's a little easier to read, but on the other it's more work to maintain and tends to get eroded over time. > > Whitespace-only changes as part of a larger patch can be hard to review, because they can hide the substantive changes. > > Please take out the added whitespace in this patch, and submit just the substantive changes. I will think about what to do to make this file consistent. Updated in the latest revision. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/9163 From epeter at openjdk.java.net Wed Jun 15 08:09:49 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Wed, 15 Jun 2022 08:09:49 GMT Subject: RFR: 8287647: VM debug support: find node by pattern in name or dump [v3] In-Reply-To: References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: On Tue, 14 Jun 2022 07:04:17 GMT, Tobias Hartmann wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> fix header issues > > Very nice! Thanks @TobiHartmann @robcasloz @chhagedorn @vnkozlov for the reviews and suggestions for improvement. 
------------- PR: https://git.openjdk.org/jdk/pull/8988 From epeter at openjdk.java.net Wed Jun 15 08:09:50 2022 From: epeter at openjdk.java.net (Emanuel Peter) Date: Wed, 15 Jun 2022 08:09:50 GMT Subject: Integrated: 8287647: VM debug support: find node by pattern in name or dump In-Reply-To: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> References: <6dwMBFImj6Ev_XieTRj9zN1i5srnqPbuB5Jxm9TqjpY=.253fc8a2-eea8-4d7d-93a8-d3efbbcc5e59@github.com> Message-ID: On Thu, 2 Jun 2022 09:16:28 GMT, Emanuel Peter wrote: > **Goal** > Refactor `Node::find`, allow not just searching for `node->_idx`, but also matching for `node->Name()` and even `node->dump()`. > > **Proposal** > Refactor `Node::find` into `visit_nodes`, which visits all nodes and calls a `callback` on them. This callback can be used to filter by `idx` (`find_node_by_idx`, `Node::find`, `find_node` etc.). It can also be used to match node names (`find_node_by_name`) and even node dump (`find_node_by_dump`). > > Thus, I present these additional functions: > `Node* find_node_by_name(const char* name)`: find all nodes matching the `name` pattern. > `Node* find_node_by_dump(const char* pattern)`: find all nodes matching the `pattern`. > The nodes are sorted by node idx, and then dumped. > > Patterns can contain `*` characters to match any characters (eg. `Con*L` matches both `ConL` and `ConvI2L`) > > **Usecase** > Find all `CastII` nodes. Find all `Loop` nodes. Use `find_node_by_name`. > > Find all all `CastII` nodes that depend on a rangecheck. Use `find_node_by_dump("CastII*range check dependency")`. > Find all `Bool` nodes that perform a `[ne]` check. Use `find_node_by_dump("Bool*[ne]")`. > Find all `Phi` nodes that are `tripcount`. Use `find_node_by_dump("Phi*tripcount")`. > > Find all `Load` nodes that are associated with line 301 in some file. Use `find_node_by_dump("Load*line 301")`. 
> > You can probably find more usecases yourself ;) This pull request has now been integrated. Changeset: 2471f8f7 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/2471f8f7c56dfc1b8de287cb990121d30976ba36 Stats: 260 lines in 4 files changed: 208 ins; 51 del; 1 mod 8287647: VM debug support: find node by pattern in name or dump Reviewed-by: kvn, chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/8988 From aph at openjdk.java.net Wed Jun 15 08:15:39 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Wed, 15 Jun 2022 08:15:39 GMT Subject: RFR: 8288300: AArch64: Remove the assertion in fmovs/fmovd(FloatRegister, FloatRegister) [v2] In-Reply-To: <7PKCXLBWJ484BiZD-tfXxoi-nfxFz_ZzfbwyJp7LsSc=.4f41f14f-0e5e-411a-b7bf-91d093fd10ea@github.com> References: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com> <7PKCXLBWJ484BiZD-tfXxoi-nfxFz_ZzfbwyJp7LsSc=.4f41f14f-0e5e-411a-b7bf-91d093fd10ea@github.com> Message-ID: On Wed, 15 Jun 2022 07:56:38 GMT, Hao Sun wrote: >> The assertion, i.e. src and dst must be different registers, was >> introduced years ago. But I don't think it's needed. >> >> This limitation was added in [1]. Frankly speaking, I don't know the >> reason. But I guess the assertion is probably used for debugging, >> raising one warning of fmovs/fmovd usage in the scenario of moving >> element at index zero from one **vector** register, to one float-point >> scalar register. If the "src" vector register and the "dst" float-point >> scalar register are the same one, it introduces a side-effect, i.e. the >> higher bits are cleared to zeros[2]. >> >> If so, I argue that >> 1) the assembler should align with the ISA. >> 2) compiler developers should be aware of the side-effect when they want >> to use fmovs/fmovd, and they should guarantee "dst != src" if they like >> to higher bits untouched, e.g., [3]. >> >> Hence, I think we can remove this unnecessary assertion. 
>> >> [1] http://hg.openjdk.java.net/aarch64-port/jdk8/hotspot/rev/9baee4e65ac5 >> [2] https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/FMOV--register---Floating-point-Move-register-without-conversion-?lang=en >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L4899 > > Hao Sun has updated the pull request incrementally with one additional commit since the last revision: > > Remove unnecessary whitespace > > As suggested by aph, remove the unnecessary whitespace and keep the > substantive changes only. Thanks. ------------- Marked as reviewed by aph (Reviewer). PR: https://git.openjdk.org/jdk/pull/9163 From rcastanedalo at openjdk.java.net Wed Jun 15 09:12:16 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 15 Jun 2022 09:12:16 GMT Subject: RFR: 8288421: IGV: warn user about all unreachable nodes Message-ID: <4ZHwSlqIsbDWFokTr-mioh1a-n4XIZWr7eK1vgVOMAA=.e97a46cd-68a3-4d71-904e-f9ca433f6046@github.com> This changeset ensures that, when approximating C2's schedule, IGV does not schedule unreachable nodes. Instead, a node warning is emitted, informing the user that the corresponding node is unreachable. This information can be useful when debugging ill-formed graphs. The following clustered subgraph illustrates the proposed change: ![before-after](https://user-images.githubusercontent.com/8792647/173784252-5fccb80b-7c36-49bf-8c52-eed502cc129c.png) Currently _(before)_, `522 IfFalse` gets assigned the same block as `256 Region` (`B11`) in an effort to schedule as many nodes as possible, and hence no warning is emitted for `522 IfFalse`, even though it is clearly control-unreachable (since it is a child of `520 If` which is control-unreachable). This changeset _(after)_ leaves instead `522 IfFalse` unscheduled and emits a "Control-unreachable CFG node" warning for it (visible as a tooltip of the node warning sign). 
As a side-benefit, the changeset simplifies the IGV scheduling algorithm by removing the code that tries to schedule unreachable nodes on a best-effort basis, and adds two additional node warnings ("Region with multiple successors" and "CFG node without control successors") to highlight the new nodes that might remain unscheduled as a consequence. #### Testing - Tested manually on the [graph](https://bugs.openjdk.org/secure/attachment/99555/graph.xml) attached to the JBS issue. - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not trigger any exception or assertion failure. ------------- Commit messages: - Do not schedule unreachable CFG nodes, warn instead Changes: https://git.openjdk.org/jdk/pull/9164/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9164&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288421 Stats: 91 lines in 1 file changed: 11 ins; 79 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9164.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9164/head:pull/9164 PR: https://git.openjdk.org/jdk/pull/9164 From chagedorn at openjdk.java.net Wed Jun 15 09:24:38 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Wed, 15 Jun 2022 09:24:38 GMT Subject: RFR: 8288421: IGV: warn user about all unreachable nodes In-Reply-To: <4ZHwSlqIsbDWFokTr-mioh1a-n4XIZWr7eK1vgVOMAA=.e97a46cd-68a3-4d71-904e-f9ca433f6046@github.com> References: <4ZHwSlqIsbDWFokTr-mioh1a-n4XIZWr7eK1vgVOMAA=.e97a46cd-68a3-4d71-904e-f9ca433f6046@github.com> Message-ID: On Wed, 15 Jun 2022 08:50:39 GMT, Roberto Castañeda Lozano wrote: > This changeset ensures that, when approximating C2's schedule, IGV does not schedule unreachable nodes. Instead, a node warning is emitted, informing the user that the corresponding node is unreachable. This information can be useful when debugging ill-formed graphs. 
> > The following clustered subgraph illustrates the proposed change: > > ![before-after](https://user-images.githubusercontent.com/8792647/173784252-5fccb80b-7c36-49bf-8c52-eed502cc129c.png) > > Currently _(before)_, `522 IfFalse` gets assigned the same block as `256 Region` (`B11`) in an effort to schedule as many nodes as possible, and hence no warning is emitted for `522 IfFalse`, even though it is clearly control-unreachable (since it is a child of `520 If` which is control-unreachable). This changeset _(after)_ instead leaves `522 IfFalse` unscheduled and emits a "Control-unreachable CFG node" warning for it (visible as a tooltip of the node warning sign). > > As a side-benefit, the changeset simplifies the IGV scheduling algorithm by removing the code that tries to schedule unreachable nodes on a best-effort basis, and adds two additional node warnings ("Region with multiple successors" and "CFG node without control successors") to highlight the new nodes that might remain unscheduled as a consequence. > > #### Testing > > - Tested manually on the [graph](https://bugs.openjdk.org/secure/attachment/99555/graph.xml) attached to the JBS issue. > > - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not trigger any exception or assertion failure. Looks good, thanks for fixing this! And nice new warnings. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9164 From rcastanedalo at openjdk.java.net Wed Jun 15 09:38:39 2022 From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano) Date: Wed, 15 Jun 2022 09:38:39 GMT Subject: RFR: 8288421: IGV: warn user about all unreachable nodes In-Reply-To: References: <4ZHwSlqIsbDWFokTr-mioh1a-n4XIZWr7eK1vgVOMAA=.e97a46cd-68a3-4d71-904e-f9ca433f6046@github.com> Message-ID: On Wed, 15 Jun 2022 09:21:16 GMT, Christian Hagedorn wrote: > Looks good, thanks for fixing this! And nice new warnings. Thanks for reviewing, Christian. ------------- PR: https://git.openjdk.org/jdk/pull/9164 From xgong at openjdk.java.net Wed Jun 15 09:49:22 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Wed, 15 Jun 2022 09:49:22 GMT Subject: RFR: 8288397: AArch64: Fix register issues in SVE backend match rules Message-ID: <0vPwXBEnXX_w1358C7v4JCBZ_4uIGxokASDSkghGQS0=.01fa04ba-1213-4105-9734-efea6ff6293e@github.com> There are register usage issues in the SVE backend match rules, which made the two added jtreg tests fail. The predicated vector "`not`" rules didn't use the same register for "`src`" and "`dst`", which is necessary to make sure the inactive lanes in "`dst`" hold the same elements as "`src`". This patch fixes the rules by using the same register for "`dst`" and "`src`". In addition, the input idx register in the "`gatherL/scatterL`" rules was overwritten by the first unpack instruction. The same issue also existed in the partial and predicated gatherL/scatterL rules. This patch fixes them by saving the unpack results into a temp register and using it as the index for gather/scatter. 
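The register constraint on the predicated "`not`" rules can be illustrated with a small plain-Java model. This is only a sketch under assumptions: `PredicatedNotSketch` and `notMerging` are hypothetical names, arrays stand in for vector registers, and the loop models a merging-predicated operation in which inactive lanes of the destination are left unchanged. It is not the code of the match rules themselves.

```java
import java.util.Arrays;

public class PredicatedNotSketch {
    // Models a merging-predicated NOT: active lanes of dst receive ~src,
    // inactive lanes keep whatever dst already held (assumption of this sketch).
    public static void notMerging(int[] dst, boolean[] mask, int[] src) {
        for (int i = 0; i < dst.length; i++) {
            if (mask[i]) {
                dst[i] = ~src[i];
            }
        }
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4};
        boolean[] mask = {true, false, true, false};

        // dst is a different "register" with stale contents:
        // the inactive lanes keep the stale values instead of the src elements.
        int[] stale = {99, 99, 99, 99};
        notMerging(stale, mask, src);
        System.out.println("separate dst: " + Arrays.toString(stale));

        // dst and src are the same "register":
        // the inactive lanes hold the src elements, as the fix requires.
        int[] inPlace = src.clone();
        notMerging(inPlace, mask, inPlace);
        System.out.println("dst == src:   " + Arrays.toString(inPlace));
    }
}
```

With a separate destination the inactive lanes expose unrelated data, which is the kind of mismatch the description attributes to the failing tests; reusing the source register as destination avoids it.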
------------- Commit messages: - 8288397: AArch64: Fix register issues in SVE backend match rules Changes: https://git.openjdk.org/jdk19/pull/17/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=17&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288397 Stats: 334 lines in 4 files changed: 274 ins; 0 del; 60 mod Patch: https://git.openjdk.org/jdk19/pull/17.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/17/head:pull/17 PR: https://git.openjdk.org/jdk19/pull/17 From rcastanedalo at openjdk.java.net Wed Jun 15 12:58:04 2022 From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano) Date: Wed, 15 Jun 2022 12:58:04 GMT Subject: RFR: 8263384: IGV: Outline should highlight the Graph that has focus Message-ID: This changeset eases navigation within and across graph groups by highlighting the focused graph in the Outline window. If the user changes the focus to another graph window, or moves to the previous or next graph within the same window, the newly focused graph is automatically highlighted in the Outline window. This is implemented by maintaining a static map from opened graphs to their corresponding [NetBeans nodes](https://netbeans.apache.org/tutorials/nbm-selection-2.html). The Outline window uses the map to select, on a graph focus change, the NetBeans node of the newly focused graph that should be highlighted. Tested manually by simultaneously opening tens of graphs from different groups and switching the focus randomly. 
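The map-based lookup described above might be sketched as follows. All names here (`OutlineHighlightSketch`, `graphOpened`, `focusChanged`, the string identifiers) are hypothetical and chosen for illustration; plain strings stand in for IGV graphs and NetBeans nodes, so this is a model of the idea rather than the actual IGV code.

```java
import java.util.HashMap;
import java.util.Map;

public class OutlineHighlightSketch {
    // Static map from an opened graph (here: its id) to its outline entry
    // (here: a label). Both are stand-ins for the real IGV/NetBeans types.
    private static final Map<String, String> OUTLINE_NODE_OF_GRAPH = new HashMap<>();
    private static String highlighted;

    public static void graphOpened(String graphId, String outlineLabel) {
        OUTLINE_NODE_OF_GRAPH.put(graphId, outlineLabel);
    }

    public static void graphClosed(String graphId) {
        OUTLINE_NODE_OF_GRAPH.remove(graphId);
    }

    // Called on a focus change: look up the outline entry of the newly
    // focused graph and make it the highlighted one, if it is known.
    public static void focusChanged(String graphId) {
        String node = OUTLINE_NODE_OF_GRAPH.get(graphId);
        if (node != null) {
            highlighted = node;
        }
    }

    public static String highlighted() {
        return highlighted;
    }

    public static void main(String[] args) {
        graphOpened("g1", "Group A / After Parsing");
        graphOpened("g2", "Group B / Final Code");
        focusChanged("g2");
        System.out.println(highlighted());
    }
}
```

A static map keeps the lookup independent of which window raised the focus event, at the cost of having to remove entries when graphs are closed.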
------------- Commit messages: - Highlight focused graph in outline Changes: https://git.openjdk.org/jdk/pull/9167/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9167&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8263384 Stats: 56 lines in 2 files changed: 50 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/9167.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9167/head:pull/9167 PR: https://git.openjdk.org/jdk/pull/9167 From rcastanedalo at openjdk.java.net Wed Jun 15 14:00:37 2022 From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano) Date: Wed, 15 Jun 2022 14:00:37 GMT Subject: RFR: 8288480: IGV: toolbar action is not applied to the focused graph Message-ID: When multiple graphs are displayed simultaneously in split windows, the following toolbar actions are always applied to the same graph, regardless of which graph window is focused: - search nodes and blocks - extract node - hide node - show all nodes - zoom in - zoom out This changeset ensures that each of the above actions is only applied within its corresponding graph window. This is achieved by applying the actions to the graph window that is currently activated (`EditorTopComponent.getRegistry().getActivated()`) instead of the first matching occurrence in `WindowManager.getDefault().getModes()`. The changeset makes it practical, for example, to explore different views of the same graph simultaneously, as illustrated here: ![multi-view](https://user-images.githubusercontent.com/8792647/173841115-084c6396-3843-4d9b-9951-f93c932100c3.png) Tested manually by triggering the above actions within multiple split graph windows and asserting that they are only applied to their corresponding graphs. 
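The difference between "first matching window" and "activated window" can be modeled in a few lines of plain Java. The class and method names below (`ActiveWindowSketch`, `GraphWindow`, `firstMatch`, `zoomIn`) are hypothetical; the sketch does not call the NetBeans window system and only assumes that the caller knows which window is activated.

```java
import java.util.Arrays;
import java.util.List;

public class ActiveWindowSketch {
    // Hypothetical stand-in for a graph editor window.
    public static final class GraphWindow {
        public final String title;
        public int zoomPercent = 100;

        public GraphWindow(String title) {
            this.title = title;
        }
    }

    // Behavior described as the problem: always pick the first window found.
    public static GraphWindow firstMatch(List<GraphWindow> windows) {
        return windows.get(0);
    }

    // A toolbar action applied to an explicitly given target window.
    public static void zoomIn(GraphWindow target) {
        target.zoomPercent += 10;
    }

    public static void main(String[] args) {
        GraphWindow left = new GraphWindow("left");
        GraphWindow right = new GraphWindow("right");
        List<GraphWindow> windows = Arrays.asList(left, right);
        GraphWindow activated = right; // the user is working in the right window

        zoomIn(firstMatch(windows)); // zooms "left" although "right" has focus
        System.out.println(left.title + "=" + left.zoomPercent
                + " " + right.title + "=" + right.zoomPercent);

        zoomIn(activated);           // zooms the window that actually has focus
        System.out.println(left.title + "=" + left.zoomPercent
                + " " + right.title + "=" + right.zoomPercent);
    }
}
```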
------------- Commit messages: - Apply toolbar actions to the graph window that is actually active Changes: https://git.openjdk.org/jdk/pull/9169/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9169&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288480 Stats: 8 lines in 1 file changed: 0 ins; 7 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9169.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9169/head:pull/9169 PR: https://git.openjdk.org/jdk/pull/9169 From ngasson at openjdk.java.net Wed Jun 15 14:24:58 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Wed, 15 Jun 2022 14:24:58 GMT Subject: RFR: 8288300: AArch64: Remove the assertion in fmovs/fmovd(FloatRegister, FloatRegister) [v2] In-Reply-To: <7PKCXLBWJ484BiZD-tfXxoi-nfxFz_ZzfbwyJp7LsSc=.4f41f14f-0e5e-411a-b7bf-91d093fd10ea@github.com> References: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com> <7PKCXLBWJ484BiZD-tfXxoi-nfxFz_ZzfbwyJp7LsSc=.4f41f14f-0e5e-411a-b7bf-91d093fd10ea@github.com> Message-ID: On Wed, 15 Jun 2022 07:56:38 GMT, Hao Sun wrote: >> The assertion, i.e. src and dst must be different registers, was >> introduced years ago. But I don't think it's needed. >> >> This limitation was added in [1]. Frankly speaking, I don't know the >> reason. But I guess the assertion is probably used for debugging, >> raising one warning of fmovs/fmovd usage in the scenario of moving >> element at index zero from one **vector** register, to one float-point >> scalar register. If the "src" vector register and the "dst" float-point >> scalar register are the same one, it introduces a side-effect, i.e. the >> higher bits are cleared to zeros[2]. >> >> If so, I argue that >> 1) the assembler should align with the ISA. >> 2) compiler developers should be aware of the side-effect when they want >> to use fmovs/fmovd, and they should guarantee "dst != src" if they like >> to higher bits untouched, e.g., [3]. 
>> >> Hence, I think we can remove this unnecessary assertion. >> >> [1] http://hg.openjdk.java.net/aarch64-port/jdk8/hotspot/rev/9baee4e65ac5 >> [2] https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/FMOV--register---Floating-point-Move-register-without-conversion-?lang=en >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L4899 > > Hao Sun has updated the pull request incrementally with one additional commit since the last revision: > > Remove unnecessary whitespace > > As suggested by aph, remove the unnecessary whitespace and keep the > substantive changes only. Marked as reviewed by ngasson (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9163 From thartmann at openjdk.java.net Wed Jun 15 15:02:59 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 15 Jun 2022 15:02:59 GMT Subject: RFR: 8288421: IGV: warn user about all unreachable nodes In-Reply-To: <4ZHwSlqIsbDWFokTr-mioh1a-n4XIZWr7eK1vgVOMAA=.e97a46cd-68a3-4d71-904e-f9ca433f6046@github.com> References: <4ZHwSlqIsbDWFokTr-mioh1a-n4XIZWr7eK1vgVOMAA=.e97a46cd-68a3-4d71-904e-f9ca433f6046@github.com> Message-ID: On Wed, 15 Jun 2022 08:50:39 GMT, Roberto Casta?eda Lozano wrote: > This changeset ensures that, when approximating C2's schedule, IGV does not schedule unreachable nodes. Instead, a node warning is emitted, informing the user that the corresponding node is unreachable. This information can be useful when debugging ill-formed graphs. 
> > The following clustered subgraph illustrates the proposed change: > > ![before-after](https://user-images.githubusercontent.com/8792647/173784252-5fccb80b-7c36-49bf-8c52-eed502cc129c.png) > > Currently _(before)_, `522 IfFalse` gets assigned the same block as `256 Region` (`B11`) in an effort to schedule as many nodes as possible, and hence no warning is emitted for `522 IfFalse`, even though it is clearly control-unreachable (since it is a child of `520 If` which is control-unreachable). This changeset _(after)_ instead leaves `522 IfFalse` unscheduled and emits a "Control-unreachable CFG node" warning for it (visible as a tooltip of the node warning sign). > > As a side-benefit, the changeset simplifies the IGV scheduling algorithm by removing the code that tries to schedule unreachable nodes on a best-effort basis, and adds two additional node warnings ("Region with multiple successors" and "CFG node without control successors") to highlight the new nodes that might remain unscheduled as a consequence. > > #### Testing > > - Tested manually on the [graph](https://bugs.openjdk.org/secure/attachment/99555/graph.xml) attached to the JBS issue. > > - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not trigger any exception or assertion failure. Looks good. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9164 From aph at openjdk.java.net Wed Jun 15 15:49:00 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Wed, 15 Jun 2022 15:49:00 GMT Subject: RFR: 8288300: AArch64: Remove the assertion in fmovs/fmovd(FloatRegister, FloatRegister) [v2] In-Reply-To: <7PKCXLBWJ484BiZD-tfXxoi-nfxFz_ZzfbwyJp7LsSc=.4f41f14f-0e5e-411a-b7bf-91d093fd10ea@github.com> References: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com> <7PKCXLBWJ484BiZD-tfXxoi-nfxFz_ZzfbwyJp7LsSc=.4f41f14f-0e5e-411a-b7bf-91d093fd10ea@github.com> Message-ID: <8ieCxKguyyyQX3GBzGmMZFV6ONik_t71GeSQh16tdUg=.034444af-2c21-4e4c-8afc-fa8adc7f71c1@github.com> On Wed, 15 Jun 2022 07:56:38 GMT, Hao Sun wrote: >> The assertion, i.e. src and dst must be different registers, was >> introduced years ago. But I don't think it's needed. >> >> This limitation was added in [1]. Frankly speaking, I don't know the >> reason. But I guess the assertion is probably used for debugging, >> raising one warning of fmovs/fmovd usage in the scenario of moving >> element at index zero from one **vector** register, to one float-point >> scalar register. If the "src" vector register and the "dst" float-point >> scalar register are the same one, it introduces a side-effect, i.e. the >> higher bits are cleared to zeros[2]. >> >> If so, I argue that >> 1) the assembler should align with the ISA. >> 2) compiler developers should be aware of the side-effect when they want >> to use fmovs/fmovd, and they should guarantee "dst != src" if they like >> to higher bits untouched, e.g., [3]. >> >> Hence, I think we can remove this unnecessary assertion. 
>> >> [1] http://hg.openjdk.java.net/aarch64-port/jdk8/hotspot/rev/9baee4e65ac5 >> [2] https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/FMOV--register---Floating-point-Move-register-without-conversion-?lang=en >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L4899 > > Hao Sun has updated the pull request incrementally with one additional commit since the last revision: > > Remove unnecessary whitespace > > As suggested by aph, remove the unnecessary whitespace and keep the > substantive changes only. @shqking , please do the `integrate` and then my patch with the whitespace alignment fixes will follow. ------------- PR: https://git.openjdk.org/jdk/pull/9163 From sviswanathan at openjdk.java.net Wed Jun 15 18:55:15 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 15 Jun 2022 18:55:15 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v9] In-Reply-To: <7lCOZoReMvWJnID_7hsmiVFqy1Xt05x5hmSaoLykzV0=.0bb39ca4-a1d7-4856-bb75-ea92ec8f5ea0@github.com> References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <7lCOZoReMvWJnID_7hsmiVFqy1Xt05x5hmSaoLykzV0=.0bb39ca4-a1d7-4856-bb75-ea92ec8f5ea0@github.com> Message-ID: On Tue, 14 Jun 2022 01:49:34 GMT, Fei Gao wrote: >> After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. 
We can also support conversions between different data sizes like: >> int <-> double >> float <-> long >> int <-> long >> float <-> double >> >> A typical test case: >> >> int[] a; >> double[] b; >> for (int i = start; i < limit; i++) { >> b[i] = (double) a[i]; >> } >> >> Our expected OptoAssembly code for one iteration is like below: >> >> add R12, R2, R11, LShiftL #2 >> vector_load V16,[R12, #16] >> vectorcast_i2d V16, V16 # convert I to D vector >> add R11, R1, R11, LShiftL #3 # ptr >> add R13, R11, #16 # ptr >> vector_store [R13], V16 >> >> To enable the vectorization, the patch solves the following problems in the SLP. >> >> There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain >> and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with new type. >> >> After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. >> >> In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. 
Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 is just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's potential to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well. >> >> Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use(). >> >> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes. >> >> Here is the test data (-XX:+UseSuperWord) on NEON: >> >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 216.431 ± 0.131 ns/op >> convertD2I 523 avgt 15 220.522 ± 0.311 ns/op >> convertF2D 523 avgt 15 217.034 ± 0.292 ns/op >> convertF2L 523 avgt 15 231.634 ± 1.881 ns/op >> convertI2D 523 avgt 15 229.538 ± 0.095 ns/op >> convertI2L 523 avgt 15 214.822 ± 0.131 ns/op >> convertL2F 523 avgt 15 230.188 ± 0.217 ns/op >> convertL2I 523 avgt 15 162.234 ± 0.235 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 124.352 ± 1.079 ns/op >> convertD2I 523 avgt 15 557.388 ± 8.166 ns/op >> convertF2D 523 avgt 15 118.082 ± 4.026 ns/op >> convertF2L 523 avgt 15 225.810 ± 
11.180 ns/op >> convertI2D 523 avgt 15 166.247 ? 0.120 ns/op >> convertI2L 523 avgt 15 119.699 ? 2.925 ns/op >> convertL2F 523 avgt 15 220.847 ? 0.053 ns/op >> convertL2I 523 avgt 15 122.339 ? 2.738 ns/op >> >> perf data on X86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 279.466 ? 0.069 ns/op >> convertD2I 523 avgt 15 551.009 ? 7.459 ns/op >> convertF2D 523 avgt 15 276.066 ? 0.117 ns/op >> convertF2L 523 avgt 15 545.108 ? 5.697 ns/op >> convertI2D 523 avgt 15 745.303 ? 0.185 ns/op >> convertI2L 523 avgt 15 260.878 ? 0.044 ns/op >> convertL2F 523 avgt 15 502.016 ? 0.172 ns/op >> convertL2I 523 avgt 15 261.654 ? 3.326 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 106.975 ? 0.045 ns/op >> convertD2I 523 avgt 15 546.866 ? 9.287 ns/op >> convertF2D 523 avgt 15 82.414 ? 0.340 ns/op >> convertF2L 523 avgt 15 542.235 ? 2.785 ns/op >> convertI2D 523 avgt 15 92.966 ? 1.400 ns/op >> convertI2L 523 avgt 15 79.960 ? 0.528 ns/op >> convertL2F 523 avgt 15 504.712 ? 4.794 ns/op >> convertL2I 523 avgt 15 129.753 ? 0.094 ns/op >> >> perf data on AVX512: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 282.984 ? 4.022 ns/op >> convertD2I 523 avgt 15 543.080 ? 3.873 ns/op >> convertF2D 523 avgt 15 273.950 ? 0.131 ns/op >> convertF2L 523 avgt 15 539.568 ? 2.747 ns/op >> convertI2D 523 avgt 15 745.238 ? 0.069 ns/op >> convertI2L 523 avgt 15 260.935 ? 0.169 ns/op >> convertL2F 523 avgt 15 501.870 ? 0.359 ns/op >> convertL2I 523 avgt 15 257.508 ? 0.174 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 76.687 ? 0.530 ns/op >> convertD2I 523 avgt 15 545.408 ? 4.657 ns/op >> convertF2D 523 avgt 15 273.935 ? 0.099 ns/op >> convertF2L 523 avgt 15 540.534 ? 3.032 ns/op >> convertI2D 523 avgt 15 745.234 ? 0.053 ns/op >> convertI2L 523 avgt 15 260.865 ? 
0.104 ns/op >> convertL2F 523 avgt 15 63.834 ? 4.777 ns/op >> convertL2I 523 avgt 15 48.183 ? 0.990 ns/op > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: > > - Add an IR framework testcase > > Change-Id: Ifbcc8d233aa27dfe93acef548c7e42721d86376e > - Merge branch 'master' into fg8283091 > > Change-Id: I9525ae9310c3c493da29490d034cbb8f223e7f80 > - Update to the latest JDK and fix the function name > > Change-Id: Ie1907f86e2df7051aa2ddb7e5b05a371e887d1bc > - Merge branch 'master' into fg8283091 > > Change-Id: I3ef746178c07004cc34c22081a3044fb40e87702 > - Add assertion line for opcode() and withdraw some common code as a function > > Change-Id: I7b5dbe60fec6979de454f347d074e6fc01126dfe > - Merge branch 'master' into fg8283091 > > Change-Id: I42bec08da55e86fb1f049bb691138f3fcf6dbed6 > - Implement an interface for auto-vectorization to consult supported match rules > > Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701 > - Merge branch 'master' into fg8283091 > > Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd > - Merge branch 'master' into fg8283091 > > Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f > - Merge branch 'master' into fg8283091 > > Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 > - ... and 3 more: https://git.openjdk.org/jdk/compare/f1143b1b...49e6f56e Marked as reviewed by sviswanathan (Reviewer). Looks good to me. 
------------- PR: https://git.openjdk.org/jdk/pull/7806 From haosun at openjdk.java.net Wed Jun 15 22:03:10 2022 From: haosun at openjdk.java.net (Hao Sun) Date: Wed, 15 Jun 2022 22:03:10 GMT Subject: RFR: 8288300: AArch64: Remove the assertion in fmovs/fmovd(FloatRegister, FloatRegister) [v2] In-Reply-To: <7PKCXLBWJ484BiZD-tfXxoi-nfxFz_ZzfbwyJp7LsSc=.4f41f14f-0e5e-411a-b7bf-91d093fd10ea@github.com> References: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com> <7PKCXLBWJ484BiZD-tfXxoi-nfxFz_ZzfbwyJp7LsSc=.4f41f14f-0e5e-411a-b7bf-91d093fd10ea@github.com> Message-ID: On Wed, 15 Jun 2022 07:56:38 GMT, Hao Sun wrote: >> The assertion, i.e. src and dst must be different registers, was >> introduced years ago. But I don't think it's needed. >> >> This limitation was added in [1]. Frankly speaking, I don't know the >> reason. But I guess the assertion is probably used for debugging, >> raising one warning of fmovs/fmovd usage in the scenario of moving >> element at index zero from one **vector** register, to one float-point >> scalar register. If the "src" vector register and the "dst" float-point >> scalar register are the same one, it introduces a side-effect, i.e. the >> higher bits are cleared to zeros[2]. >> >> If so, I argue that >> 1) the assembler should align with the ISA. >> 2) compiler developers should be aware of the side-effect when they want >> to use fmovs/fmovd, and they should guarantee "dst != src" if they like >> to higher bits untouched, e.g., [3]. >> >> Hence, I think we can remove this unnecessary assertion. 
>> >> [1] http://hg.openjdk.java.net/aarch64-port/jdk8/hotspot/rev/9baee4e65ac5 >> [2] https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/FMOV--register---Floating-point-Move-register-without-conversion-?lang=en >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L4899 > > Hao Sun has updated the pull request incrementally with one additional commit since the last revision: > > Remove unnecessary whitespace > > As suggested by aph, remove the unnecessary whitespace and keep the > substantive changes only. Thanks for your review! ------------- PR: https://git.openjdk.org/jdk/pull/9163 From haosun at openjdk.java.net Thu Jun 16 00:57:27 2022 From: haosun at openjdk.java.net (Hao Sun) Date: Thu, 16 Jun 2022 00:57:27 GMT Subject: Integrated: 8288300: AArch64: Remove the assertion in fmovs/fmovd(FloatRegister, FloatRegister) In-Reply-To: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com> References: <1R813tq7pgYPfq4WvuPS1VObl5tiVrOKLzSgmjjGL9A=.77aabb09-29ea-48f5-bb81-a5b55a428d5e@github.com> Message-ID: On Wed, 15 Jun 2022 04:07:47 GMT, Hao Sun wrote: > The assertion, i.e. src and dst must be different registers, was > introduced years ago. But I don't think it's needed. > > This limitation was added in [1]. Frankly speaking, I don't know the > reason. But I guess the assertion is probably used for debugging, > raising one warning of fmovs/fmovd usage in the scenario of moving > element at index zero from one **vector** register, to one float-point > scalar register. If the "src" vector register and the "dst" float-point > scalar register are the same one, it introduces a side-effect, i.e. the > higher bits are cleared to zeros[2]. > > If so, I argue that > 1) the assembler should align with the ISA. 
> 2) compiler developers should be aware of the side-effect when they want > to use fmovs/fmovd, and they should guarantee "dst != src" if they like > to higher bits untouched, e.g., [3]. > > Hence, I think we can remove this unnecessary assertion. > > [1] http://hg.openjdk.java.net/aarch64-port/jdk8/hotspot/rev/9baee4e65ac5 > [2] https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/FMOV--register---Floating-point-Move-register-without-conversion-?lang=en > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L4899 This pull request has now been integrated. Changeset: f7ba3b7e Author: Hao Sun Committer: Ningsheng Jian URL: https://git.openjdk.org/jdk/commit/f7ba3b7e422c0a4b899b7aa11d0f903e6c1614a9 Stats: 16 lines in 1 file changed: 0 ins; 14 del; 2 mod 8288300: AArch64: Remove the assertion in fmovs/fmovd(FloatRegister, FloatRegister) Reviewed-by: aph, ngasson ------------- PR: https://git.openjdk.org/jdk/pull/9163 From fgao at openjdk.java.net Thu Jun 16 01:31:14 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 16 Jun 2022 01:31:14 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v9] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <7lCOZoReMvWJnID_7hsmiVFqy1Xt05x5hmSaoLykzV0=.0bb39ca4-a1d7-4856-bb75-ea92ec8f5ea0@github.com> Message-ID: On Wed, 15 Jun 2022 18:51:14 GMT, Sandhya Viswanathan wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 13 commits: >> >> - Add an IR framework testcase >> >> Change-Id: Ifbcc8d233aa27dfe93acef548c7e42721d86376e >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I9525ae9310c3c493da29490d034cbb8f223e7f80 >> - Update to the latest JDK and fix the function name >> >> Change-Id: Ie1907f86e2df7051aa2ddb7e5b05a371e887d1bc >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I3ef746178c07004cc34c22081a3044fb40e87702 >> - Add assertion line for opcode() and withdraw some common code as a function >> >> Change-Id: I7b5dbe60fec6979de454f347d074e6fc01126dfe >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I42bec08da55e86fb1f049bb691138f3fcf6dbed6 >> - Implement an interface for auto-vectorization to consult supported match rules >> >> Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701 >> - Merge branch 'master' into fg8283091 >> >> Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 >> - ... and 3 more: https://git.openjdk.org/jdk/compare/f1143b1b...49e6f56e > > Looks good to me. Thanks for your review @sviswa7 @vnkozlov . Can I integrate it now? ------------- PR: https://git.openjdk.org/jdk/pull/7806 From kvn at openjdk.java.net Thu Jun 16 02:09:10 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 16 Jun 2022 02:09:10 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v9] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <7lCOZoReMvWJnID_7hsmiVFqy1Xt05x5hmSaoLykzV0=.0bb39ca4-a1d7-4856-bb75-ea92ec8f5ea0@github.com> Message-ID: On Wed, 15 Jun 2022 18:51:14 GMT, Sandhya Viswanathan wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 13 commits: >> >> - Add an IR framework testcase >> >> Change-Id: Ifbcc8d233aa27dfe93acef548c7e42721d86376e >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I9525ae9310c3c493da29490d034cbb8f223e7f80 >> - Update to the latest JDK and fix the function name >> >> Change-Id: Ie1907f86e2df7051aa2ddb7e5b05a371e887d1bc >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I3ef746178c07004cc34c22081a3044fb40e87702 >> - Add assertion line for opcode() and withdraw some common code as a function >> >> Change-Id: I7b5dbe60fec6979de454f347d074e6fc01126dfe >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I42bec08da55e86fb1f049bb691138f3fcf6dbed6 >> - Implement an interface for auto-vectorization to consult supported match rules >> >> Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701 >> - Merge branch 'master' into fg8283091 >> >> Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f >> - Merge branch 'master' into fg8283091 >> >> Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 >> - ... and 3 more: https://git.openjdk.org/jdk/compare/f1143b1b...49e6f56e > > Looks good to me. > Thanks for your review @sviswa7 @vnkozlov . > > Can I integrate it now? Yes ------------- PR: https://git.openjdk.org/jdk/pull/7806 From fgao at openjdk.java.net Thu Jun 16 02:44:16 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 16 Jun 2022 02:44:16 GMT Subject: Integrated: 8283091: Support type conversion between different data sizes in SLP In-Reply-To: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: On Mon, 14 Mar 2022 08:18:25 GMT, Fei Gao wrote: > After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. 
We can also support conversions between different data sizes like: > int <-> double > float <-> long > int <-> long > float <-> double > > A typical test case: > > int[] a; > double[] b; > for (int i = start; i < limit; i++) { > b[i] = (double) a[i]; > } > > Our expected OptoAssembly code for one iteration is like below: > > add R12, R2, R11, LShiftL #2 > vector_load V16,[R12, #16] > vectorcast_i2d V16, V16 # convert I to D vector > add R11, R1, R11, LShiftL #3 # ptr > add R13, R11, #16 # ptr > vector_store [R13], V16 > > To enable the vectorization, the patch solves the following problems in the SLP. > > There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain > and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with new type. > > After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. > > In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. 
Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 are just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's potential to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well.
>
> Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use().
>
> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.
>
> Here is the test data (-XX:+UseSuperWord) on NEON:
>
> Before the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  216.431 ±  0.131  ns/op
> convertD2I       523  avgt   15  220.522 ±  0.311  ns/op
> convertF2D       523  avgt   15  217.034 ±  0.292  ns/op
> convertF2L       523  avgt   15  231.634 ±  1.881  ns/op
> convertI2D       523  avgt   15  229.538 ±  0.095  ns/op
> convertI2L       523  avgt   15  214.822 ±  0.131  ns/op
> convertL2F       523  avgt   15  230.188 ±  0.217  ns/op
> convertL2I       523  avgt   15  162.234 ±  0.235  ns/op
>
> After the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  124.352 ±  1.079  ns/op
> convertD2I       523  avgt   15  557.388 ±  8.166  ns/op
> convertF2D       523  avgt   15  118.082 ±  4.026  ns/op
> convertF2L       523  avgt   15  225.810 ± 11.180  ns/op
> convertI2D       523  avgt   15  166.247 ±  0.120  ns/op
> convertI2L       523  avgt   15  119.699 ±  2.925  ns/op
> convertL2F       523  avgt   15  220.847 ±  0.053  ns/op
> convertL2I       523  avgt   15  122.339 ±  2.738  ns/op
>
> perf data on X86:
> Before the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  279.466 ±  0.069  ns/op
> convertD2I       523  avgt   15  551.009 ±  7.459  ns/op
> convertF2D       523  avgt   15  276.066 ±  0.117  ns/op
> convertF2L       523  avgt   15  545.108 ±  5.697  ns/op
> convertI2D       523  avgt   15  745.303 ±  0.185  ns/op
> convertI2L       523  avgt   15  260.878 ±  0.044  ns/op
> convertL2F       523  avgt   15  502.016 ±  0.172  ns/op
> convertL2I       523  avgt   15  261.654 ±  3.326  ns/op
>
> After the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  106.975 ±  0.045  ns/op
> convertD2I       523  avgt   15  546.866 ±  9.287  ns/op
> convertF2D       523  avgt   15   82.414 ±  0.340  ns/op
> convertF2L       523  avgt   15  542.235 ±  2.785  ns/op
> convertI2D       523  avgt   15   92.966 ±  1.400  ns/op
> convertI2L       523  avgt   15   79.960 ±  0.528  ns/op
> convertL2F       523  avgt   15  504.712 ±  4.794  ns/op
> convertL2I       523  avgt   15  129.753 ±  0.094  ns/op
>
> perf data on AVX512:
> Before the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  282.984 ±  4.022  ns/op
> convertD2I       523  avgt   15  543.080 ±  3.873  ns/op
> convertF2D       523  avgt   15  273.950 ±  0.131  ns/op
> convertF2L       523  avgt   15  539.568 ±  2.747  ns/op
> convertI2D       523  avgt   15  745.238 ±  0.069  ns/op
> convertI2L       523  avgt   15  260.935 ±  0.169  ns/op
> convertL2F       523  avgt   15  501.870 ±  0.359  ns/op
> convertL2I       523  avgt   15  257.508 ±  0.174  ns/op
>
> After the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15   76.687 ±  0.530  ns/op
> convertD2I       523  avgt   15  545.408 ±  4.657  ns/op
> convertF2D       523  avgt   15  273.935 ±  0.099  ns/op
> convertF2L       523  avgt   15  540.534 ±  3.032  ns/op
> convertI2D       523  avgt   15  745.234 ±  0.053  ns/op
> convertI2L       523  avgt   15  260.865 ±  0.104  ns/op
> convertL2F       523  avgt   15   63.834 ±  4.777  ns/op
> convertL2I       523  avgt   15   48.183 ±  0.990  ns/op

This pull request has now been integrated. Changeset: a1795901 Author: Fei Gao Committer: Pengfei Li URL: https://git.openjdk.org/jdk/commit/a1795901ee292fa6272768cef2fedcaaf8044074 Stats: 1379 lines in 23 files changed: 1320 ins; 13 del; 46 mod 8283091: Support type conversion between different data sizes in SLP Reviewed-by: kvn, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/7806 From sviswanathan at openjdk.java.net Thu Jun 16 02:44:37 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 16 Jun 2022 02:44:37 GMT Subject: RFR: 8288281: compiler/vectorapi/VectorFPtoIntCastTest.java failed with "IRViolationException: There were one or multiple IR rule failures." Message-ID: The IR Framework test was failing due to incorrect node name. Corrected the IR node name check. Please review. Best Regards, Sandhya ------------- Commit messages: - 8288281: compiler/vectorapi/VectorFPtoIntCastTest.java failed with "IRViolationException: There were one or multiple IR rule failures." Changes: https://git.openjdk.org/jdk/pull/9177/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9177&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288281 Stats: 8 lines in 1 file changed: 0 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/9177.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9177/head:pull/9177 PR: https://git.openjdk.org/jdk/pull/9177 From thartmann at openjdk.java.net Thu Jun 16 05:11:05 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 16 Jun 2022 05:11:05 GMT Subject: RFR: 8288281: compiler/vectorapi/VectorFPtoIntCastTest.java failed with "IRViolationException: There were one or multiple IR rule failures." In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 02:34:33 GMT, Sandhya Viswanathan wrote: > The IR Framework test was failing due to incorrect node name. > Corrected the IR node name check. > > Please review.
> > Best Regards, > Sandhya Looks good and trivial. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9177 From njian at openjdk.java.net Thu Jun 16 06:03:03 2022 From: njian at openjdk.java.net (Ningsheng Jian) Date: Thu, 16 Jun 2022 06:03:03 GMT Subject: RFR: 8288397: AArch64: Fix register issues in SVE backend match rules In-Reply-To: <0vPwXBEnXX_w1358C7v4JCBZ_4uIGxokASDSkghGQS0=.01fa04ba-1213-4105-9734-efea6ff6293e@github.com> References: <0vPwXBEnXX_w1358C7v4JCBZ_4uIGxokASDSkghGQS0=.01fa04ba-1213-4105-9734-efea6ff6293e@github.com> Message-ID: On Wed, 15 Jun 2022 09:40:52 GMT, Xiaohong Gong wrote: > There are register usage issues in the sve backend match rules, which made the two added jtreg tests fail. > > The predicated vector "`not`" rules didn't use the same register for "`src`" and "`dst`", which is necessary to make sure the inactive lanes in "`dst`" save the same elements as "`src`". This patch fixes the rules by using the same register for "`dst`" and "`src`". > > And the input idx register in "`gatherL/scatterL`" rules was overwritten by the first unpack instruction. The same issue also existed in the partial and predicated gatherL/scatterL rules. This patch fixes them by saving the unpack results into a temp register and use it as the index for gather/scatter. Thanks for the fix! ------------- Marked as reviewed by njian (Committer). PR: https://git.openjdk.org/jdk19/pull/17 From xliu at openjdk.java.net Thu Jun 16 06:46:02 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 16 Jun 2022 06:46:02 GMT Subject: RFR: 8263384: IGV: Outline should highlight the Graph that has focus In-Reply-To: References: Message-ID: On Wed, 15 Jun 2022 12:47:54 GMT, Roberto Casta?eda Lozano wrote: > This changeset eases navigation within and across graph groups by highlighting the focused graph in the Outline window. 
If the user changes the focus to another graph window, or moves to the previous or next graph within the same window, the newly focused graph is automatically highlighted in the Outline window. This is implemented by maintaining a static map from opened graphs to their corresponding [NetBeans nodes](https://netbeans.apache.org/tutorials/nbm-selection-2.html). The Outline window uses the map to select, on a graph focus change, the NetBeans node of the newly focused graph that should be highlighted. > > Tested manually by opening simultaneously tens of graphs from different groups and switching the focus randomly. src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/OutlineTopComponent.java line 221: > 219: public void resultChanged(LookupEvent lookupEvent) { > 220: // Highlight the focused graph, if available, in the outline. > 221: if (result.allItems().isEmpty()) { It looks like you are confident that result is not NULL when resultChanged() is called. I believe framework calls it after `componentOpened()`. if so, is it possible to remove result = null in `componentClosed()`? src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/OutlineTopComponent.java line 221: > 219: public void resultChanged(LookupEvent lookupEvent) { > 220: // Highlight the focused graph, if available, in the outline. > 221: if (result.allItems().isEmpty()) { It looks like you are confident that result is not NULL when resultChanged() is called. I believe framework calls it after `componentOpened()`. if so, is it possible to remove result = null in `componentClosed()`? 
------------- PR: https://git.openjdk.org/jdk/pull/9167 From rcastanedalo at openjdk.java.net Thu Jun 16 07:04:07 2022 From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano) Date: Thu, 16 Jun 2022 07:04:07 GMT Subject: RFR: 8288421: IGV: warn user about all unreachable nodes In-Reply-To: References: <4ZHwSlqIsbDWFokTr-mioh1a-n4XIZWr7eK1vgVOMAA=.e97a46cd-68a3-4d71-904e-f9ca433f6046@github.com> Message-ID: On Wed, 15 Jun 2022 09:35:04 GMT, Roberto Castañeda Lozano wrote: > Looks good. Thanks for reviewing, Tobias! ------------- PR: https://git.openjdk.org/jdk/pull/9164 From xliu at openjdk.java.net Thu Jun 16 07:07:10 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 16 Jun 2022 07:07:10 GMT Subject: RFR: 8263384: IGV: Outline should highlight the Graph that has focus In-Reply-To: References: Message-ID: On Wed, 15 Jun 2022 12:47:54 GMT, Roberto Castañeda Lozano wrote: > This changeset eases navigation within and across graph groups by highlighting the focused graph in the Outline window. If the user changes the focus to another graph window, or moves to the previous or next graph within the same window, the newly focused graph is automatically highlighted in the Outline window. This is implemented by maintaining a static map from opened graphs to their corresponding [NetBeans nodes](https://netbeans.apache.org/tutorials/nbm-selection-2.html). The Outline window uses the map to select, on a graph focus change, the NetBeans node of the newly focused graph that should be highlighted. > > Tested manually by opening simultaneously tens of graphs from different groups and switching the focus randomly. src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/OutlineTopComponent.java line 229: > 227: } > 228: try { > 229: manager.setSelectedNodes(new GraphNode[]{FolderNode.getGraphNode(p.getGraph())}); Do we need to consider that FolderNode.getGraphNode() returns null?
`setSelectedNodes` will throw `IllegalArgumentException` if input is null? ------------- PR: https://git.openjdk.org/jdk/pull/9167 From xliu at openjdk.java.net Thu Jun 16 07:12:26 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 16 Jun 2022 07:12:26 GMT Subject: RFR: 8263384: IGV: Outline should highlight the Graph that has focus In-Reply-To: References: Message-ID: On Wed, 15 Jun 2022 12:47:54 GMT, Roberto Castañeda Lozano wrote: > This changeset eases navigation within and across graph groups by highlighting the focused graph in the Outline window. If the user changes the focus to another graph window, or moves to the previous or next graph within the same window, the newly focused graph is automatically highlighted in the Outline window. This is implemented by maintaining a static map from opened graphs to their corresponding [NetBeans nodes](https://netbeans.apache.org/tutorials/nbm-selection-2.html). The Outline window uses the map to select, on a graph focus change, the NetBeans node of the newly focused graph that should be highlighted. > > Tested manually by opening simultaneously tens of graphs from different groups and switching the focus randomly. hi, @robcasloz, This patch works perfectly on MacOS. --lx src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/FolderNode.java line 80: > 78: for (Node n : nodes) { > 79: // Each node is only present once in the graphNode map. > 80: graphNode.values().remove(n); Is `destroyNodes()` thread-safe here?
graphNode is a HashMap instead of ConcurrentHashMap. ------------- PR: https://git.openjdk.org/jdk/pull/9167 From chagedorn at openjdk.java.net Thu Jun 16 07:16:10 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 16 Jun 2022 07:16:10 GMT Subject: RFR: 8288281: compiler/vectorapi/VectorFPtoIntCastTest.java failed with "IRViolationException: There were one or multiple IR rule failures." In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 02:34:33 GMT, Sandhya Viswanathan wrote: > The IR Framework test was failing due to incorrect node name. > Corrected the IR node name check. > > Please review. > > Best Regards, > Sandhya Otherwise, looks good! test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java line 87: > 85: > 86: @Test > 87: @IR(counts = {"VectorCastF2X", "> 0"}) You can directly use `IRNode.VECTOR_CAST_F2X` (and `IRNode.VECTOR_CAST_D2X` below) which makes it clearer that this is an actual IR node and not a custom string. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9177 From xliu at openjdk.java.net Thu Jun 16 07:16:12 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 16 Jun 2022 07:16:12 GMT Subject: RFR: 8263384: IGV: Outline should highlight the Graph that has focus In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 07:07:24 GMT, Xin Liu wrote: >> This changeset eases navigation within and across graph groups by highlighting the focused graph in the Outline window.
If the user changes the focus to another graph window, or moves to the previous or next graph within the same window, the newly focused graph is automatically highlighted in the Outline window. This is implemented by maintaining a static map from opened graphs to their corresponding [NetBeans nodes](https://netbeans.apache.org/tutorials/nbm-selection-2.html). The Outline window uses the map to select, on a graph focus change, the NetBeans node of the newly focused graph that should be highlighted. >> >> Tested manually by opening simultaneously tens of graphs from different groups and switching the focus randomly. > > src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/FolderNode.java line 80: > >> 78: for (Node n : nodes) { >> 79: // Each node is only present once in the graphNode map. >> 80: graphNode.values().remove(n); > > Is `destroyNodes()` thread-safe here? graphNode is a HashMap instead of ConcurrentHashMap. This method is called by the inner class of RemoveCookie(). I am okay if the framework guarantees to execute RemoveCookie sequentially. ------------- PR: https://git.openjdk.org/jdk/pull/9167 From rcastanedalo at openjdk.org Thu Jun 16 10:22:33 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 16 Jun 2022 10:22:33 GMT Subject: RFR: 8263384: IGV: Outline should highlight the Graph that has focus [v2] In-Reply-To: References: Message-ID: > This changeset eases navigation within and across graph groups by highlighting the focused graph in the Outline window. If the user changes the focus to another graph window, or moves to the previous or next graph within the same window, the newly focused graph is automatically highlighted in the Outline window.
This is implemented by maintaining a static map from opened graphs to their corresponding [NetBeans nodes](https://netbeans.apache.org/tutorials/nbm-selection-2.html). The Outline window uses the map to select, on a graph focus change, the NetBeans node of the newly focused graph that should be highlighted. > > Tested manually by opening simultaneously tens of graphs from different groups and switching the focus randomly. Roberto Castañeda Lozano has updated the pull request incrementally with three additional commits since the last revision: - Highlight active graph when the Outline window is re-opened - Avoid unnecessary setting of 'result' to null - Wait for last graph update before highlighting it ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9167/files - new: https://git.openjdk.org/jdk/pull/9167/files/e4dd94c3..707827bf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9167&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9167&range=00-01 Stats: 25 lines in 1 file changed: 15 ins; 7 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/9167.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9167/head:pull/9167 PR: https://git.openjdk.org/jdk/pull/9167 From rcastanedalo at openjdk.org Thu Jun 16 10:22:35 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 16 Jun 2022 10:22:35 GMT Subject: RFR: 8263384: IGV: Outline should highlight the Graph that has focus [v2] In-Reply-To: References: Message-ID: <71b31GN7AtTRG2l6PfEF31xBJG3b-B1R-rf730oCKVk=.fc79a21b-851e-4c01-9419-8901814dcca2@github.com> On Thu, 16 Jun 2022 07:13:49 GMT, Xin Liu wrote: >> src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/FolderNode.java line 80: >> >>> 78: for (Node n : nodes) { >>> 79: // Each node is only present once in
the graphNode map. >>> 80: graphNode.values().remove(n); >> >> Is `destroyNodes()` thread-safe here? graphNode is a HashMap instead of ConcurrentHashMap. > > This method is called by the inner class of RemoveCookie(). I am okay if the framework guarantees that RemoveCookie is executed sequentially. This is guaranteed since `remove()` is called from the event dispatch thread. ------------- PR: https://git.openjdk.org/jdk/pull/9167 From rcastanedalo at openjdk.org Thu Jun 16 10:22:38 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 16 Jun 2022 10:22:38 GMT Subject: RFR: 8263384: IGV: Outline should highlight the Graph that has focus [v2] In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 06:42:24 GMT, Xin Liu wrote: >> Roberto Castañeda Lozano has updated the pull request incrementally with three additional commits since the last revision: >> >> - Highlight active graph when the Outline window is re-opened >> - Avoid unnecessary setting of 'result' to null >> - Wait for last graph update before highlighting it > > src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/OutlineTopComponent.java line 221: > >> 219: public void resultChanged(LookupEvent lookupEvent) { >> 220: // Highlight the focused graph, if available, in the outline. >> 221: if (result.allItems().isEmpty()) { > > It looks like you are confident that result is not NULL when resultChanged() is called. I believe the framework calls it after `componentOpened()`. If so, is it possible to remove `result = null` in `componentClosed()`? Right, see e.g. the code listed in https://netbeans.apache.org/tutorials/nbm-selection-1.html#_creating_a_context_sensitive_topcomponent which follows the same pattern.
`result = null` is just added for consistency with other TopComponent classes in the project, e.g. https://github.com/openjdk/jdk/blob/b2a58bec4a4f70a06b23013cc4c351b36a413521/src/utils/IdealGraphVisualizer/ControlFlow/src/main/java/com/sun/hotspot/igv/controlflow/ControlFlowTopComponent.java#L134 But you are right, it does not seem necessary and I just removed it. > src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/OutlineTopComponent.java line 229: > >> 227: } >> 228: try { >> 229: manager.setSelectedNodes(new GraphNode[]{FolderNode.getGraphNode(p.getGraph())}); > > Do we need to consider that FolderNode.getGraphNode() returns null? `setSelectedNodes` will throw `IllegalArgumentException` if input is null? `FolderNode.getGraphNode()` should never return null, so if that happens I think the best option is to warn the user and log the `IllegalArgumentException` exception thrown by `setSelectedNodes()`, which is what `Exceptions.printStackTrace()` does. See https://bits.netbeans.org/12.6/javadoc/org-openide-explorer/org/openide/explorer/ExplorerManager.html#setSelectedNodes-org.openide.nodes.Node:A- and https://bits.netbeans.org/12.6/javadoc/org-openide-util/org/openide/util/Exceptions.html#printStackTrace-java.lang.Throwable-. ------------- PR: https://git.openjdk.org/jdk/pull/9167 From rcastanedalo at openjdk.org Thu Jun 16 10:24:13 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 16 Jun 2022 10:24:13 GMT Subject: RFR: 8263384: IGV: Outline should highlight the Graph that has focus In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 07:09:29 GMT, Xin Liu wrote: > hi, @robcasloz, This patch works perfectly on MacOS. --lx Great, thanks for testing Xin! I just addressed your comments.
Additionally, I realized the changeset did not correctly handle the case where the Outline window is closed and then opened again. This is also fixed in the updated version. ------------- PR: https://git.openjdk.org/jdk/pull/9167 From jbhateja at openjdk.org Thu Jun 16 12:26:25 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 16 Jun 2022 12:26:25 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3] In-Reply-To: <3NOxt5WPlQaJWxoOpxdoYGHOjhUTAd00BCRsWe_oZWk=.c3cf718d-d230-49e4-91ed-a945b349c585@github.com> References: <3NOxt5WPlQaJWxoOpxdoYGHOjhUTAd00BCRsWe_oZWk=.c3cf718d-d230-49e4-91ed-a945b349c585@github.com> Message-ID: On Mon, 13 Jun 2022 18:09:48 GMT, Vladimir Kozlov wrote: >> @jatin-bhateja, could you please help check the impact of removing this? I would appreciate your feedback. Thanks so much! > > It was added specifically for #302 > It is used for ArrayCopyPartialInlineSize code [macroArrayCopy.cpp#L250](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/macroArrayCopy.cpp#L250) > Yes, it is strange that it is forced to be in register always. The instruction in x86.ad should be enough to put it into register. > @jatin-bhateja, could you please help check the impact of removing this? I would appreciate your feedback. Thanks so much! I think this can be modified; GVN-based sharing should be sufficient here.
------------- PR: https://git.openjdk.org/jdk/pull/9037 From jbhateja at openjdk.org Thu Jun 16 12:26:30 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 16 Jun 2022 12:26:30 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3] In-Reply-To: References: Message-ID: On Tue, 14 Jun 2022 08:59:38 GMT, Xiaohong Gong wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. >> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. 
>> >> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. >> >> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: >> >> Benchmark size Gain >> Byte256Vector.ADDLanes 1024 0.999 >> Byte256Vector.ANDLanes 1024 1.065 >> Byte256Vector.MAXLanes 1024 1.064 >> Byte256Vector.MINLanes 1024 1.062 >> Byte256Vector.ORLanes 1024 1.072 >> Byte256Vector.XORLanes 1024 1.041 >> Short256Vector.ADDLanes 1024 1.017 >> Short256Vector.ANDLanes 1024 1.044 >> Short256Vector.MAXLanes 1024 1.049 >> Short256Vector.MINLanes 1024 1.049 >> Short256Vector.ORLanes 1024 1.089 >> Short256Vector.XORLanes 1024 1.047 >> Int256Vector.ADDLanes 1024 1.045 >> Int256Vector.ANDLanes 1024 1.078 >> Int256Vector.MAXLanes 1024 1.123 >> Int256Vector.MINLanes 1024 1.129 >> Int256Vector.ORLanes 1024 1.078 >> Int256Vector.XORLanes 1024 1.072 >> Long256Vector.ADDLanes 1024 1.059 >> Long256Vector.ANDLanes 1024 1.101 >> Long256Vector.MAXLanes 1024 1.079 >> Long256Vector.MINLanes 1024 1.099 >> Long256Vector.ORLanes 1024 1.098 >> Long256Vector.XORLanes 1024 1.110 >> Float256Vector.ADDLanes 1024 1.033 >> Float256Vector.MAXLanes 1024 1.156 >> Float256Vector.MINLanes 1024 1.151 >> Double256Vector.ADDLanes 1024 1.062 >> Double256Vector.MAXLanes 1024 1.145 >> Double256Vector.MINLanes 1024 1.140 >> >> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: >> >> sxtw x14, w14 >> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains four commits: > > - Address review comments, revert changes for gatherL/scatterL rules > - Merge branch 'jdk:master' into JDK-8286941 > - Revert transformation from MaskAll to VectorMaskGen, address review comments > - 8286941: Add mask IR for partial vector operations for ARM SVE src/hotspot/share/opto/vectornode.cpp line 864: > 862: // Generate a vector mask for vector operation whose vector length is lower than the > 863: // hardware supported max vector length. > 864: if (vt->length_in_bytes() < MaxVectorSize) { For completeness, length comparison check can be done against MIN(SuperWordMaxVectorSize, MaxVectorSize). Even though SuperWordMaxVector differs from MaxVectorSize only for certain X86 targets and this control flow is only executed for AARCH64 SVE targets currently. src/hotspot/share/opto/vectornode.cpp line 1013: > 1011: } > 1012: } > 1013: return LoadVectorNode::Ideal(phase, can_reshape); These predicated nodes are concrete ones with fixed species and carry user specified mask, I am not clear why do we need a mask re-computation for predicated nodes. Higher lanes of predicated operand should already be zero and mask attached to predicated node should be correct by construction, since mask lane count is always equal to vector lane count. src/hotspot/share/opto/vectornode.cpp line 1033: > 1031: } > 1032: } > 1033: return StoreVectorNode::Ideal(phase, can_reshape); Same as above. src/hotspot/share/opto/vectornode.cpp line 1669: > 1667: if (Matcher::vector_needs_partial_operations(this, vt)) { > 1668: return VectorNode::try_to_gen_masked_vector(phase, this, vt); > 1669: } This is a parent node of TrueCount/FirstTrue/LastTrue and MaskToLong which perform mask querying operation on concrete predicate operands, a transformation here looks redundant to me. 
------------- PR: https://git.openjdk.org/jdk/pull/9037 From jbhateja at openjdk.org Thu Jun 16 12:26:32 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 16 Jun 2022 12:26:32 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3] In-Reply-To: References: Message-ID: <5F-FZDy-oVnAivutDnQYQJueeYaZx9btodm6EZ7lUjc=.37be4f5d-65eb-479a-afa6-289b6b3b1e48@github.com> On Mon, 13 Jun 2022 01:47:52 GMT, Xiaohong Gong wrote: >>>> And I don't see in(2)->Opcode() == Op_VectorMaskGen check. >> >>>Yes, the Op_VectorMaskGen is not generated for MaskAll when its input is a constant. We directly transform the MaskAll to VectorMaskGen here, since they two have the same meanings. Thanks! >> >> I'm sorry that my comment in line-1819 is not right which misunderstood you. I will change this later. Thanks! > > I prefer to not transform `MaskAll` to `VectorMaskGen` now, since there are the match rules using `MaskAll m1` both in sve and avx-512. Doing the transformation may influence those rules. > I think changes in #8877 influences the max vector length in superword? And since `MaskAll` is used for VectorAPI, the `MaxVectorSize` is still the right reference? @jatin-bhateja, could you please help to check whether this has any influence on x86 avx-512 system? Thanks so much! All these transforms are guarded and currently enabled only for AARCH64, do not think it will impact AVX512. > > > And I don't see in(2)->Opcode() == Op_VectorMaskGen check. > > Yes, the `Op_VectorMaskGen` is not generated for `MaskAll` when its input is a constant. We directly transform the `MaskAll` to `VectorMaskGen` here, since they two have the same meanings. Thanks! 
------------- PR: https://git.openjdk.org/jdk/pull/9037 From sviswanathan at openjdk.org Thu Jun 16 14:49:04 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 16 Jun 2022 14:49:04 GMT Subject: RFR: 8288281: compiler/vectorapi/VectorFPtoIntCastTest.java failed with "IRViolationException: There were one or multiple IR rule failures." [v2] In-Reply-To: References: Message-ID: <5wKGznnVc-vxaV4U3TP0z3u0Y9-Rqt8GNH4Qt6ZipkI=.10944429-ed93-45fc-80ba-dfa8dc311bb6@github.com> > The IR Framework test was failing due to incorrect node name. > Corrected the IR node name check. > > Please review. > > Best Regards, > Sandhya Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: Use IRNode instead of string ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9177/files - new: https://git.openjdk.org/jdk/pull/9177/files/5baf4623..e8028014 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9177&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9177&range=00-01 Stats: 8 lines in 1 file changed: 0 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/9177.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9177/head:pull/9177 PR: https://git.openjdk.org/jdk/pull/9177 From sviswanathan at openjdk.org Thu Jun 16 14:49:05 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 16 Jun 2022 14:49:05 GMT Subject: RFR: 8288281: compiler/vectorapi/VectorFPtoIntCastTest.java failed with "IRViolationException: There were one or multiple IR rule failures." 
[v2] In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 07:13:41 GMT, Christian Hagedorn wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> Use IRNode instead of string > > test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java line 87: > >> 85: >> 86: @Test >> 87: @IR(counts = {"VectorCastF2X", "> 0"}) > > You can directly use `IRNode.VECTOR_CAST_F2X` (and `IRNode.VECTOR_CAST_D2X` below) which makes it clearer that this is an actual IR node and not a custom string. @chhagedorn I made the change accordingly. ------------- PR: https://git.openjdk.org/jdk/pull/9177 From aph at openjdk.org Thu Jun 16 14:59:40 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 16 Jun 2022 14:59:40 GMT Subject: RFR: 8288478: AArch64: Clean up whitespace in assembler_aarch64.hpp Message-ID: This is a very light cleanup of the layout of assembler_aarch64.hpp. I also corrected a few section headers that didn't match the section names in the Arm Architectural Reference Manual. 
------------- Commit messages: - 8288478: AArch64: Clean up whitespace in assembler_aarch64.hpp - 8288478: AArch64: Clean up whitespace in assembler_aarch64.hpp - Merge https://github.com/openjdk/jdk into JDK-8288478 - 8288478: AArch64: Clean up whitespace in assembler_aarch64.hpp Changes: https://git.openjdk.org/jdk/pull/9185/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9185&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288478 Stats: 154 lines in 1 file changed: 9 ins; 4 del; 141 mod Patch: https://git.openjdk.org/jdk/pull/9185.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9185/head:pull/9185 PR: https://git.openjdk.org/jdk/pull/9185 From thartmann at openjdk.org Thu Jun 16 15:19:13 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 16 Jun 2022 15:19:13 GMT Subject: RFR: 8288281: compiler/vectorapi/VectorFPtoIntCastTest.java failed with "IRViolationException: There were one or multiple IR rule failures." [v2] In-Reply-To: <5wKGznnVc-vxaV4U3TP0z3u0Y9-Rqt8GNH4Qt6ZipkI=.10944429-ed93-45fc-80ba-dfa8dc311bb6@github.com> References: <5wKGznnVc-vxaV4U3TP0z3u0Y9-Rqt8GNH4Qt6ZipkI=.10944429-ed93-45fc-80ba-dfa8dc311bb6@github.com> Message-ID: On Thu, 16 Jun 2022 14:49:04 GMT, Sandhya Viswanathan wrote: >> The IR Framework test was failing due to incorrect node name. >> Corrected the IR node name check. >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Use IRNode instead of string Marked as reviewed by thartmann (Reviewer).
------------- PR: https://git.openjdk.org/jdk/pull/9177 From chagedorn at openjdk.org Thu Jun 16 15:35:44 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 16 Jun 2022 15:35:44 GMT Subject: RFR: 8288564: C2: LShiftLNode::Ideal produces wrong result after JDK-8278114 Message-ID: [JDK-8278114](https://bugs.openjdk.org/browse/JDK-8278114) added the following transformation for integer and long left shifts: "(x + x) << c0" into "x << (c0 + 1)" However, in the long shift case, this transformation is not correct if `c0` is 63: (x + x) << 63 = 2x << 63 while (x + x) << 63 --transform--> x << 64 = x << 0 = x which is not the same. For example, if `x = 1`: 2x << 63 = 2 << 63 = 0 != 1 This optimization does not account for the fact that `x << 64` is the same as `x << 0 = x`. According to the [Java spec, chapter 15.19](https://docs.oracle.com/javase/specs/jls/se18/html/jls-15.html#jls-15.19), we only consider the six lowest-order bits of the right-hand operand (i.e. `"right-hand operand" & 0b111111`). Therefore, `x << 64` is the same as `x << 0` (`64 & 0b111111 = 0b1000000 & 0b0111111 = 0`). Integer shifts are not affected because we do not apply this transformation if `c0 >= 16`: https://github.com/openjdk/jdk19/blob/729164f53499f146579a48ba1b466c687802f330/src/hotspot/share/opto/mulnode.cpp#L810-L817 The fix I propose is to not apply this optimization for long left shifts if `c0 == 63`. I've added an additional sanity assertion for integer left shifts just in case this optimization is moved at some point and ends up outside the check for `con < 16`.
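For illustration only (this is not code from the patch): the wrap-around of the long shift distance described above can be reproduced in plain Java. The class and method names below are hypothetical and simply spell out the two sides of the rewrite at the source level, assuming the expressions are evaluated with regular Java semantics.

```java
public class LongShiftWrapDemo {
    // Left-hand side of the rewrite: (x + x) << c0
    static long beforeRewrite(long x, int c0) {
        return (x + x) << c0;
    }

    // Right-hand side of the rewrite: x << (c0 + 1)
    static long afterRewrite(long x, int c0) {
        return x << (c0 + 1);
    }

    public static void main(String[] args) {
        long x = 1L;

        // c0 = 62: the new shift distance is 63, which is still in range,
        // so both forms agree.
        System.out.println(beforeRewrite(x, 62) == afterRewrite(x, 62));

        // c0 = 63: the new shift distance is 64. For long shifts only the six
        // lowest-order bits of the distance are used, so 64 behaves like 0.
        System.out.println(beforeRewrite(x, 63)); // (1 + 1) << 63
        System.out.println(afterRewrite(x, 63));  // 1 << 64, i.e. 1 << 0
        System.out.println(64 & 0b111111);        // masked shift distance
    }
}
```

With `x = 1` and `c0 = 63` the two forms differ (0 versus 1), which is the case the proposed `c0 == 63` exclusion is meant to rule out.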
Thanks, Christian ------------- Commit messages: - 8288564: C2: LShiftLNode::Ideal produces wrong result after JDK-8278114 Changes: https://git.openjdk.org/jdk19/pull/29/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=29&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288564 Stats: 38 lines in 2 files changed: 36 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk19/pull/29.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/29/head:pull/29 PR: https://git.openjdk.org/jdk19/pull/29 From ngasson at openjdk.org Thu Jun 16 16:06:02 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Thu, 16 Jun 2022 16:06:02 GMT Subject: RFR: 8288478: AArch64: Clean up whitespace in assembler_aarch64.hpp In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 14:16:28 GMT, Andrew Haley wrote: > This is a very light cleanup of the layout of assembler_aarch64.hpp. I also corrected a few section headers that didn't match the section names in the Arm Architectural Reference Manual. src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2423: > 2421: > 2422: // Advanced SIMD three different > 2423: #define INSN(NAME, opc, opc2, acceptT2D) \ This backslash is misaligned now. ------------- PR: https://git.openjdk.org/jdk/pull/9185 From shade at openjdk.org Thu Jun 16 16:31:55 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 16 Jun 2022 16:31:55 GMT Subject: RFR: 8288478: AArch64: Clean up whitespace in assembler_aarch64.hpp In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 14:16:28 GMT, Andrew Haley wrote: > This is a very light cleanup of the layout of assembler_aarch64.hpp. I also corrected a few section headers that didn't match the section names in the Arm Architectural Reference Manual. Looks fine, modulo the misaligned backslash Nick already mentioned. ------------- Marked as reviewed by shade (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9185 From aph at openjdk.org Thu Jun 16 16:40:59 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 16 Jun 2022 16:40:59 GMT Subject: RFR: 8288478: AArch64: Clean up whitespace in assembler_aarch64.hpp [v2] In-Reply-To: References: Message-ID: > This is a very light cleanup of the layout of assembler_aarch64.hpp. I also corrected a few section headers that didn't match the section names in the Arm Architectural Reference Manual. Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: 8288478: AArch64: Clean up whitespace in assembler_aarch64.hpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9185/files - new: https://git.openjdk.org/jdk/pull/9185/files/283ed81a..8612bbc0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9185&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9185&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9185.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9185/head:pull/9185 PR: https://git.openjdk.org/jdk/pull/9185 From aph at openjdk.org Thu Jun 16 16:42:18 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 16 Jun 2022 16:42:18 GMT Subject: Integrated: 8288478: AArch64: Clean up whitespace in assembler_aarch64.hpp In-Reply-To: References: Message-ID: <5vjfXc7O_1nGfhS5oTmPxTvmCG1iWPObsCbge1pFB6g=.a2dd364a-f190-45d5-b607-817e02249008@github.com> On Thu, 16 Jun 2022 14:16:28 GMT, Andrew Haley wrote: > This is a very light cleanup of the layout of assembler_aarch64.hpp. I also corrected a few section headers that didn't match the section names in the Arm Architectural Reference Manual. This pull request has now been integrated. 
Changeset: 2cf7c017 Author: Andrew Haley URL: https://git.openjdk.org/jdk/commit/2cf7c0175977defa765b2acf33a857b9ead1a243 Stats: 153 lines in 1 file changed: 9 ins; 4 del; 140 mod 8288478: AArch64: Clean up whitespace in assembler_aarch64.hpp Reviewed-by: shade ------------- PR: https://git.openjdk.org/jdk/pull/9185 From kvn at openjdk.org Thu Jun 16 17:09:51 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 16 Jun 2022 17:09:51 GMT Subject: RFR: 8288564: C2: LShiftLNode::Ideal produces wrong result after JDK-8278114 In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 15:27:42 GMT, Christian Hagedorn wrote: > [JDK-8278114](https://bugs.openjdk.org/browse/JDK-8278114) added the following transformation for integer and long left shifts: > > "(x + x) << c0" into "x << (c0 + 1)" > > However, in the long shift case, this transformation is not correct if `c0` is 63: > > > (x + x) << 63 = 2x << 63 > > while > > (x + x) << 63 --transform--> x << 64 = x << 0 = x > > which is not the same. For example, if `x = 1`: > > 2x << 63 = 2 << 63 = 0 != 1 > > This optimization does not account for the fact that `x << 64` is the same as `x << 0 = x`. According to the [Java spec, chapter 15.19](https://docs.oracle.com/javase/specs/jls/se18/html/jls-15.html#jls-15.19), we only consider the six lowest-order bits of the right-hand operand (i.e. `"right-hand operand" & 0b111111`). Therefore, `x << 64` is the same as `x << 0` (`64 & 0b111111 = 0b1000000 & 0b0111111 = 0`). > > Integer shifts are not affected because we do not apply this transformation if `c0 >= 16`: > > https://github.com/openjdk/jdk19/blob/729164f53499f146579a48ba1b466c687802f330/src/hotspot/share/opto/mulnode.cpp#L810-L817 > > The fix I propose is to not apply this optimization for long left shifts if `c0 == 63`.
I've added an additional sanity assertion for integer left shifts just in case this optimization is moved at some point and ends up outside the check for `con < 16`. > > Thanks, > Christian Good. Just one comment. src/hotspot/share/opto/mulnode.cpp line 816: > 814: if (add1->in(1) == add1->in(2)) { > 815: // Convert "(x + x) << c0" into "x << (c0 + 1)" > 816: assert(con != BitsPerJavaInteger - 1, "sanity check, optimization cannot be applied for con == 31"); The assert is useless because there is a check at line 812 (`con < 16`). ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk19/pull/29 From iveresov at openjdk.org Thu Jun 16 17:15:46 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Thu, 16 Jun 2022 17:15:46 GMT Subject: RFR: 8288564: C2: LShiftLNode::Ideal produces wrong result after JDK-8278114 In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 15:27:42 GMT, Christian Hagedorn wrote: > [JDK-8278114](https://bugs.openjdk.org/browse/JDK-8278114) added the following transformation for integer and long left shifts: > > "(x + x) << c0" into "x << (c0 + 1)" > > However, in the long shift case, this transformation is not correct if `c0` is 63: > > > (x + x) << 63 = 2x << 63 > > while > > (x + x) << 63 --transform--> x << 64 = x << 0 = x > > which is not the same. For example, if `x = 1`: > > 2x << 63 = 2 << 63 = 0 != 1 > > This optimization does not account for the fact that `x << 64` is the same as `x << 0 = x`. According to the [Java spec, chapter 15.19](https://docs.oracle.com/javase/specs/jls/se18/html/jls-15.html#jls-15.19), we only consider the six lowest-order bits of the right-hand operand (i.e. `"right-hand operand" & 0b111111`). Therefore, `x << 64` is the same as `x << 0` (`64 & 0b111111 = 0b1000000 & 0b0111111 = 0`).
> > Integer shifts are not affected because we do not apply this transformation if `c0 >= 16`: > > https://urldefense.com/v3/__https://github.com/openjdk/jdk19/blob/729164f53499f146579a48ba1b466c687802f330/src/hotspot/share/opto/mulnode.cpp*L810-L817__;Iw!!ACWV5N9M2RV99hQ!IN2wDcx5kzntMeIKiQhptQCq99gsSV7AltjMvjDzqZI61_AhkqxvLAxg6Sqx2C_wiGTrjY3VYC3OpUGIIiwnmqO5K6dSWoqs$ > > The fix I propose is to not apply this optimization for long left shifts if `c0 == 63`. I've added an additional sanity assertion for integer left shifts just in case this optimization is moved at some point and ending up outside the check for `con < 16`. > > Thanks, > Christian Marked as reviewed by iveresov (Reviewer). ------------- PR: https://git.openjdk.org/jdk19/pull/29 From sviswanathan at openjdk.org Thu Jun 16 22:10:47 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 16 Jun 2022 22:10:47 GMT Subject: Integrated: 8288281: compiler/vectorapi/VectorFPtoIntCastTest.java failed with "IRViolationException: There were one or multiple IR rule failures." In-Reply-To: References: Message-ID: <72OI6GMg3426WOihovhYj2i6yuDNwakTNmAhmAHjbaE=.42b2490c-3240-4cb8-828d-b3196c79a1a4@github.com> On Thu, 16 Jun 2022 02:34:33 GMT, Sandhya Viswanathan wrote: > The IR Framework test was failing due to incorrect node name. > Corrected the IR node name check. > > Please review. > > Best Regards, > Sandhya This pull request has now been integrated. Changeset: 9d4b25e7 Author: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/9d4b25e7888098a866ff980e37b8d16d456906d8 Stats: 8 lines in 1 file changed: 0 ins; 0 del; 8 mod 8288281: compiler/vectorapi/VectorFPtoIntCastTest.java failed with "IRViolationException: There were one or multiple IR rule failures." 
Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/9177 From xgong at openjdk.org Fri Jun 17 01:27:57 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 17 Jun 2022 01:27:57 GMT Subject: RFR: 8288397: AArch64: Fix register issues in SVE backend match rules In-Reply-To: References: <0vPwXBEnXX_w1358C7v4JCBZ_4uIGxokASDSkghGQS0=.01fa04ba-1213-4105-9734-efea6ff6293e@github.com> Message-ID: On Thu, 16 Jun 2022 05:59:43 GMT, Ningsheng Jian wrote: >> There are register usage issues in the sve backend match rules, which made the two added jtreg tests fail. >> >> The predicated vector "`not`" rules didn't use the same register for "`src`" and "`dst`", which is necessary to make sure the inactive lanes in "`dst`" save the same elements as "`src`". This patch fixes the rules by using the same register for "`dst`" and "`src`". >> >> And the input idx register in "`gatherL/scatterL`" rules was overwritten by the first unpack instruction. The same issue also existed in the partial and predicated gatherL/scatterL rules. This patch fixes them by saving the unpack results into a temp register and use it as the index for gather/scatter. > > Thanks for the fix! Thanks for the review @nsjian ! ------------- PR: https://git.openjdk.org/jdk19/pull/17 From xgong at openjdk.org Fri Jun 17 01:27:58 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 17 Jun 2022 01:27:58 GMT Subject: RFR: 8288397: AArch64: Fix register issues in SVE backend match rules In-Reply-To: <0vPwXBEnXX_w1358C7v4JCBZ_4uIGxokASDSkghGQS0=.01fa04ba-1213-4105-9734-efea6ff6293e@github.com> References: <0vPwXBEnXX_w1358C7v4JCBZ_4uIGxokASDSkghGQS0=.01fa04ba-1213-4105-9734-efea6ff6293e@github.com> Message-ID: On Wed, 15 Jun 2022 09:40:52 GMT, Xiaohong Gong wrote: > There are register usage issues in the sve backend match rules, which made the two added jtreg tests fail. 
> > The predicated vector "`not`" rules didn't use the same register for "`src`" and "`dst`", which is necessary to make sure the inactive lanes in "`dst`" save the same elements as "`src`". This patch fixes the rules by using the same register for "`dst`" and "`src`". > > And the input idx register in "`gatherL/scatterL`" rules was overwritten by the first unpack instruction. The same issue also existed in the partial and predicated gatherL/scatterL rules. This patch fixes them by saving the unpack results into a temp register and use it as the index for gather/scatter. Hi @nick-arm, may I have your review for this PR? Thanks for your time! ------------- PR: https://git.openjdk.org/jdk19/pull/17 From rahul.kandu at intel.com Fri Jun 17 01:36:49 2022 From: rahul.kandu at intel.com (Kandu, Rahul) Date: Fri, 17 Jun 2022 01:36:49 +0000 Subject: FW: RFR: 8288397: AArch64: Fix register issues in SVE backend match rules In-Reply-To: References: <0vPwXBEnXX_w1358C7v4JCBZ_4uIGxokASDSkghGQS0=.01fa04ba-1213-4105-9734-efea6ff6293e@github.com> Message-ID: -----Original Message----- From: hotspot-compiler-dev On Behalf Of Xiaohong Gong Sent: Thursday, June 16, 2022 6:28 PM To: hotspot-compiler-dev at openjdk.org Subject: Re: RFR: 8288397: AArch64: Fix register issues in SVE backend match rules On Wed, 15 Jun 2022 09:40:52 GMT, Xiaohong Gong wrote: > There are register usage issues in the sve backend match rules, which made the two added jtreg tests fail. > > The predicated vector "`not`" rules didn't use the same register for "`src`" and "`dst`", which is necessary to make sure the inactive lanes in "`dst`" save the same elements as "`src`". This patch fixes the rules by using the same register for "`dst`" and "`src`". > > And the input idx register in "`gatherL/scatterL`" rules was overwritten by the first unpack instruction. The same issue also existed in the partial and predicated gatherL/scatterL rules. 
This patch fixes them by saving the unpack results into a temp register and using it as the index for gather/scatter. Hi @nick-arm, may I have your review for this PR? Thanks for your time! ------------- PR: https://git.openjdk.org/jdk19/pull/17 From xliu at openjdk.org Fri Jun 17 05:24:49 2022 From: xliu at openjdk.org (Xin Liu) Date: Fri, 17 Jun 2022 05:24:49 GMT Subject: RFR: 8263384: IGV: Outline should highlight the Graph that has focus [v2] In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 10:22:33 GMT, Roberto Castañeda Lozano wrote: >> This changeset eases navigation within and across graph groups by highlighting the focused graph in the Outline window. If the user changes the focus to another graph window, or moves to the previous or next graph within the same window, the newly focused graph is automatically highlighted in the Outline window. This is implemented by maintaining a static map from opened graphs to their corresponding [NetBeans nodes](https://netbeans.apache.org/tutorials/nbm-selection-2.html). The Outline window uses the map to select, on a graph focus change, the NetBeans node of the newly focused graph that should be highlighted. >> >> Tested manually by opening simultaneously tens of graphs from different groups and switching the focus randomly. > > Roberto Castañeda Lozano has updated the pull request incrementally with three additional commits since the last revision: > > - Highlight active graph when the Outline window is re-opened > - Avoid unnecessary setting of 'result' to null > - Wait for last graph update before highlighting it LGTM. I am not a Reviewer. You still need other Reviewers to approve it. ------------- Marked as reviewed by xliu (Committer).
PR: https://git.openjdk.org/jdk/pull/9167 From epeter at openjdk.org Fri Jun 17 06:47:21 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 17 Jun 2022 06:47:21 GMT Subject: RFR: 8287801: Fix test-bugs related to stress flags Message-ID: I recently ran many tests with additional stress flags. While there were a few bugs I found, most of the issues were test bugs. I found a list of tests that are problematic with specific stress flags, I adjusted the tests so that they can now be run with the flag. Often I just fix the flag at the default value, so that setting it from the outside does not affect the test. Below I explain for each test how and why I adjusted the test. - test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyNoInitDeopt.java - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` - Disabling traps by setting limit of traps to zero, means some optimistic optimizations are not made, and can therefore not lead to deoptimization. The test expects deoptimization due to traps, so we need to have them on. - test/hotspot/jtreg/compiler/c2/cr7200264/TestDriver.java - used by TestSSE2IntVect.java and TestSSE4IntVect.java - test/hotspot/jtreg/compiler/c2/cr7200264/TestSSE2IntVect.java - Problem Flags: `-XX:StressLongCountedLoop=2000000` - Test checks the IR, and if we convert loops to long loops, some operations will not show up anymore, the test fails. Disable StressLongCountedLoop. - test/hotspot/jtreg/compiler/c2/cr7200264/TestSSE4IntVect.java - See TestSSE2IntVect.java - test/hotspot/jtreg/compiler/c2/irTests/blackhole/BlackholeStoreStoreEATest.java - Problem Flags: `-XX:-UseTLAB` - No thread local allocation (TLAB) means IR is changed. Test checks for MemBarStoreStore, which is missing without TLAB. Solution: always have TLAB on. 
- test/hotspot/jtreg/compiler/cha/AbstractRootMethod.java - Problem Flags: `-XX:+StressMethodHandleLinkerInlining` - Messes with recompilation, makes assert fail that expects recompilation. Must disable flag. - test/hotspot/jtreg/compiler/cha/DefaultRootMethod.java - see AbstractRootMethod.java - test/hotspot/jtreg/compiler/intrinsics/klass/CastNullCheckDroppingsTest.java - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` - Need traps, otherwise some optimistic optimisations are not made, and then they also are not trapped and deoptimized. - Problem Flags: `-XX:TypeProfileLevel=222` - Profiling also messes with optimizations / deoptimization. - Problem Flags: `-XX:+StressReflectiveCode` - Messes with types at allocation, which messes with optimizations. - Problem Flags: `-XX:-UncommonNullCast` - Is required for trapping in null checks. - Problem Flags: `-XX:+StressMethodHandleLinkerInlining` - Messes with inlining / optimization - turn it off. - test/hotspot/jtreg/compiler/jvmci/compilerToVM/IsMatureVsReprofileTest.java - Problem Flags: `-XX:Tier4BackEdgeThreshold=1 -Xbatch -XX:-TieredCompilation` - Lead to OSR compilation in loop calling `testMethod`, which is expected to be compiled. But with the OSR compilation, that function is inlined, and never compiled. Solution was to make sure we only compile `testMethod`. - test/hotspot/jtreg/compiler/jvmci/compilerToVM/ReprofileTest.java - Problem Flags: `-XX:TypeProfileLevel=222` - Changing profile flags messes with test, which assumes default behavior. - test/hotspot/jtreg/compiler/profiling/TestTypeProfiling.java - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` - Need traps to check for optimistic optimizations. 
- test/hotspot/jtreg/compiler/rangechecks/TestExplicitRangeChecks.java - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` - Need traps to check for optimistic optimizations. - test/hotspot/jtreg/compiler/rangechecks/TestLongRangeCheck.java - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` - Need traps to check for optimistic optimizations. - test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckSmearing.java - Problem Flags: `-XX:TieredStopAtLevel=3 -XX:+StressLoopInvariantCodeMotion` - Test expects to be run at compilation tier 4 / C2, so that tier must be enforced in the test's requirements. - test/hotspot/jtreg/compiler/uncommontrap/Decompile.java - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0 -XX:PerBytecodeTrapLimit=0` - The test checks whether we trap and decompile after we call member functions of a class that we did not use before. If we disable traps, then internally it uses a virtual call, and no deoptimization is required - but the test expects trapping and deoptimization. Solution: set trap limits to default. - Problem Flags: `-XX:TypeProfileLevel=222` - Changing profiling behavior also messes with deoptimization - disable it. - test/hotspot/jtreg/compiler/uncommontrap/TestUnstableIfTrap.java - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` - Test expects traps, so we must ensure the limits are at default.
------------- Commit messages: - Merge branch 'master' into JDK-8287801 - 8287801: Fix test-bugs related to stress flags Changes: https://git.openjdk.org/jdk/pull/9186/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9186&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287801 Stats: 27 lines in 16 files changed: 23 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/9186.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9186/head:pull/9186 PR: https://git.openjdk.org/jdk/pull/9186 From rcastanedalo at openjdk.org Fri Jun 17 07:34:57 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 17 Jun 2022 07:34:57 GMT Subject: Integrated: 8288421: IGV: warn user about all unreachable nodes In-Reply-To: <4ZHwSlqIsbDWFokTr-mioh1a-n4XIZWr7eK1vgVOMAA=.e97a46cd-68a3-4d71-904e-f9ca433f6046@github.com> References: <4ZHwSlqIsbDWFokTr-mioh1a-n4XIZWr7eK1vgVOMAA=.e97a46cd-68a3-4d71-904e-f9ca433f6046@github.com> Message-ID: <-449Ck0YzMAGQP5LYyNFebLBXopUF-Lomix7MczSfxY=.507daf73-25e0-4655-a0d9-1914d447e343@github.com> On Wed, 15 Jun 2022 08:50:39 GMT, Roberto Castañeda Lozano wrote: > This changeset ensures that, when approximating C2's schedule, IGV does not schedule unreachable nodes. Instead, a node warning is emitted, informing the user that the corresponding node is unreachable. This information can be useful when debugging ill-formed graphs.
> > The following clustered subgraph illustrates the proposed change: > > ![before-after](https://user-images.githubusercontent.com/8792647/173784252-5fccb80b-7c36-49bf-8c52-eed502cc129c.png) > > Currently _(before)_, `522 IfFalse` gets assigned the same block as `256 Region` (`B11`) in an effort to schedule as many nodes as possible, and hence no warning is emitted for `522 IfFalse`, even though it is clearly control-unreachable (since it is a child of `520 If` which is control-unreachable). This changeset _(after)_ leaves instead `522 IfFalse` unscheduled and emits a "Control-unreachable CFG node" warning for it (visible as a tooltip of the node warning sign). > > As a side-benefit, the changeset simplifies the IGV scheduling algorithm by removing the code that tries to schedule unreachable nodes on a best-effort basis, and adds two additional node warnings ("Region with multiple successors" and "CFG node without control successors") to highlight the new nodes that might remain unscheduled as a consequence. > > #### Testing > > - Tested manually on the [graph](https://bugs.openjdk.org/secure/attachment/99555/graph.xml) attached to the JBS issue. > > - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not trigger any exception or assertion failure. This pull request has now been integrated.
Changeset: f3da7ff6 Author: Roberto Castañeda Lozano URL: https://git.openjdk.org/jdk/commit/f3da7ff66e83a44118c090b7729dce858f0df1b1 Stats: 91 lines in 1 file changed: 11 ins; 79 del; 1 mod 8288421: IGV: warn user about all unreachable nodes Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/9164 From chagedorn at openjdk.org Fri Jun 17 08:31:46 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 17 Jun 2022 08:31:46 GMT Subject: RFR: 8287801: Fix test-bugs related to stress flags In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 14:44:19 GMT, Emanuel Peter wrote: > I recently ran many tests with additional stress flags. While there were a few bugs I found, most of the issues were test bugs. > > I found a list of tests that are problematic with specific stress flags, I adjusted the tests so that they can now be run with the flag. Often I just fix the flag at the default value, so that setting it from the outside does not affect the test. > > Below I explain for each test how and why I adjusted the test. > > - test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyNoInitDeopt.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Disabling traps by setting limit of traps to zero, means some optimistic optimizations are not made, and can therefore not lead to deoptimization. The test expects deoptimization due to traps, so we need to have them on. > - test/hotspot/jtreg/compiler/c2/cr7200264/TestDriver.java > - used by TestSSE2IntVect.java and TestSSE4IntVect.java > - test/hotspot/jtreg/compiler/c2/cr7200264/TestSSE2IntVect.java > - Problem Flags: `-XX:StressLongCountedLoop=2000000` > - Test checks the IR, and if we convert loops to long loops, some operations will not show up anymore, the test fails. Disable StressLongCountedLoop.
> - test/hotspot/jtreg/compiler/c2/cr7200264/TestSSE4IntVect.java > - See TestSSE2IntVect.java > - test/hotspot/jtreg/compiler/c2/irTests/blackhole/BlackholeStoreStoreEATest.java > - Problem Flags: `-XX:-UseTLAB` > - No thread local allocation (TLAB) means IR is changed. Test checks for MemBarStoreStore, which is missing without TLAB. Solution: always have TLAB on. > - test/hotspot/jtreg/compiler/cha/AbstractRootMethod.java > - Problem Flags: `-XX:+StressMethodHandleLinkerInlining` > - Messes with recompilation, makes assert fail that expects recompilation. Must disable flag. > - test/hotspot/jtreg/compiler/cha/DefaultRootMethod.java > - see AbstractRootMethod.java > - test/hotspot/jtreg/compiler/intrinsics/klass/CastNullCheckDroppingsTest.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Need traps, otherwise some optimistic optimisations are not made, and then they also are not trapped and deoptimized. > - Problem Flags: `-XX:TypeProfileLevel=222` > - Profiling also messes with optimizations / deoptimization. > - Problem Flags: `-XX:+StressReflectiveCode` > - Messes with types at allocation, which messes with optimizations. > - Problem Flags: `-XX:-UncommonNullCast` > - Is required for trapping in null checks. > - Problem Flags: `-XX:+StressMethodHandleLinkerInlining` > - Messes with inlining / optimization - turn it off. > - test/hotspot/jtreg/compiler/jvmci/compilerToVM/IsMatureVsReprofileTest.java > - Problem Flags: `-XX:Tier4BackEdgeThreshold=1 -Xbatch -XX:-TieredCompilation` > - Lead to OSR compilation in loop calling `testMethod`, which is expected to be compiled. But with the OSR compilation, that function is inlined, and never compiled. Solution was to make sure we only compile `testMethod`. > - test/hotspot/jtreg/compiler/jvmci/compilerToVM/ReprofileTest.java > - Problem Flags: `-XX:TypeProfileLevel=222` > - Changing profile flags messes with test, which assumes default behavior. 
> - test/hotspot/jtreg/compiler/profiling/TestTypeProfiling.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Need traps to check for optimistic optimizations. > - test/hotspot/jtreg/compiler/rangechecks/TestExplicitRangeChecks.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Need traps to check for optimistic optimizations. > - test/hotspot/jtreg/compiler/rangechecks/TestLongRangeCheck.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Need traps to check for optimistic optimizations. > - test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckSmearing.java > - Problem Flags: `-XX:TieredStopAtLevel=3 -XX:+StressLoopInvariantCodeMotion` > - Test expects to be run at compilation tier 4 / C2, so must fix it at that in requirement. > - test/hotspot/jtreg/compiler/uncommontrap/Decompile.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0 -XX:PerBytecodeTrapLimit=0` > - The tests if we trap and decompile after we call member functions of a class that we did not use before. If we disable traps, then internally it uses a virtual call, and no deoptimization is required - but the test expects trapping and deoptimization. Solution: set trap limits to default. > - Problem Flags: `-XX:TypeProfileLevel=222` > - Changing profiling behavior also messes with deoptimization - disable it. > - test/hotspot/jtreg/compiler/uncommontrap/TestUnstableIfTrap.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Test expects traps, so we must ensure the limits are at default. Nice analysis and summary of the failures! The test fixes look good to me! 
test/hotspot/jtreg/compiler/c2/irTests/blackhole/BlackholeStoreStoreEATest.java line 42: > 40: public static void main(String[] args) { > 41: TestFramework.runWithFlags( > 42: "-XX:+UseTLAB", Just as a side note: As this example shows, treating `UseTLAB` as having no effect on the IR (i.e. on the IR framework whitelist) is incorrect. We should think about removing it from the whitelist at some point. But fixing this test like that is perfectly fine. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9186 From chagedorn at openjdk.org Fri Jun 17 08:38:53 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 17 Jun 2022 08:38:53 GMT Subject: RFR: 8288564: C2: LShiftLNode::Ideal produces wrong result after JDK-8278114 In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 16:33:30 GMT, Vladimir Kozlov wrote: >> [JDK-8278114](https://bugs.openjdk.org/browse/JDK-8278114) added the following transformation for integer and long left shifts: >> >> "(x + x) << c0" into "x << (c0 + 1)" >> >> However, in the long shift case, this transformation is not correct if `c0` is 63: >> >> >> (x + x) << 63 = 2x << 63 >> >> while >> >> (x + x) << 63 --transform--> x << 64 = x << 0 = x >> >> which is not the same. For example, if `x = 1`: >> >> 2x << 63 = 2 << 63 = 0 != 1 >> >> This optimization does not account for the fact that `x << 64` is the same as `x << 0 = x`. According to the [Java spec, chapter 15.19](https://docs.oracle.com/javase/specs/jls/se18/html/jls-15.html#jls-15.19), we only consider the six lowest-order bits of the right-hand operand (i.e. `"right-hand operand" & 0b111111`). Therefore, `x << 64` is the same as `x << 0` (`64 & 0b111111 = 0b1000000 & 0b0111111 = 0`).
>> >> Integer shifts are not affected because we do not apply this transformation if `c0 >= 16`: >> >> https://github.com/openjdk/jdk19/blob/729164f53499f146579a48ba1b466c687802f330/src/hotspot/share/opto/mulnode.cpp#L810-L817 >> >> The fix I propose is to not apply this optimization for long left shifts if `c0 == 63`. I've added an additional sanity assertion for integer left shifts just in case this optimization is moved at some point and ends up outside the check for `con < 16`. >> >> Thanks, >> Christian > > src/hotspot/share/opto/mulnode.cpp line 816: > >> 814: if (add1->in(1) == add1->in(2)) { >> 815: // Convert "(x + x) << c0" into "x << (c0 + 1)" >> 816: assert(con != BitsPerJavaInteger - 1, "sanity check, optimization cannot be applied for con == 31"); > > The assert is useless because there is a check at line 812 (`con < 16`). Yes, that's true. I just wanted to make sure that we do not forget about the fact that we cannot apply this optimization for `con == 31` if this optimization is moved at some point or we decide to remove the `con < 16` restriction. But I guess I can also just add a comment instead. ------------- PR: https://git.openjdk.org/jdk19/pull/29 From chagedorn at openjdk.org Fri Jun 17 08:46:39 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 17 Jun 2022 08:46:39 GMT Subject: RFR: 8288564: C2: LShiftLNode::Ideal produces wrong result after JDK-8278114 [v2] In-Reply-To: References: Message-ID: > [JDK-8278114](https://bugs.openjdk.org/browse/JDK-8278114) added the following transformation for integer and long left shifts: > > "(x + x) << c0" into "x << (c0 + 1)" > > However, in the long shift case, this transformation is not correct if `c0` is 63: > > > (x + x) << 63 = 2x << 63 > > while > > (x + x) << 63 --transform--> x << 64 = x << 0 = x > > which is not the same.
For example, if `x = 1`: > > 2x << 63 = 2 << 63 = 0 != 1 > > This optimization does not account for the fact that `x << 64` is the same as `x << 0 = x`. According to the [Java spec, chapter 15.19](https://docs.oracle.com/javase/specs/jls/se18/html/jls-15.html#jls-15.19), we only consider the six lowest-order bits of the right-hand operand (i.e. `"right-hand operand" & 0b111111`). Therefore, `x << 64` is the same as `x << 0` (`64 & 0b111111 = 0b1000000 & 0b0111111 = 0`). > > Integer shifts are not affected because we do not apply this transformation if `c0 >= 16`: > > https://github.com/openjdk/jdk19/blob/729164f53499f146579a48ba1b466c687802f330/src/hotspot/share/opto/mulnode.cpp#L810-L817 > > The fix I propose is to not apply this optimization for long left shifts if `c0 == 63`. I've added an additional sanity assertion for integer left shifts just in case this optimization is moved at some point and ends up outside the check for `con < 16`.
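The wraparound described in this message can be reproduced in plain Java. The following is only a minimal sketch for illustration: the class and method names (`LongShiftWraparound`, `original`, `transformed`) are hypothetical and are not part of HotSpot or of this patch. It simply evaluates both sides of the transformation on a `long` operand.

```java
// Illustrative sketch only: the class and method names are hypothetical,
// they do not exist in HotSpot and are not taken from the patch.
public class LongShiftWraparound {
    // Left-hand side of the transformation: (x + x) << c0
    static long original(long x, int c0) {
        return (x + x) << c0;
    }

    // Right-hand side of the transformation: x << (c0 + 1).
    // For long shifts, Java only uses the six lowest-order bits of the
    // shift count (JLS 15.19), so a count of 64 behaves like a count of 0.
    static long transformed(long x, int c0) {
        return x << (c0 + 1);
    }

    public static void main(String[] args) {
        long x = 1L;
        // Compare both forms just below and exactly at the problematic count.
        for (int c0 : new int[] {62, 63}) {
            System.out.println("c0=" + c0
                    + " original=" + original(x, c0)
                    + " transformed=" + transformed(x, c0));
        }
    }
}
```

For `x = 1`, the two forms agree at `c0 = 62` but differ at `c0 = 63` (0 versus 1), which matches the example given in the description.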
> > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with two additional commits since the last revision: - typo - change assert to comment ------------- Changes: - all: https://git.openjdk.org/jdk19/pull/29/files - new: https://git.openjdk.org/jdk19/pull/29/files/37040f94..45b1e994 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=29&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=29&range=00-01 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk19/pull/29.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/29/head:pull/29 PR: https://git.openjdk.org/jdk19/pull/29 From chagedorn at openjdk.org Fri Jun 17 08:53:46 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 17 Jun 2022 08:53:46 GMT Subject: RFR: 8263384: IGV: Outline should highlight the Graph that has focus [v2] In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 10:22:33 GMT, Roberto Castañeda Lozano wrote: >> This changeset eases navigation within and across graph groups by highlighting the focused graph in the Outline window. If the user changes the focus to another graph window, or moves to the previous or next graph within the same window, the newly focused graph is automatically highlighted in the Outline window. This is implemented by maintaining a static map from opened graphs to their corresponding [NetBeans nodes](https://netbeans.apache.org/tutorials/nbm-selection-2.html). The Outline window uses the map to select, on a graph focus change, the NetBeans node of the newly focused graph that should be highlighted. >> >> Tested manually by opening simultaneously tens of graphs from different groups and switching the focus randomly.
> > Roberto Castañeda Lozano has updated the pull request incrementally with three additional commits since the last revision: > > - Highlight active graph when the Outline window is re-opened > - Avoid unnecessary setting of 'result' to null > - Wait for last graph update before highlighting it Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9167 From chagedorn at openjdk.org Fri Jun 17 08:56:46 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 17 Jun 2022 08:56:46 GMT Subject: RFR: 8288480: IGV: toolbar action is not applied to the focused graph In-Reply-To: References: Message-ID: <5HwYpbKPrqFGn2_qlqJeLdYWC1ZHAgzZzcjG5c6wiIk=.e537ab9b-9d0e-4c2b-9b15-028dc0f97c10@github.com> On Wed, 15 Jun 2022 13:45:53 GMT, Roberto Castañeda Lozano wrote: > When multiple graphs are displayed simultaneously in split windows, the following toolbar actions are always applied to the same graph, regardless of which graph window is focused: > > - search nodes and blocks > - extract node > - hide node > - show all nodes > - zoom in > - zoom out > > This changeset ensures that each of the above actions is only applied within its corresponding graph window. This is achieved by applying the actions to the graph window that is currently activated (`EditorTopComponent.getRegistry().getActivated()`) instead of the first matching occurrence in `WindowManager.getDefault().getModes()`.
> > The changeset makes it practical, for example, to explore different views of the same graph simultaneously, as illustrated here: > > ![multi-view](https://user-images.githubusercontent.com/8792647/173841115-084c6396-3843-4d9b-9951-f93c932100c3.png) > > Tested manually by triggering the above actions within multiple split graph windows and asserting that they are only applied to their corresponding graphs. I've also run into this problem before, thanks for fixing this! Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9169 From rcastanedalo at openjdk.org Fri Jun 17 09:28:02 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 17 Jun 2022 09:28:02 GMT Subject: RFR: 8263384: IGV: Outline should highlight the Graph that has focus [v2] In-Reply-To: References: Message-ID: On Fri, 17 Jun 2022 08:50:26 GMT, Christian Hagedorn wrote: > Looks good! Thanks, Christian! ------------- PR: https://git.openjdk.org/jdk/pull/9167 From rcastanedalo at openjdk.org Fri Jun 17 09:28:05 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 17 Jun 2022 09:28:05 GMT Subject: RFR: 8288480: IGV: toolbar action is not applied to the focused graph In-Reply-To: <5HwYpbKPrqFGn2_qlqJeLdYWC1ZHAgzZzcjG5c6wiIk=.e537ab9b-9d0e-4c2b-9b15-028dc0f97c10@github.com> References: <5HwYpbKPrqFGn2_qlqJeLdYWC1ZHAgzZzcjG5c6wiIk=.e537ab9b-9d0e-4c2b-9b15-028dc0f97c10@github.com> Message-ID: On Fri, 17 Jun 2022 08:53:15 GMT, Christian Hagedorn wrote: > I've also run into this problem before, thanks for fixing this! Looks good! Thanks for reviewing, Christian!
------------- PR: https://git.openjdk.org/jdk/pull/9169 From aph at openjdk.org Fri Jun 17 09:28:14 2022 From: aph at openjdk.org (Andrew Haley) Date: Fri, 17 Jun 2022 09:28:14 GMT Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls In-Reply-To: References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> Message-ID: On Wed, 8 Jun 2022 14:13:01 GMT, Evgeny Astigeevich wrote: > If we never patch the branch to the interpreter, we can optimize it at link time either to a direct branch or an adrp based far jump. I also created https://bugs.openjdk.org/browse/JDK-8286142 to reduce metadata mov instructions. If we emit the address of the interpreter once, at the start of the stub section, we can replace the branch to the interpreter with `ldr rscratch1, adr; br rscratch1`. ------------- PR: https://git.openjdk.org/jdk/pull/8816 From epeter at openjdk.org Fri Jun 17 09:48:53 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 17 Jun 2022 09:48:53 GMT Subject: RFR: 8288467: C2: assert(cisc->memory_operand() == nullptr) failed: no memory operand, only stack Message-ID: In [JDK-8282555](https://bugs.openjdk.org/browse/JDK-8282555) I added this assert, because on x64 this seems to always hold. But it turns out there are instructions on x86 (32bit) that violate this assumption. **Why it holds on x64** It seems we only ever do one read or one write per instruction. Instructions with multiple memory operands are extremely rare. 1. we spill from register to memory: we land in the if case, where the cisc node has an additional input slot for the memory edge. 2. we spill from register to stackSlot: no additional input slot is reserved, we land in else case and add an additional precedence edge. **Why it is violated on x86 (32bit)** We have additional cases that land in the else case. For example spilling `src1` from `addFPR24_reg_mem` to `addFPR24_mem_cisc`. 
https://github.com/openjdk/jdk19/blob/53bf1bfdabb79b37afedd09051d057f9eea620f2/src/hotspot/cpu/x86/x86_32.ad#L10325-L10327 https://github.com/openjdk/jdk19/blob/53bf1bfdabb79b37afedd09051d057f9eea620f2/src/hotspot/cpu/x86/x86_32.ad#L10368-L10370 We land in the else case, because both have 2 inputs, thus `oper_input_base() == 2`. And both have memory operands, so the assert must fail. **Solutions** 1. Remove the assert, as it is incorrect. 2. Extend the assert to be correct. - case 1: reg to mem spill, where we have a reserved input slot in cisc for memory edge - case 2: reg to stackSlot spill, where both mach and cisc have no memory operand. - other cases, with various register, stackSlot and memory inputs and outputs. We would have to find a general rule, and test it properly, which is not trivial because getting registers to spill is not easy to precisely provoke. 3. Have platform dependent asserts. But this also makes testing harder. For now I went with 1. as it is simple and as far as I can see correct. Running tests on x64 (should not fail). **I need someone to help me with testing x86 (32bit)**. I only verified the reported test failure with a 32bit build.
------------- Commit messages: - 8288467: C2: assert(cisc->memory_operand() == nullptr) failed: no memory operand, only stack Changes: https://git.openjdk.org/jdk19/pull/33/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=33&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288467 Stats: 7 lines in 1 file changed: 5 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk19/pull/33.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/33/head:pull/33 PR: https://git.openjdk.org/jdk19/pull/33 From ngasson at openjdk.org Fri Jun 17 12:35:59 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Fri, 17 Jun 2022 12:35:59 GMT Subject: RFR: 8288397: AArch64: Fix register issues in SVE backend match rules In-Reply-To: <0vPwXBEnXX_w1358C7v4JCBZ_4uIGxokASDSkghGQS0=.01fa04ba-1213-4105-9734-efea6ff6293e@github.com> References: <0vPwXBEnXX_w1358C7v4JCBZ_4uIGxokASDSkghGQS0=.01fa04ba-1213-4105-9734-efea6ff6293e@github.com> Message-ID: On Wed, 15 Jun 2022 09:40:52 GMT, Xiaohong Gong wrote: > There are register usage issues in the sve backend match rules, which made the two added jtreg tests fail. > > The predicated vector "`not`" rules didn't use the same register for "`src`" and "`dst`", which is necessary to make sure the inactive lanes in "`dst`" save the same elements as "`src`". This patch fixes the rules by using the same register for "`dst`" and "`src`". > > And the input idx register in "`gatherL/scatterL`" rules was overwritten by the first unpack instruction. The same issue also existed in the partial and predicated gatherL/scatterL rules. This patch fixes them by saving the unpack results into a temp register and use it as the index for gather/scatter. Seems fine. ------------- Marked as reviewed by ngasson (Reviewer). 
PR: https://git.openjdk.org/jdk19/pull/17 From kvn at openjdk.org Fri Jun 17 13:49:54 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 17 Jun 2022 13:49:54 GMT Subject: RFR: 8288564: C2: LShiftLNode::Ideal produces wrong result after JDK-8278114 [v2] In-Reply-To: References: Message-ID: On Fri, 17 Jun 2022 08:46:39 GMT, Christian Hagedorn wrote: >> [JDK-8278114](https://bugs.openjdk.org/browse/JDK-8278114) added the following transformation for integer and long left shifts: >> >> "(x + x) << c0" into "x << (c0 + 1)" >> >> However, in the long shift case, this transformation is not correct if `c0` is 63: >> >> >> (x + x) << 63 = 2x << 63 >> >> while >> >> (x + x) << 63 --transform--> x << 64 = x << 0 = x >> >> which is not the same. For example, if `x = 1`: >> >> 2x << 63 = 2 << 63 = 0 != 1 >> >> This optimization does not account for the fact that `x << 64` is the same as `x << 0 = x`. According to the [Java spec, chapter 15.19](https://docs.oracle.com/javase/specs/jls/se18/html/jls-15.html#jls-15.19), we only consider the six lowest-order bits of the right-hand operand (i.e. `"right-hand operand" & 0b111111`). Therefore, `x << 64` is the same as `x << 0` (`64 & 0b111111 = 0b1000000 & 0b0111111 = 0`). >> >> Integer shifts are not affected because we do not apply this transformation if `c0 >= 16`: >> >> https://github.com/openjdk/jdk19/blob/729164f53499f146579a48ba1b466c687802f330/src/hotspot/share/opto/mulnode.cpp#L810-L817 >> >> The fix I propose is to not apply this optimization for long left shifts if `c0 == 63`. I've added an additional sanity assertion for integer left shifts just in case this optimization is moved at some point and ends up outside the check for `con < 16`.
>> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with two additional commits since the last revision: > > - typo > - change assert to comment Good ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk19/pull/29 From eosterlund at openjdk.org Fri Jun 17 13:51:59 2022 From: eosterlund at openjdk.org (Erik Österlund) Date: Fri, 17 Jun 2022 13:51:59 GMT Subject: RFR: 8284404: Too aggressive sweeping with Loom [v2] In-Reply-To: References: Message-ID: > The normal sweeping heuristics trigger sweeping whenever 0.5% of the reserved code cache could have died. Normally that is fine, but with loom such sweeping requires a full GC cycle, as stacks can now be in the Java heap as well. In that context, 0.5% does seem to be a bit too trigger happy. So this patch adjusts that default when using loom to 10x higher. > If you run something like jython which spins up a lot of code, it unsurprisingly triggers a lot fewer GCs due to code cache pressure.
Erik Österlund has updated the pull request incrementally with one additional commit since the last revision: Add comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/8673/files - new: https://git.openjdk.org/jdk/pull/8673/files/461f8a30..4613da71 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=8673&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=8673&range=00-01 Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/8673.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8673/head:pull/8673 PR: https://git.openjdk.org/jdk/pull/8673 From chagedorn at openjdk.org Fri Jun 17 13:59:00 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 17 Jun 2022 13:59:00 GMT Subject: RFR: 8288564: C2: LShiftLNode::Ideal produces wrong result after JDK-8278114 [v2] In-Reply-To: References: Message-ID: On Fri, 17 Jun 2022 08:46:39 GMT, Christian Hagedorn wrote: >> [JDK-8278114](https://bugs.openjdk.org/browse/JDK-8278114) added the following transformation for integer and long left shifts: >> >> "(x + x) << c0" into "x << (c0 + 1)" >> >> However, in the long shift case, this transformation is not correct if `c0` is 63: >> >> >> (x + x) << 63 = 2x << 63 >> >> while >> >> (x + x) << 63 --transform--> x << 64 = x << 0 = x >> >> which is not the same. For example, if `x = 1`: >> >> 2x << 63 = 2 << 63 = 0 != 1 >> >> This optimization does not account for the fact that `x << 64` is the same as `x << 0 = x`. According to the [Java spec, chapter 15.19](https://docs.oracle.com/javase/specs/jls/se18/html/jls-15.html#jls-15.19), we only consider the six lowest-order bits of the right-hand operand (i.e. `"right-hand operand" & 0b111111`). Therefore, `x << 64` is the same as `x << 0` (`64 & 0b111111 = 0b1000000 & 0b0111111 = 0`).
>> >> Integer shifts are not affected because we do not apply this transformation if `c0 >= 16`: >> >> https://github.com/openjdk/jdk19/blob/729164f53499f146579a48ba1b466c687802f330/src/hotspot/share/opto/mulnode.cpp#L810-L817 >> >> The fix I propose is to not apply this optimization for long left shifts if `c0 == 63`. I've added an additional sanity assertion for integer left shifts just in case this optimization is moved at some point and ends up outside the check for `con < 16`. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with two additional commits since the last revision: > > - typo > - change assert to comment Thanks Vladimir and Igor for your reviews! ------------- PR: https://git.openjdk.org/jdk19/pull/29 From vlivanov at openjdk.org Fri Jun 17 21:44:30 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 17 Jun 2022 21:44:30 GMT Subject: RFR: 8280320: C2: Loop opts are missing during OSR compilation Message-ID: After [JDK-8272330](https://bugs.openjdk.org/browse/JDK-8272330), OSR compilations may completely miss loop optimizations pass due to misleading profiling data. The cleanup changed how profile counts are scaled and it had surprising effect on OSR compilations. For a long-running loop it's common to have an MDO allocated during the first invocation while running in the loop. Also, OSR compilation may be scheduled while running the very first method invocation. In such case, `MethodData::invocation_counter() == 0` while `MethodData::backedge_counter() > 0`. Before JDK-8272330 went in, `ciMethod::scale_count()` took into account both `invocation_counter()` and `backedge_counter()`. Now `MethodData::invocation_counter()` is taken by `ciMethod::scale_count()` as is and it forces all counts to be unconditionally scaled to `1`. It misleads `IdealLoopTree::beautify_loops()` to believe there are no hot backedges in the loop being compiled and `IdealLoopTree::split_outer_loop()` doesn't kick in thus effectively blocking any further loop optimizations. Proposed fix bumps `MethodData::invocation_counter()` from `0` to `1` and enables `ciMethod::scale_count()` to report sane numbers.
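The scenario described above, a method that is still in its very first invocation while its loop is already hot, might look like the following Java sketch. The class and method names are hypothetical, and whether a given JVM really schedules an OSR compilation here depends on its tiering thresholds and flags, so this is an illustration of the shape of the code rather than a reproducer.

```java
public class OsrFirstInvocationSketch {
    // Called exactly once from main. All of the "hotness" comes from loop
    // back-edges, which is the situation the message describes for
    // OSR compilation during the first invocation.
    static long sumTo(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // A single call with a large trip count: the method is never re-invoked,
        // so any compilation of sumTo would have to be triggered from inside the loop.
        System.out.println(sumTo(100_000_000));
    }
}
```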
Testing: - hs-tier1 - hs-tier4 ------------- Commit messages: - 8280320: C2: Loop opts are missing during OSR compilation Changes: https://git.openjdk.org/jdk19/pull/38/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=38&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8280320 Stats: 4 lines in 1 file changed: 3 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk19/pull/38.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/38/head:pull/38 PR: https://git.openjdk.org/jdk19/pull/38 From dlong at openjdk.org Fri Jun 17 23:06:26 2022 From: dlong at openjdk.org (Dean Long) Date: Fri, 17 Jun 2022 23:06:26 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding Message-ID: The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out.
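As a rough illustration of the range constraint stated above, the sketch below models "1 to the element width" as a plain Java predicate and shows that a right shift by 0 is an identity at the Java level. All names are hypothetical; this is not the assembler or the matcher code from the patch, and it says nothing about which instruction the fix actually emits for a zero shift.

```java
import java.util.Arrays;

public class VectorShiftImmediateSketch {
    // Models the stated range for an AArch64 vector right-shift immediate:
    // valid values are 1 up to and including the element width in bits.
    static boolean isEncodableRightShift(int shift, int elementBits) {
        return shift >= 1 && shift <= elementBits;
    }

    // Scalar loop of the kind an auto-vectorizer may turn into a vector shift.
    static short[] shiftRightAll(short[] src, int shift) {
        short[] dst = new short[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = (short) (src[i] >> shift);
        }
        return dst;
    }

    public static void main(String[] args) {
        System.out.println(isEncodableRightShift(0, 16));  // false: 0 is outside the range
        System.out.println(isEncodableRightShift(1, 16));  // true
        System.out.println(isEncodableRightShift(16, 16)); // true
        short[] src = {1, -2, 300, -400};
        // Shifting by 0 leaves every element unchanged.
        System.out.println(Arrays.equals(shiftRightAll(src, 0), src));
    }
}
```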
------------- Commit messages: - fix comment - remove -XX:+PrintOptoAssembly left over from testing - Avoid "impossible encoding" for right-shift by 0 Changes: https://git.openjdk.org/jdk19/pull/40/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=40&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288445 Stats: 115 lines in 4 files changed: 83 ins; 0 del; 32 mod Patch: https://git.openjdk.org/jdk19/pull/40.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/40/head:pull/40 PR: https://git.openjdk.org/jdk19/pull/40 From duke at openjdk.org Sat Jun 18 06:19:57 2022 From: duke at openjdk.org (duke) Date: Sat, 18 Jun 2022 06:19:57 GMT Subject: Withdrawn: 8285246: AArch64: remove overflow check from InterpreterMacroAssembler::increment_mdp_data_at In-Reply-To: <5WaOUZM5UQphNA3qLOJTpNrKDWUsqwltpIitpxJTDbc=.d43ae883-b571-42af-adb9-500591d2fb91@github.com> References: <5WaOUZM5UQphNA3qLOJTpNrKDWUsqwltpIitpxJTDbc=.d43ae883-b571-42af-adb9-500591d2fb91@github.com> Message-ID: On Fri, 22 Apr 2022 14:35:00 GMT, Nick Gasson wrote: > Several reasons to do this: > > - A 64-bit counter is realistically never going to overflow in the interpreter. The PPC64 port also doesn't check for overflow for this reason. > > - It's inconsistent with C1 which does not check for overflow. (See e.g. `LIRGenerator::profile_branch()` which does `__ leal(...)` to add and explicitly doesn't set the flags.) > > - We're checking for 64-bit overflow as the MDO cells are word-sized but accessors like `BranchData::taken()` silently truncate to uint which is 32 bit. So I don't think this check is doing anything useful. > > - I'd like to experiment with using LSE far atomics to update the MDO counters here, but the overflow check prevents that. > > Tested jtreg tier1-3 and also verified that the counters for a particular test method were the same before and after when run with -Xbatch. This pull request has been closed without being integrated. 
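The third reason in the withdrawn proposal, a word-sized cell read through a 32-bit accessor, can be mimicked in Java. This is only an analogy under assumed names (`CounterTruncationSketch`, `asUint32`); it does not reproduce `BranchData::taken()` itself.

```java
public class CounterTruncationSketch {
    // Reads a 64-bit counter cell the way a 32-bit unsigned accessor would:
    // only the low 32 bits survive the narrowing cast.
    static int asUint32(long cell) {
        return (int) cell;
    }

    public static void main(String[] args) {
        long cell = (1L << 32) + 5;  // far below any 64-bit overflow
        // Prints 5: the upper half of the cell is silently dropped.
        System.out.println(Integer.toUnsignedString(asUint32(cell)));
    }
}
```

In this model the visible value wraps long before a 64-bit overflow check could ever fire, which is the point the message makes about the check not doing anything useful.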
------------- PR: https://git.openjdk.org/jdk/pull/8363 From xgong at openjdk.org Mon Jun 20 01:11:12 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 20 Jun 2022 01:11:12 GMT Subject: RFR: 8288397: AArch64: Fix register issues in SVE backend match rules In-Reply-To: References: <0vPwXBEnXX_w1358C7v4JCBZ_4uIGxokASDSkghGQS0=.01fa04ba-1213-4105-9734-efea6ff6293e@github.com> Message-ID: On Fri, 17 Jun 2022 12:32:56 GMT, Nick Gasson wrote: >> There are register usage issues in the sve backend match rules, which made the two added jtreg tests fail. >> >> The predicated vector "`not`" rules didn't use the same register for "`src`" and "`dst`", which is necessary to make sure the inactive lanes in "`dst`" save the same elements as "`src`". This patch fixes the rules by using the same register for "`dst`" and "`src`". >> >> And the input idx register in "`gatherL/scatterL`" rules was overwritten by the first unpack instruction. The same issue also existed in the partial and predicated gatherL/scatterL rules. This patch fixes them by saving the unpack results into a temp register and use it as the index for gather/scatter. > > Seems fine. Thanks for the review @nick-arm ! 
------------- PR: https://git.openjdk.org/jdk19/pull/17 From xgong at openjdk.org Mon Jun 20 01:14:03 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 20 Jun 2022 01:14:03 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3] In-Reply-To: References: <3NOxt5WPlQaJWxoOpxdoYGHOjhUTAd00BCRsWe_oZWk=.c3cf718d-d230-49e4-91ed-a945b349c585@github.com> Message-ID: On Thu, 16 Jun 2022 09:34:17 GMT, Jatin Bhateja wrote: >> It was added specifically for #302 >> It used for ArrayCopyPartialInlineSize code [macroArrayCopy.cpp#L250](https://urldefense.com/v3/__https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/macroArrayCopy.cpp*L250__;Iw!!ACWV5N9M2RV99hQ!PDE6W6C4GzlQyZkPThLWxRNsLj_oACvbHEvPJS6-aH_WB6cCRjTfkWC3bP5VuVlpu04GvTI-AUn8B-3YTI6O1vHeFpa5$ ) >> Yes, it is strange that it is forced to be in register always. The instruction in x86.ad should be enough to put it into register. > >> @jatin-bhateja, could you please help to check the influence about removing this? Kindly know your feedback about this. Thanks so much! > > I think, this can be modified, GVN based sharing should be sufficient here. Thanks a lot for looking at this part! ------------- PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.org Mon Jun 20 01:14:08 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 20 Jun 2022 01:14:08 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3] In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 08:40:18 GMT, Jatin Bhateja wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains four commits: >> >> - Address review comments, revert changes for gatherL/scatterL rules >> - Merge branch 'jdk:master' into JDK-8286941 >> - Revert transformation from MaskAll to VectorMaskGen, address review comments >> - 8286941: Add mask IR for partial vector operations for ARM SVE > > src/hotspot/share/opto/vectornode.cpp line 864: > >> 862: // Generate a vector mask for vector operation whose vector length is lower than the >> 863: // hardware supported max vector length. >> 864: if (vt->length_in_bytes() < MaxVectorSize) { > > For completeness, length comparison check can be done against MIN(SuperWordMaxVectorSize, MaxVectorSize). > Even though SuperWordMaxVector differs from MaxVectorSize only for certain X86 targets and this control flow is only executed for AARCH64 SVE targets currently. Yes, I agree with you to add the SuperWordMaxVectorSize reference. Thanks! I will change it. ------------- PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.org Mon Jun 20 01:11:14 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 20 Jun 2022 01:11:14 GMT Subject: Integrated: 8288397: AArch64: Fix register issues in SVE backend match rules In-Reply-To: <0vPwXBEnXX_w1358C7v4JCBZ_4uIGxokASDSkghGQS0=.01fa04ba-1213-4105-9734-efea6ff6293e@github.com> References: <0vPwXBEnXX_w1358C7v4JCBZ_4uIGxokASDSkghGQS0=.01fa04ba-1213-4105-9734-efea6ff6293e@github.com> Message-ID: On Wed, 15 Jun 2022 09:40:52 GMT, Xiaohong Gong wrote: > There are register usage issues in the sve backend match rules, which made the two added jtreg tests fail. > > The predicated vector "`not`" rules didn't use the same register for "`src`" and "`dst`", which is necessary to make sure the inactive lanes in "`dst`" save the same elements as "`src`". This patch fixes the rules by using the same register for "`dst`" and "`src`". > > And the input idx register in "`gatherL/scatterL`" rules was overwritten by the first unpack instruction. 
The same issue also existed in the partial and predicated gatherL/scatterL rules. This patch fixes them by saving the unpack results into a temp register and use it as the index for gather/scatter. This pull request has now been integrated. Changeset: ae030bcb Author: Xiaohong Gong URL: https://git.openjdk.org/jdk19/commit/ae030bcbc53fdfcfb748ae1e47e660f698b3fcb7 Stats: 334 lines in 4 files changed: 274 ins; 0 del; 60 mod 8288397: AArch64: Fix register issues in SVE backend match rules Reviewed-by: njian, ngasson ------------- PR: https://git.openjdk.org/jdk19/pull/17 From xgong at openjdk.org Mon Jun 20 01:19:01 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 20 Jun 2022 01:19:01 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3] In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 09:54:17 GMT, Jatin Bhateja wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: >> >> - Address review comments, revert changes for gatherL/scatterL rules >> - Merge branch 'jdk:master' into JDK-8286941 >> - Revert transformation from MaskAll to VectorMaskGen, address review comments >> - 8286941: Add mask IR for partial vector operations for ARM SVE > > src/hotspot/share/opto/vectornode.cpp line 1669: > >> 1667: if (Matcher::vector_needs_partial_operations(this, vt)) { >> 1668: return VectorNode::try_to_gen_masked_vector(phase, this, vt); >> 1669: } > > This is a parent node of TrueCount/FirstTrue/LastTrue and MaskToLong which perform mask querying operation on concrete predicate operands, a transformation here looks redundant to me. The main reason to add the transformation here is: the FirstTrue needs the reference to the real vector length for SVE, that we need to generate a predicate when the vector length is smaller than the max vector size. Please check the changes of `partial_op_sve_needed` in aarch64_sve.ad. 
------------- PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.org Mon Jun 20 01:24:03 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 20 Jun 2022 01:24:03 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3] In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 09:12:09 GMT, Jatin Bhateja wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: >> >> - Address review comments, revert changes for gatherL/scatterL rules >> - Merge branch 'jdk:master' into JDK-8286941 >> - Revert transformation from MaskAll to VectorMaskGen, address review comments >> - 8286941: Add mask IR for partial vector operations for ARM SVE > > src/hotspot/share/opto/vectornode.cpp line 1013: > >> 1011: } >> 1012: } >> 1013: return LoadVectorNode::Ideal(phase, can_reshape); > > These predicated nodes are concrete ones with fixed species and carry user specified mask, I am not clear why do we need a mask re-computation for predicated nodes. > > Higher lanes of predicated operand should already be zero and mask attached to predicated node should be correct by construction, since mask lane count is always equal to vector lane count. Actually we don't need to add an additional mask for these masked nodes for SVE. Please refer to the `partial_op_sve_needed` method in `aarch64_sve.ad`, which lists all the ops that needs this transformation. Thanks! `LoadVectorNode::Ideal(phase, can_reshape)` just because `LoadVectorMaskedNode` derived from `LoadVectorNode`, that it may derive its other transformations in future or for other platforms. > src/hotspot/share/opto/vectornode.cpp line 1033: > >> 1031: } >> 1032: } >> 1033: return StoreVectorNode::Ideal(phase, can_reshape); > > Same as above. Same as `LoadVectorMaskedNode`, `StoreVectorMaskedNode` doesn't need the transformation as well. 
------------- PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.org Mon Jun 20 02:04:00 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 20 Jun 2022 02:04:00 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3] In-Reply-To: References: Message-ID: On Mon, 20 Jun 2022 01:10:38 GMT, Xiaohong Gong wrote: >> src/hotspot/share/opto/vectornode.cpp line 864: >> >>> 862: // Generate a vector mask for vector operation whose vector length is lower than the >>> 863: // hardware supported max vector length. >>> 864: if (vt->length_in_bytes() < MaxVectorSize) { >> >> For completeness, length comparison check can be done against MIN(SuperWordMaxVectorSize, MaxVectorSize). >> Even though SuperWordMaxVectorSize differs from MaxVectorSize only for certain X86 targets and this control flow is only executed for AARCH64 SVE targets currently. > > Yes, I agree with you to add the SuperWordMaxVectorSize reference. Thanks! I will change it. I'm afraid that we cannot directly use `MIN(SuperWordMaxVectorSize, MaxVectorSize)` here. As far as I know, `SuperWordMaxVectorSize` is used to control the max vector limit specifically for auto-vectorization. It should not have any influence on the VectorAPI max vector size. So if the supported max vector size for VectorAPI is larger than for auto-vectorization, the transformation will be affected. For example, if SVE hardware supports a 128-byte max vector size but `SuperWordMaxVectorSize` is 64, the predicate will not be generated for vectors that are at least 64 bytes but smaller than 128 bytes. And I think x86 has a similar issue. I think we need a unified method to compute the `max_vector_size` that can handle the differences between superword and VectorAPI. And then all the original uses of max_vector_size should be replaced with it. WDYT?
------------- PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.org Mon Jun 20 02:19:45 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 20 Jun 2022 02:19:45 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3] In-Reply-To: References: Message-ID: On Mon, 20 Jun 2022 01:59:25 GMT, Xiaohong Gong wrote: >> Yes, I agree with you to add the SuperWordMaxVectorSize reference. Thanks! I will change it. > > I'm afraid that we cannot directly use `MIN(SuperWordMaxVectorSize, MaxVectorSize)` here. As far as I know, `SuperWordMaxVectorSize` is used to control the max vector limit specifically for auto-vectorization. It should not have any influence on the VectorAPI max vector size. So if the supported max vector size for VectorAPI is larger than for auto-vectorization, the transformation will be affected. For example, if SVE hardware supports a 128-byte max vector size but `SuperWordMaxVectorSize` is 64, the predicate will not be generated for vectors that are at least 64 bytes but smaller than 128 bytes. And I think x86 has a similar issue. I think we need a unified method to compute the `max_vector_size` that can handle the differences between superword and VectorAPI. And then all the original uses of max_vector_size should be replaced with it. WDYT? BTW, the max vector size used here should be taken from the hardware supported max vector size, which should be `MaxVectorSize` for SVE. For those vectors whose vector size is `SuperWordMaxVectorSize`, but smaller than the hardware supported max size, we still need to generate a predicate for them to make sure the results are right. So using `MaxVectorSize` is necessary here.
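The point being argued, that a predicate is needed whenever a vector is shorter than the hardware maximum and that it should cover only the vector's own lanes, can be modelled in a few lines of Java. This is a conceptual sketch with hypothetical names (`needsPredicate`, `laneMask`), not the C2 `VectorNode::try_to_gen_masked_vector` logic; the byte sizes in `main` are the ones used in the example from the discussion.

```java
import java.util.Arrays;

public class PartialVectorMaskSketch {
    // Mirrors the check quoted from vectornode.cpp: a mask is needed when the
    // vector is shorter than the hardware maximum (MaxVectorSize in the discussion).
    static boolean needsPredicate(int vectorBytes, int maxVectorBytes) {
        return vectorBytes < maxVectorBytes;
    }

    // Builds a lane mask sized for the hardware register in which only the
    // lanes inside the real vector length are active. Assumes
    // vectorBytes <= maxVectorBytes and both are multiples of elementBytes.
    static boolean[] laneMask(int vectorBytes, int maxVectorBytes, int elementBytes) {
        boolean[] mask = new boolean[maxVectorBytes / elementBytes];
        Arrays.fill(mask, 0, vectorBytes / elementBytes, true);
        return mask;
    }

    static int activeLanes(boolean[] mask) {
        int n = 0;
        for (boolean lane : mask) {
            if (lane) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // 64-byte vector of 4-byte ints on hardware with a 128-byte maximum.
        System.out.println(needsPredicate(64, 128));              // true
        boolean[] mask = laneMask(64, 128, 4);
        System.out.println(mask.length + " lanes, " + activeLanes(mask) + " active");
        // Comparing against MIN(64, 128) instead would treat this vector as full-width.
        System.out.println(needsPredicate(64, Math.min(64, 128))); // false
    }
}
```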
------------- PR: https://git.openjdk.org/jdk/pull/9037 From jbhateja at openjdk.org Mon Jun 20 03:15:47 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 20 Jun 2022 03:15:47 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3] In-Reply-To: References: Message-ID: <2w5U1U-s5HQaY0gXv4R2QBfQolE3mM__gQVJreldWaQ=.2763cc8a-86eb-4372-b47b-5d820018b739@github.com> On Mon, 20 Jun 2022 02:16:31 GMT, Xiaohong Gong wrote: >> I'm afraid that we cannot directly use `MIN(SuperWordMaxVectorSize, MaxVectorSize)` here. As far as I know, `SuperWordMaxVectorSize` is used to control the max vector limit specifically for auto-vectorization. It should not have any influence on the VectorAPI max vector size. So if the supported max vector size for VectorAPI is larger than for auto-vectorization, the transformation will be affected. For example, if SVE hardware supports a 128-byte max vector size but `SuperWordMaxVectorSize` is 64, the predicate will not be generated for vectors that are at least 64 bytes but smaller than 128 bytes. And I think x86 has a similar issue. I think we need a unified method to compute the `max_vector_size` that can handle the differences between superword and VectorAPI. And then all the original uses of max_vector_size should be replaced with it. WDYT? > > BTW, the max vector size used here should be taken from the hardware supported max vector size, which should be `MaxVectorSize` for SVE. For those vectors whose vector size is `SuperWordMaxVectorSize`, but smaller than the hardware supported max size, we still need to generate a predicate for them to make sure the results are right. So using `MaxVectorSize` is necessary here. VectorNode::try_to_gen_masked_vector is a common routine which will get called during idealization of all the child vector nodes (generated either by autovectorizer or vectorAPI) unless a child overrides its ideal routine.
Will this not result in out-of-bounds memory writes for storeVector, since the autovectorizer works under the influence of SuperWordMaxVectorSize, which could be less than MaxVectorSize, but your mask computation only considers MaxVectorSize and hence may generate wider masks than desired? ------------- PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.org Mon Jun 20 03:31:44 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 20 Jun 2022 03:31:44 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3] In-Reply-To: <2w5U1U-s5HQaY0gXv4R2QBfQolE3mM__gQVJreldWaQ=.2763cc8a-86eb-4372-b47b-5d820018b739@github.com> References: <2w5U1U-s5HQaY0gXv4R2QBfQolE3mM__gQVJreldWaQ=.2763cc8a-86eb-4372-b47b-5d820018b739@github.com> Message-ID: On Mon, 20 Jun 2022 03:12:24 GMT, Jatin Bhateja wrote: >> BTW, the max vector size used here should be taken from the hardware supported max vector size, which should be `MaxVectorSize` for SVE. For those vectors whose vector size is `SuperWordMaxVectorSize`, but smaller than the hardware supported max size, we still need to generate a predicate for them to make sure the results are right. So using `MaxVectorSize` is necessary here. > > VectorNode::try_to_gen_masked_vector is a common routine which will get called during idealization of all the child vector nodes (generated either by autovectorizer or vectorAPI) unless a child overrides its ideal routine. > > Will this not result in out-of-bounds memory writes for storeVector, since the autovectorizer works under the influence of SuperWordMaxVectorSize, which could be less than MaxVectorSize, but your mask computation only considers MaxVectorSize and hence may generate wider masks than desired? The mask is generated based on the real vector length, which guarantees that only the vector lanes inside its valid vector length will do the operations. Higher lanes do nothing.
So if a VectorStore has a vector length which equals `SuperWordMaxVectorSize`, but the MaxVectorSize is larger than the `SuperWordMaxVectorSize`, a mask will be generated based on the real length `SuperWordMaxVectorSize`. And only lanes under the `SuperWordMaxVectorSize` will be stored. So I cannot see the influence on the out-of-bounds issue. Could you please give an example apart from the x86 cases? Thanks so much! BTW, I think flag `SuperWordMaxVectorSize` should be equal to `MaxVectorSize` by default for other platforms instead of 64. If a platform (like SVE) supports >64 bytes max vector size for auto-vectorization, it can only generate the vectors with 64 bytes vector size. This seems unreasonable. So do you have any plan for refactoring this flag, or removing the usages of it in future? Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/9037 From jbhateja at openjdk.org Mon Jun 20 04:05:02 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 20 Jun 2022 04:05:02 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3] In-Reply-To: References: <2w5U1U-s5HQaY0gXv4R2QBfQolE3mM__gQVJreldWaQ=.2763cc8a-86eb-4372-b47b-5d820018b739@github.com> Message-ID: On Mon, 20 Jun 2022 03:28:05 GMT, Xiaohong Gong wrote: > The mask is generated based on the real vector length, which guarantees that only the vector lanes inside its valid vector length will do the operations. Higher lanes do nothing. So if a VectorStore has a vector length which equals `SuperWordMaxVectorSize`, but the MaxVectorSize is larger than the `SuperWordMaxVectorSize`, a mask will be generated based on the real length `SuperWordMaxVectorSize`. And only lanes under the `SuperWordMaxVectorSize` will be stored. So I cannot see the influence on the out-of-bounds issue. Could you please give an example apart from the x86 cases? Thanks so much!
> Correct, got it > BTW, I think flag `SuperWordMaxVectorSize` should be equal to `MaxVectorSize` by default for other platforms instead of 64. If a platform (like SVE) supports >64 bytes max vector size for auto-vectorization, it can only generate the vectors with 64 bytes vector size. This seems unreasonable. So do you have any plan for refactoring this flag, or removing the usages of it in future? Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/9037 From thartmann at openjdk.org Mon Jun 20 05:39:03 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Jun 2022 05:39:03 GMT Subject: RFR: 8288564: C2: LShiftLNode::Ideal produces wrong result after JDK-8278114 [v2] In-Reply-To: References: Message-ID: On Fri, 17 Jun 2022 08:46:39 GMT, Christian Hagedorn wrote: >> [JDK-8278114](https://bugs.openjdk.org/browse/JDK-8278114) added the following transformation for integer and long left shifts: >> >> "(x + x) << c0" into "x << (c0 + 1)" >> >> However, in the long shift case, this transformation is not correct if `c0` is 63: >> >> >> (x + x) << 63 = 2x << 63 >> >> while >> >> (x + x) << 63 --transform--> x << 64 = x << 0 = x >> >> which is not the same. For example, if `x = 1`: >> >> 2x << 63 = 2 << 63 = 0 != 1 >> >> This optimization does not account for the fact that `x << 64` is the same as `x << 0 = x`. According to the [Java spec, chapter 15.19](https://docs.oracle.com/javase/specs/jls/se18/html/jls-15.html#jls-15.19), we only consider the six lowest-order bits of the right-hand operand (i.e. `"right-hand operand" & 0b111111`). Therefore, `x << 64` is the same as `x << 0` (`64 & 0b111111 = 0b1000000 & 0b0111111 = 0`).
>> >> Integer shifts are not affected because we do not apply this transformation if `c0 >= 16`: >> >> https://github.com/openjdk/jdk19/blob/729164f53499f146579a48ba1b466c687802f330/src/hotspot/share/opto/mulnode.cpp#L810-L817 >> >> The fix I propose is to not apply this optimization for long left shifts if `c0 == 63`. I've added an additional sanity assertion for integer left shifts just in case this optimization is moved at some point and ends up outside the check for `con < 16`. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with two additional commits since the last revision: > > - typo > - change assert to comment Nice analysis. Looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk19/pull/29 From thartmann at openjdk.org Mon Jun 20 05:52:54 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Jun 2022 05:52:54 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding In-Reply-To: References: Message-ID: On Fri, 17 Jun 2022 22:37:28 GMT, Dean Long wrote: > The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out. What instruction will the zero-shift be matched with then? test/hotspot/jtreg/compiler/codegen/ShiftByZero.java line 29: > 27: * @summary Test shift by 0 > 28: * @library /test/lib > 29: * @run main compiler.codegen.ShiftByZero I don't think these two lines are needed.
test/hotspot/jtreg/compiler/codegen/ShiftByZero.java line 69: > 67: > 68: public static void main(String[] strArr) { > 69: for (int i = 0; i < 20_000; i++ ) { Suggestion: for (int i = 0; i < 20_000; i++) { ------------- PR: https://git.openjdk.org/jdk19/pull/40 From thartmann at openjdk.org Mon Jun 20 05:55:00 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Jun 2022 05:55:00 GMT Subject: RFR: 8284404: Too aggressive sweeping with Loom [v2] In-Reply-To: References: Message-ID: On Fri, 17 Jun 2022 13:51:59 GMT, Erik ?sterlund wrote: >> The normal sweeping heuristics trigger sweeping whenever 0.5% of the reserved code cache could have died. Normally that is fine, but with loom such sweeping requires a full GC cycle, as stacks can now be in the Java heap as well. In that context, 0.5% does seem to be a bit too trigger happy. So this patch adjusts that default when using loom to 10x higher. >> If you run something like jython which spins up a lot of code, it unsurprisingly triggers a lot less GCs due to code cache pressure. > > Erik ?sterlund has updated the pull request incrementally with one additional commit since the last revision: > > Add comment Marked as reviewed by thartmann (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/8673 From thartmann at openjdk.org Mon Jun 20 06:01:54 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Jun 2022 06:01:54 GMT Subject: RFR: 8287801: Fix test-bugs related to stress flags In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 14:44:19 GMT, Emanuel Peter wrote: > I recently ran many tests with additional stress flags. While there were a few bugs I found, most of the issues were test bugs. > > I found a list of tests that are problematic with specific stress flags, I adjusted the tests so that they can now be run with the flag. Often I just fix the flag at the default value, so that setting it from the outside does not affect the test. 
> > Below I explain for each test how and why I adjusted the test. > > - test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyNoInitDeopt.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Disabling traps by setting limit of traps to zero, means some optimistic optimizations are not made, and can therefore not lead to deoptimization. The test expects deoptimization due to traps, so we need to have them on. > - test/hotspot/jtreg/compiler/c2/cr7200264/TestDriver.java > - used by TestSSE2IntVect.java and TestSSE4IntVect.java > - test/hotspot/jtreg/compiler/c2/cr7200264/TestSSE2IntVect.java > - Problem Flags: `-XX:StressLongCountedLoop=2000000` > - Test checks the IR, and if we convert loops to long loops, some operations will not show up anymore, the test fails. Disable StressLongCountedLoop. > - test/hotspot/jtreg/compiler/c2/cr7200264/TestSSE4IntVect.java > - See TestSSE2IntVect.java > - test/hotspot/jtreg/compiler/c2/irTests/blackhole/BlackholeStoreStoreEATest.java > - Problem Flags: `-XX:-UseTLAB` > - No thread local allocation (TLAB) means IR is changed. Test checks for MemBarStoreStore, which is missing without TLAB. Solution: always have TLAB on. > - test/hotspot/jtreg/compiler/cha/AbstractRootMethod.java > - Problem Flags: `-XX:+StressMethodHandleLinkerInlining` > - Messes with recompilation, makes assert fail that expects recompilation. Must disable flag. > - test/hotspot/jtreg/compiler/cha/DefaultRootMethod.java > - see AbstractRootMethod.java > - test/hotspot/jtreg/compiler/intrinsics/klass/CastNullCheckDroppingsTest.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Need traps, otherwise some optimistic optimisations are not made, and then they also are not trapped and deoptimized. > - Problem Flags: `-XX:TypeProfileLevel=222` > - Profiling also messes with optimizations / deoptimization. 
> - Problem Flags: `-XX:+StressReflectiveCode` > - Messes with types at allocation, which messes with optimizations. > - Problem Flags: `-XX:-UncommonNullCast` > - Is required for trapping in null checks. > - Problem Flags: `-XX:+StressMethodHandleLinkerInlining` > - Messes with inlining / optimization - turn it off. > - test/hotspot/jtreg/compiler/jvmci/compilerToVM/IsMatureVsReprofileTest.java > - Problem Flags: `-XX:Tier4BackEdgeThreshold=1 -Xbatch -XX:-TieredCompilation` > - Lead to OSR compilation in loop calling `testMethod`, which is expected to be compiled. But with the OSR compilation, that function is inlined, and never compiled. Solution was to make sure we only compile `testMethod`. > - test/hotspot/jtreg/compiler/jvmci/compilerToVM/ReprofileTest.java > - Problem Flags: `-XX:TypeProfileLevel=222` > - Changing profile flags messes with test, which assumes default behavior. > - test/hotspot/jtreg/compiler/profiling/TestTypeProfiling.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Need traps to check for optimistic optimizations. > - test/hotspot/jtreg/compiler/rangechecks/TestExplicitRangeChecks.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Need traps to check for optimistic optimizations. > - test/hotspot/jtreg/compiler/rangechecks/TestLongRangeCheck.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Need traps to check for optimistic optimizations. > - test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckSmearing.java > - Problem Flags: `-XX:TieredStopAtLevel=3 -XX:+StressLoopInvariantCodeMotion` > - Test expects to be run at compilation tier 4 / C2, so must fix it at that in requirement. 
> - test/hotspot/jtreg/compiler/uncommontrap/Decompile.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0 -XX:PerBytecodeTrapLimit=0` > - The tests if we trap and decompile after we call member functions of a class that we did not use before. If we disable traps, then internally it uses a virtual call, and no deoptimization is required - but the test expects trapping and deoptimization. Solution: set trap limits to default. > - Problem Flags: `-XX:TypeProfileLevel=222` > - Changing profiling behavior also messes with deoptimization - disable it. > - test/hotspot/jtreg/compiler/uncommontrap/TestUnstableIfTrap.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Test expects traps, so we must ensure the limits are at default. Nice analysis. Please make sure to run all affected tests with a product VM build. test/hotspot/jtreg/compiler/cha/AbstractRootMethod.java line 41: > 39: * -Xbatch -Xmixed -XX:+WhiteBoxAPI > 40: * -XX:-TieredCompilation > 41: * -XX:-StressMethodHandleLinkerInlining Wouldn't this fail with a product VM build because the flag is debug only? ------------- Changes requested by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9186 From thartmann at openjdk.org Mon Jun 20 06:24:55 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Jun 2022 06:24:55 GMT Subject: RFR: 8263384: IGV: Outline should highlight the Graph that has focus [v2] In-Reply-To: References: Message-ID: <64QOdlWwt_6H8mOPHB6AJXgGFlhf-24OX8lrBBybTq4=.5406b1d7-681f-4e8f-89b7-853174be9b96@github.com> On Thu, 16 Jun 2022 10:22:33 GMT, Roberto Casta?eda Lozano wrote: >> This changeset eases navigation within and across graph groups by highlighting the focused graph in the Outline window. 
If the user changes the focus to another graph window, or moves to the previous or next graph within the same window, the newly focused graph is automatically highlighted in the Outline window. This is implemented by maintaining a static map from opened graphs to their corresponding [NetBeans nodes](https://urldefense.com/v3/__https://netbeans.apache.org/tutorials/nbm-selection-2.html__;!!ACWV5N9M2RV99hQ!OgLYtNnI1jrRRnn1yGPthyOWlX4FftUeaXiqy8_Sf6-iZjTgnljiBUh1zG1mfcSrQ5u8u7Zgw28xOHQX-vbcGF_OJ-MYDVYGLg$ ). The Outline window uses the map to select, on a graph focus change, the NetBeans node of the newly focused graph that should be highlighted. >> >> Tested manually by opening simultaneously tens of graphs from different groups and switching the focus randomly. > > Roberto Casta?eda Lozano has updated the pull request incrementally with three additional commits since the last revision: > > - Highlight active graph when the Outline window is re-opened > - Avoid unnecessary setting of 'result' to null > - Wait for last graph update before highlighting it Looks reasonable. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9167 From thartmann at openjdk.org Mon Jun 20 06:26:55 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Jun 2022 06:26:55 GMT Subject: RFR: 8288480: IGV: toolbar action is not applied to the focused graph In-Reply-To: References: Message-ID: On Wed, 15 Jun 2022 13:45:53 GMT, Roberto Casta?eda Lozano wrote: > When multiple graphs are displayed simultaneously in split windows, the following toolbar actions are always applied to the same graph, regardless of which graph window is focused: > > - search nodes and blocks > - extract node > - hide node > - show all nodes > - zoom in > - zoom out > > This changeset ensures that each of the above actions is only applied within its corresponding graph window. 
This is achieved by applying the actions to the graph window that is currently activated (`EditorTopComponent.getRegistry().getActivated()`) instead of the first matching occurrence in `WindowManager.getDefault().getModes()`. > > The changeset makes it practical, for example, to explore different views of the same graph simultaneously, as illustrated here: > > ![multi-view](https://urldefense.com/v3/__https://user-images.githubusercontent.com/8792647/173841115-084c6396-3843-4d9b-9951-f93c932100c3.png__;!!ACWV5N9M2RV99hQ!OmF2K6J-4GLiyGXzQAJ4VM8NJArA-0buhGCR0JBD4GOthyPnG1ZlVd-T7Ta-T6Rv0LdQ7FzuCOmXBDyM6f6o4hiDETfeuviOow$ ) > > Tested manually by triggering the above actions within multiple split graph windows and asserting that they are only applied to their corresponding graphs. Looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9169 From haosun at openjdk.org Mon Jun 20 06:30:57 2022 From: haosun at openjdk.org (Hao Sun) Date: Mon, 20 Jun 2022 06:30:57 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding In-Reply-To: References: Message-ID: On Fri, 17 Jun 2022 22:37:28 GMT, Dean Long wrote: > The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out. Instead of introducing `immI_positive`, I wonder if we can generate `orr dst src` for zero shift count, just as the SVE part does. 
E.g., https://github.com/openjdk/jdk19/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L3667 ------------- PR: https://git.openjdk.org/jdk19/pull/40 From eosterlund at openjdk.org Mon Jun 20 06:46:19 2022 From: eosterlund at openjdk.org (Erik Österlund) Date: Mon, 20 Jun 2022 06:46:19 GMT Subject: Integrated: 8284404: Too aggressive sweeping with Loom In-Reply-To: References: Message-ID: <3iQJN-XBdCHoJe0Rrwp9ZVAMd0MFzeywtWr6abrgNp0=.273270bc-16e1-45a1-b3dc-7896375b9d74@github.com> On Thu, 12 May 2022 07:30:39 GMT, Erik Österlund wrote: > The normal sweeping heuristics trigger sweeping whenever 0.5% of the reserved code cache could have died. Normally that is fine, but with loom such sweeping requires a full GC cycle, as stacks can now be in the Java heap as well. In that context, 0.5% does seem to be a bit too trigger happy. So this patch adjusts that default when using loom to 10x higher. > If you run something like jython which spins up a lot of code, it unsurprisingly triggers a lot less GCs due to code cache pressure. This pull request has now been integrated.
Changeset: 7d4df6a8 Author: Erik Österlund URL: https://git.openjdk.org/jdk/commit/7d4df6a83f6333e0e73686b807ee5d4b0ac10cd2 Stats: 8 lines in 1 file changed: 7 ins; 0 del; 1 mod 8284404: Too aggressive sweeping with Loom Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/8673 From chagedorn at openjdk.org Mon Jun 20 06:48:41 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 20 Jun 2022 06:48:41 GMT Subject: RFR: 8288564: C2: LShiftLNode::Ideal produces wrong result after JDK-8278114 [v2] In-Reply-To: References: Message-ID: On Fri, 17 Jun 2022 08:46:39 GMT, Christian Hagedorn wrote: >> [JDK-8278114](https://bugs.openjdk.org/browse/JDK-8278114) added the following transformation for integer and long left shifts: >> >> "(x + x) << c0" into "x << (c0 + 1)" >> >> However, in the long shift case, this transformation is not correct if `c0` is 63: >> >> >> (x + x) << 63 = 2x << 63 >> >> while >> >> (x + x) << 63 --transform--> x << 64 = x << 0 = x >> >> which is not the same. For example, if `x = 1`: >> >> 2x << 63 = 2 << 63 = 0 != 1 >> >> This optimization does not account for the fact that `x << 64` is the same as `x << 0 = x`. According to the [Java spec, chapter 15.19](https://docs.oracle.com/javase/specs/jls/se18/html/jls-15.html#jls-15.19), we only consider the six lowest-order bits of the right-hand operand (i.e. `"right-hand operand" & 0b111111`). Therefore, `x << 64` is the same as `x << 0` (`64 = 0b1000000`, and `0b1000000 & 0b0111111 = 0`). >> >> Integer shifts are not affected because we do not apply this transformation if `c0 >= 16`: >> >> https://github.com/openjdk/jdk19/blob/729164f53499f146579a48ba1b466c687802f330/src/hotspot/share/opto/mulnode.cpp#L810-L817 >> >> The fix I propose is to not apply this optimization for long left shifts if `c0 == 63`.
I've added an additional sanity assertion for integer left shifts just in case this optimization is moved at some point and ends up outside the check for `con < 16`. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with two additional commits since the last revision: > > - typo > - change assert to comment Thanks Tobias for your review! ------------- PR: https://git.openjdk.org/jdk19/pull/29 From chagedorn at openjdk.org Mon Jun 20 06:51:44 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 20 Jun 2022 06:51:44 GMT Subject: Integrated: 8288564: C2: LShiftLNode::Ideal produces wrong result after JDK-8278114 In-Reply-To: References: Message-ID: On Thu, 16 Jun 2022 15:27:42 GMT, Christian Hagedorn wrote: > [JDK-8278114](https://bugs.openjdk.org/browse/JDK-8278114) added the following transformation for integer and long left shifts: > > "(x + x) << c0" into "x << (c0 + 1)" > > However, in the long shift case, this transformation is not correct if `c0` is 63: > > > (x + x) << 63 = 2x << 63 > > while > > (x + x) << 63 --transform--> x << 64 = x << 0 = x > > which is not the same. For example, if `x = 1`: > > 2x << 63 = 2 << 63 = 0 != 1 > > This optimization does not account for the fact that `x << 64` is the same as `x << 0 = x`. According to the [Java spec, chapter 15.19](https://docs.oracle.com/javase/specs/jls/se18/html/jls-15.html#jls-15.19), we only consider the six lowest-order bits of the right-hand operand (i.e. `"right-hand operand" & 0b111111`). Therefore, `x << 64` is the same as `x << 0` (`64 = 0b1000000`, and `0b1000000 & 0b0111111 = 0`).
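[Editor's sketch, not part of the original message.] The `c0 == 63` case described in the quoted text can be checked in plain Java, because the language masks the shift count of a `long` shift to its six low-order bits. The class and method names below are invented for illustration; this does not exercise C2 or `LShiftLNode::Ideal`, it only compares the two source-level expressions.

```java
// Illustrative only: compares "(x + x) << c0" with "x << (c0 + 1)" for long values.
// ShiftMaskDemo, original() and transformed() are hypothetical names.
class ShiftMaskDemo {
    // The expression as written in the source program.
    static long original(long x, int c0) {
        return (x + x) << c0;
    }

    // The expression after the transformation added by JDK-8278114.
    static long transformed(long x, int c0) {
        return x << (c0 + 1);
    }

    public static void main(String[] args) {
        long x = 1L;
        // For c0 = 62 both forms agree: the shift count 63 is still in range.
        System.out.println(original(x, 62) == transformed(x, 62)); // true
        // For c0 = 63 the transformed count is 64, which is masked to 0.
        System.out.println(original(x, 63));    // 0
        System.out.println(transformed(x, 63)); // 1
    }
}
```

The two forms diverge only at `c0 == 63`, which matches the condition the proposed fix excludes.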
> > Integer shifts are not affected because we do not apply this transformation if `c0 >= 16`: > > https://github.com/openjdk/jdk19/blob/729164f53499f146579a48ba1b466c687802f330/src/hotspot/share/opto/mulnode.cpp#L810-L817 > > The fix I propose is to not apply this optimization for long left shifts if `c0 == 63`. I've added an additional sanity assertion for integer left shifts just in case this optimization is moved at some point and ends up outside the check for `con < 16`. > > Thanks, > Christian This pull request has now been integrated. Changeset: ed714af8 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk19/commit/ed714af854d79fb2b47849f6efdf0c26686b58b3 Stats: 39 lines in 2 files changed: 37 ins; 0 del; 2 mod 8288564: C2: LShiftLNode::Ideal produces wrong result after JDK-8278114 Reviewed-by: kvn, iveresov, thartmann ------------- PR: https://git.openjdk.org/jdk19/pull/29 From rcastanedalo at openjdk.org Mon Jun 20 07:29:58 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 20 Jun 2022 07:29:58 GMT Subject: RFR: 8288480: IGV: toolbar action is not applied to the focused graph In-Reply-To: References: <5HwYpbKPrqFGn2_qlqJeLdYWC1ZHAgzZzcjG5c6wiIk=.e537ab9b-9d0e-4c2b-9b15-028dc0f97c10@github.com> Message-ID: On Fri, 17 Jun 2022 09:24:44 GMT, Roberto Castañeda Lozano wrote: > Looks good. Thanks, Tobias!
------------- PR: https://git.openjdk.org/jdk/pull/9169 From rcastanedalo at openjdk.org Mon Jun 20 07:29:56 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 20 Jun 2022 07:29:56 GMT Subject: RFR: 8263384: IGV: Outline should highlight the Graph that has focus [v2] In-Reply-To: <64QOdlWwt_6H8mOPHB6AJXgGFlhf-24OX8lrBBybTq4=.5406b1d7-681f-4e8f-89b7-853174be9b96@github.com> References: <64QOdlWwt_6H8mOPHB6AJXgGFlhf-24OX8lrBBybTq4=.5406b1d7-681f-4e8f-89b7-853174be9b96@github.com> Message-ID: On Mon, 20 Jun 2022 06:22:52 GMT, Tobias Hartmann wrote: > Looks reasonable. Thanks for reviewing, Tobias! ------------- PR: https://git.openjdk.org/jdk/pull/9167 From rcastanedalo at openjdk.org Mon Jun 20 07:29:59 2022 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 20 Jun 2022 07:29:59 GMT Subject: Integrated: 8263384: IGV: Outline should highlight the Graph that has focus In-Reply-To: References: Message-ID: On Wed, 15 Jun 2022 12:47:54 GMT, Roberto Castañeda Lozano wrote: > This changeset eases navigation within and across graph groups by highlighting the focused graph in the Outline window. If the user changes the focus to another graph window, or moves to the previous or next graph within the same window, the newly focused graph is automatically highlighted in the Outline window. This is implemented by maintaining a static map from opened graphs to their corresponding [NetBeans nodes](https://netbeans.apache.org/tutorials/nbm-selection-2.html). The Outline window uses the map to select, on a graph focus change, the NetBeans node of the newly focused graph that should be highlighted. > > Tested manually by opening simultaneously tens of graphs from different groups and switching the focus randomly.
This pull request has now been integrated. Changeset: 02da5f99 Author: Roberto Casta?eda Lozano URL: https://git.openjdk.org/jdk/commit/02da5f9970ae02e0a67a8bae7cddefe9f3a17ce4 Stats: 64 lines in 2 files changed: 58 ins; 0 del; 6 mod 8263384: IGV: Outline should highlight the Graph that has focus Reviewed-by: xliu, chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/9167 From thartmann at openjdk.org Mon Jun 20 07:36:03 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Jun 2022 07:36:03 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v12] In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 20:42:22 GMT, Xin Liu wrote: >> I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. >> >> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://urldefense.com/v3/__https://github.com/openjdk/jdk/pull/2401/files*diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea__;Iw!!ACWV5N9M2RV99hQ!LBkYuxRT3i7yjqe-yTfuJfpv7Mr-Jt-kEMHUHsWt9TCrYawzpPy2KOpBjIu5brE93J935ys3MGr69j87NHBvnWARACQsh_YURg$ ), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. >> >> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. 
>> >> Before: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op >> >> After: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op >> >> Testing >> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. > > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > monior change for code style. Looks good overall. Some comments/questions: - Why can't we remove traps that have been modified? - I'm wondering how useful `Compile::print_statistics()` really is. Is it worth extending it? Is anyone using it? - Do you need to check for unstable if traps in `Node::destruct`? src/hotspot/share/opto/compile.cpp line 1929: > 1927: int next_bci = trap->next_bci(); > 1928: > 1929: if (next_bci != -1 && !trap->modified()) { How can it be already modified? We are only processing each trap once, right? src/hotspot/share/opto/ifnode.cpp line 842: > 840: if (!igvn->C->too_many_traps(dom_method, dom_bci, Deoptimization::Reason_unstable_fused_if) && > 841: !igvn->C->too_many_traps(dom_method, dom_bci, Deoptimization::Reason_range_check) && > 842: igvn->C->remove_unstable_if_trap(dom_unc)) { This should be moved to `IfNode::merge_uncommon_traps`. src/hotspot/share/opto/parse.hpp line 607: > 605: > 606: // Specialized uncommon_trap of unstable_if, we have 2 optimizations for them: > 607: // 1. suppress trivial Unstable_If traps Where is this done? src/hotspot/share/opto/parse.hpp line 609: > 607: // 1. suppress trivial Unstable_If traps > 608: // 2. use next_bci of _path to update live locals. > 609: class UnstableIfTrap { What about moving this information into `CallStaticJavaNode`?
src/hotspot/share/opto/parse.hpp line 622: > 620: } > 621: > 622: // The starting point of the pruned block, where control should go Suggestion: // The starting point of the pruned block, where control goes src/hotspot/share/opto/parse.hpp line 636: > 634: } > 635: > 636: Parse::Block* path() const { This method is not used. src/hotspot/share/opto/parse.hpp line 643: > 641: // if _path has only one predecessor, it is trivial if this block is small(1~2 bytecodes) > 642: // or if _path has more than one predecessor and has been parsed, _unc does not mask out any real code. > 643: bool is_trivial() const { But these properties are not checked by the method, right? Also, the code is only used in debug, should it be guarded? ------------- Changes requested by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/8545 From rcastanedalo at openjdk.org Mon Jun 20 07:38:57 2022 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 20 Jun 2022 07:38:57 GMT Subject: Integrated: 8288480: IGV: toolbar action is not applied to the focused graph In-Reply-To: References: Message-ID: On Wed, 15 Jun 2022 13:45:53 GMT, Roberto Casta?eda Lozano wrote: > When multiple graphs are displayed simultaneously in split windows, the following toolbar actions are always applied to the same graph, regardless of which graph window is focused: > > - search nodes and blocks > - extract node > - hide node > - show all nodes > - zoom in > - zoom out > > This changeset ensures that each of the above actions is only applied within its corresponding graph window. This is achieved by applying the actions to the graph window that is currently activated (`EditorTopComponent.getRegistry().getActivated()`) instead of the first matching occurrence in `WindowManager.getDefault().getModes()`. 
> > The changeset makes it practical, for example, to explore different views of the same graph simultaneously, as illustrated here: > > ![multi-view](https://urldefense.com/v3/__https://user-images.githubusercontent.com/8792647/173841115-084c6396-3843-4d9b-9951-f93c932100c3.png__;!!ACWV5N9M2RV99hQ!OI7MWyhgPcxaofvuVpr2naXNNL3ZQX9HOoggg8sU36CrGTDyiffsKqU0ihtxbEemR8nXMH-BHz0m5wtdgCflI3jc53wNVcHEXt4FzQ$ ) > > Tested manually by triggering the above actions within multiple split graph windows and asserting that they are only applied to their corresponding graphs. This pull request has now been integrated. Changeset: f62b2bd9 Author: Roberto Casta?eda Lozano URL: https://git.openjdk.org/jdk/commit/f62b2bd9cda952b205ee03151cc58c95f588a742 Stats: 8 lines in 1 file changed: 0 ins; 7 del; 1 mod 8288480: IGV: toolbar action is not applied to the focused graph Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/9169 From thartmann at openjdk.org Mon Jun 20 07:49:57 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Jun 2022 07:49:57 GMT Subject: RFR: 8288022: c2: Transform (CastLL (AddL into (AddL (CastLL when possible In-Reply-To: References: Message-ID: On Mon, 13 Jun 2022 08:26:47 GMT, Roland Westrelin wrote: > This implements a transformation that already exists for CastII and > ConvI2L and helps code generation. The tricky part is that: > > (CastII (AddI into (AddI (CastII > > is performed by first computing the bounds of the type of the AddI. To > protect against overflow, jlong variables are used. With CastLL/AddL > nodes there's no larger integer type to promote the bounds to. As a > consequence the logic in the patch explicitly tests for overflow. That > logic is shared by the int and long cases. The previous logic for the > int cases that promotes values to long is used as verification. 
> > This patch also widens the type of CastLL nodes after loop opts the > way it's done for CastII/ConvI2L to allow commoning of nodes. > > This was observed to help with Memory Segment micro benchmarks. Looks correct. A second review would be good. test/hotspot/jtreg/compiler/c2/irTests/TestPushAddThruCast.java line 68: > 66: } > 67: > 68: } Suggestion: } } ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9139 From xgong at openjdk.org Mon Jun 20 07:58:34 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 20 Jun 2022 07:58:34 GMT Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations Message-ID: This patch adds the following transformations for vector logic operations such as "`AndV, OrV, XorV`", including:
(AndV v (Replicate m1)) => v
(AndV v (Replicate zero)) => Replicate zero
(AndV v v) => v
(OrV v (Replicate m1)) => Replicate m1
(OrV v (Replicate zero)) => v
(OrV v v) => v
(XorV v v) => Replicate zero
where "`m1`" is the integer constant -1, together with the same optimizations for vector mask operations like "`AndVMask, OrVMask, XorVMask`".
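[Editor's sketch, not part of the original message.] Each of the vector transformations listed above follows from a scalar bitwise identity that holds independently in every lane. The check below uses plain `int` values as stand-ins for lanes; it is an assumption-laden illustration (the class and method names are invented) and does not use the Vector API or C2's `AndV`/`OrV`/`XorV` nodes.

```java
// Illustrative only: verifies, per lane, the scalar identities behind the
// AndV/OrV/XorV transformations. Names here are hypothetical.
class VectorLogicIdentities {
    // Returns true if all seven identities hold for one lane value.
    static boolean holdsForLane(int v) {
        final int m1 = -1;   // all bits set, the "Replicate m1" lane value
        final int zero = 0;  // the "Replicate zero" lane value
        return (v & m1) == v        // (AndV v (Replicate m1))   => v
            && (v & zero) == zero   // (AndV v (Replicate zero)) => Replicate zero
            && (v & v) == v         // (AndV v v)                => v
            && (v | m1) == m1       // (OrV v (Replicate m1))    => Replicate m1
            && (v | zero) == v      // (OrV v (Replicate zero))  => v
            && (v | v) == v         // (OrV v v)                 => v
            && (v ^ v) == zero;     // (XorV v v)                => Replicate zero
    }

    // Applies the per-lane check to every element of a simulated vector.
    static boolean holdsForAllLanes(int[] lanes) {
        for (int lane : lanes) {
            if (!holdsForLane(lane)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] lanes = {0, 1, -1, 42, Integer.MIN_VALUE, Integer.MAX_VALUE, 0x55555555, 0xAAAAAAAA};
        System.out.println(holdsForAllLanes(lanes)); // true
    }
}
```

Because the identities hold for every lane value, replacing the vector operation with its operand or with a replicated constant does not change the result.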
------------- Commit messages: - 8288294: [vector] Add Identity/Ideal transformations for vector logic operations Changes: https://git.openjdk.org/jdk/pull/9211/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9211&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288294 Stats: 640 lines in 4 files changed: 630 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/9211.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9211/head:pull/9211 PR: https://git.openjdk.org/jdk/pull/9211 From thartmann at openjdk.org Mon Jun 20 08:00:08 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Jun 2022 08:00:08 GMT Subject: RFR: 8288467: remove memory_operand assert for spilled instructions In-Reply-To: References: Message-ID: On Fri, 17 Jun 2022 09:32:40 GMT, Emanuel Peter wrote: > In [JDK-8282555](https://bugs.openjdk.org/browse/JDK-8282555) I added this assert, because on x64 this seems to always hold. But it turns out there are instructions on x86 (32bit) that violate this assumption. > > **Why it holds on x64** > It seems we only ever do one read or one write per instruction. Instructions with multiple memory operands are extremely rare. > 1. we spill from register to memory: we land in the if case, where the cisc node has an additional input slot for the memory edge. > 2. we spill from register to stackSlot: no additional input slot is reserved, we land in else case and add an additional precedence edge. > > **Why it is violated on x86 (32bit)** > We have additional cases that land in the else case. For example spilling `src1` from `addFPR24_reg_mem` to `addFPR24_mem_cisc`. 
> https://urldefense.com/v3/__https://github.com/openjdk/jdk19/blob/53bf1bfdabb79b37afedd09051d057f9eea620f2/src/hotspot/cpu/x86/x86_32.ad*L10325-L10327__;Iw!!ACWV5N9M2RV99hQ!KP0-SKIjsleD5vwpg0AuGgQV7lopMaKXRcDbI06X2m1H5WDyhgY5RE8x5KQ0ZkTokSLHg5mWjq-N1YW1QVvPhzYBTgbHrSvQog$ https://urldefense.com/v3/__https://github.com/openjdk/jdk19/blob/53bf1bfdabb79b37afedd09051d057f9eea620f2/src/hotspot/cpu/x86/x86_32.ad*L10368-L10370__;Iw!!ACWV5N9M2RV99hQ!KP0-SKIjsleD5vwpg0AuGgQV7lopMaKXRcDbI06X2m1H5WDyhgY5RE8x5KQ0ZkTokSLHg5mWjq-N1YW1QVvPhzYBTgardIcWcQ$ > We land in the else case, because both have 2 inputs, thus `oper_input_base() == 2`. > And both have memory operands, so the assert must fail. > > **Solutions** > 1. Remove the Assert, as it is incorrect. > 2. Extend the assert to be correct. > - case 1: reg to mem spill, where we have a reserved input slot in cisc for memory edge > - case 2: reg to stackSlot spill, where both mach and cisc have no memory operand. > - other cases, with various register, stackSlot and memory inputs and outputs. We would have to find a general rule, and test it properly, which is not trivial because getting registers to spill is not easy to precisely provoke. > 3. Have platform dependent asserts. But also this makes testing harder. > > For now I went with 1. as it is simple and as far as I can see correct. > > Running tests on x64 (should not fail). **I need someone to help me with testing x86 (32bit)**. I only verified the reported test failure with a 32bit build. Looks reasonable to me. It would be good if @jatin-bhateja could also have a look. src/hotspot/share/opto/chaitin.cpp line 1740: > 1738: // operands, before and after spilling. > 1739: // (e.g. spilling "addFPR24_reg_mem" to "addFPR24_mem_cisc") > 1740: // In eigher case, there is no space in the inputs for the memory edge Suggestion: // In either case, there is no space in the inputs for the memory edge ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk19/pull/33 From thartmann at openjdk.org Mon Jun 20 08:04:58 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Jun 2022 08:04:58 GMT Subject: RFR: 8280320: C2: Loop opts are missing during OSR compilation In-Reply-To: References: Message-ID: On Fri, 17 Jun 2022 21:29:41 GMT, Vladimir Ivanov wrote: > After [JDK-8272330](https://bugs.openjdk.org/browse/JDK-8272330), OSR compilations may completely miss loop optimizations pass due to misleading profiling data. The cleanup changed how profile counts are scaled and it had surprising effect on OSR compilations. > > For a long-running loop it's common to have an MDO allocated during the first invocation while running in the loop. Also, OSR compilation may be scheduled while running the very first method invocation. In such case, `MethodData::invocation_counter() == 0` while `MethodData::backedge_counter() > 0`. Before JDK-8272330 went in, `ciMethod::scale_count()` took into account both `invocation_counter()` and `backedge_counter()`. Now `MethodData::invocation_counter()` is taken by `ciMethod::scale_count()` as is and it forces all counts to be unconditionally scaled to `1`. > > It misleads `IdealLoopTree::beautify_loops()` to believe there are no hot > backedges in the loop being compiled and `IdealLoopTree::split_outer_loop()` > doesn't kick in thus effectively blocking any further loop optimizations. > > Proposed fix bumps `MethodData::invocation_counter()` from `0` to `1` and > enables `ciMethod::scale_count()` to report sane numbers. > > Testing: > - hs-tier1 - hs-tier4 Looks reasonable. Maybe change the comment to something like: `// invocation counter may be slightly off because MDO is only allocated after first invocation` src/hotspot/share/ci/ciMethodData.cpp line 255: > 253: } > 254: > 255: _state = (mdo->is_mature() ? mature_state : immature_state); I think the surrounding brackets should/could be removed. 
------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk19/pull/38 From xgong at openjdk.org Mon Jun 20 08:07:58 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 20 Jun 2022 08:07:58 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v4] In-Reply-To: References: Message-ID: > VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. > > For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. > > Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. > > Here is an example for vector load and add reduction inside a loop: > > ptrue p0.s, vl8 ; mask generation > ld1w {z16.s}, p0/z, [x14] ; load vector > > ptrue p0.s, vl8 ; mask generation > uaddv d17, p0, z16.s ; add reduction > smov x14, v17.s[0] > > As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. 
> > Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. > > Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: > > Benchmark size Gain > Byte256Vector.ADDLanes 1024 0.999 > Byte256Vector.ANDLanes 1024 1.065 > Byte256Vector.MAXLanes 1024 1.064 > Byte256Vector.MINLanes 1024 1.062 > Byte256Vector.ORLanes 1024 1.072 > Byte256Vector.XORLanes 1024 1.041 > Short256Vector.ADDLanes 1024 1.017 > Short256Vector.ANDLanes 1024 1.044 > Short256Vector.MAXLanes 1024 1.049 > Short256Vector.MINLanes 1024 1.049 > Short256Vector.ORLanes 1024 1.089 > Short256Vector.XORLanes 1024 1.047 > Int256Vector.ADDLanes 1024 1.045 > Int256Vector.ANDLanes 1024 1.078 > Int256Vector.MAXLanes 1024 1.123 > Int256Vector.MINLanes 1024 1.129 > Int256Vector.ORLanes 1024 1.078 > Int256Vector.XORLanes 1024 1.072 > Long256Vector.ADDLanes 1024 1.059 > Long256Vector.ANDLanes 1024 1.101 > Long256Vector.MAXLanes 1024 1.079 > Long256Vector.MINLanes 1024 1.099 > Long256Vector.ORLanes 1024 1.098 > Long256Vector.XORLanes 1024 1.110 > Float256Vector.ADDLanes 1024 1.033 > Float256Vector.MAXLanes 1024 1.156 > Float256Vector.MINLanes 1024 1.151 > Double256Vector.ADDLanes 1024 1.062 > Double256Vector.MAXLanes 1024 1.145 > Double256Vector.MINLanes 1024 1.140 > > This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. 
So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: > > sxtw x14, w14 > whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Fix the ci build issue ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9037/files - new: https://git.openjdk.org/jdk/pull/9037/files/fc11338d..ba59b76e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9037&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9037&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9037.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9037/head:pull/9037 PR: https://git.openjdk.org/jdk/pull/9037 From epeter at openjdk.org Mon Jun 20 08:30:26 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 20 Jun 2022 08:30:26 GMT Subject: RFR: 8288467: remove memory_operand assert for spilled instructions [v2] In-Reply-To: References: Message-ID: > In [JDK-8282555](https://bugs.openjdk.org/browse/JDK-8282555) I added this assert, because on x64 this seems to always hold. But it turns out there are instructions on x86 (32bit) that violate this assumption. > > **Why it holds on x64** > It seems we only ever do one read or one write per instruction. Instructions with multiple memory operands are extremely rare. > 1. we spill from register to memory: we land in the if case, where the cisc node has an additional input slot for the memory edge. > 2. we spill from register to stackSlot: no additional input slot is reserved, we land in else case and add an additional precedence edge. > > **Why it is violated on x86 (32bit)** > We have additional cases that land in the else case. For example spilling `src1` from `addFPR24_reg_mem` to `addFPR24_mem_cisc`. 
> https://github.com/openjdk/jdk19/blob/53bf1bfdabb79b37afedd09051d057f9eea620f2/src/hotspot/cpu/x86/x86_32.ad#L10325-L10327 https://github.com/openjdk/jdk19/blob/53bf1bfdabb79b37afedd09051d057f9eea620f2/src/hotspot/cpu/x86/x86_32.ad#L10368-L10370 > We land in the else case, because both have 2 inputs, thus `oper_input_base() == 2`. > And both have memory operands, so the assert must fail. > > **Solutions** > 1. Remove the Assert, as it is incorrect. > 2. Extend the assert to be correct. > - case 1: reg to mem spill, where we have a reserved input slot in cisc for memory edge > - case 2: reg to stackSlot spill, where both mach and cisc have no memory operand. > - other cases, with various register, stackSlot and memory inputs and outputs. We would have to find a general rule, and test it properly, which is not trivial because getting registers to spill is not easy to precisely provoke. > 3. Have platform dependent asserts. But also this makes testing harder. > > For now I went with 1. as it is simple and as far as I can see correct. > > Running tests on x64 (should not fail). **I need someone to help me with testing x86 (32bit)**. I only verified the reported test failure with a 32bit build. 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/chaitin.cpp fix typo Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk19/pull/33/files - new: https://git.openjdk.org/jdk19/pull/33/files/d082aa91..951e91de Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=33&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=33&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk19/pull/33.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/33/head:pull/33 PR: https://git.openjdk.org/jdk19/pull/33 From shade at openjdk.org Mon Jun 20 10:17:57 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 20 Jun 2022 10:17:57 GMT Subject: RFR: 8288467: remove memory_operand assert for spilled instructions [v2] In-Reply-To: References: Message-ID: <7C4757QXMGdk25Cnso5QAQFzjuqw8rXSQiJBI7zLquA=.756291a2-d189-49e4-939b-df0450448b0b@github.com> On Mon, 20 Jun 2022 08:30:26 GMT, Emanuel Peter wrote: >> In [JDK-8282555](https://bugs.openjdk.org/browse/JDK-8282555) I added this assert, because on x64 this seems to always hold. But it turns out there are instructions on x86 (32bit) that violate this assumption. >> >> **Why it holds on x64** >> It seems we only ever do one read or one write per instruction. Instructions with multiple memory operands are extremely rare. >> 1. we spill from register to memory: we land in the if case, where the cisc node has an additional input slot for the memory edge. >> 2. we spill from register to stackSlot: no additional input slot is reserved, we land in else case and add an additional precedence edge. >> >> **Why it is violated on x86 (32bit)** >> We have additional cases that land in the else case. For example spilling `src1` from `addFPR24_reg_mem` to `addFPR24_mem_cisc`. 
>> https://github.com/openjdk/jdk19/blob/53bf1bfdabb79b37afedd09051d057f9eea620f2/src/hotspot/cpu/x86/x86_32.ad#L10325-L10327 https://github.com/openjdk/jdk19/blob/53bf1bfdabb79b37afedd09051d057f9eea620f2/src/hotspot/cpu/x86/x86_32.ad#L10368-L10370 >> We land in the else case, because both have 2 inputs, thus `oper_input_base() == 2`. >> And both have memory operands, so the assert must fail. >> >> **Solutions** >> 1. Remove the Assert, as it is incorrect. >> 2. Extend the assert to be correct. >> - case 1: reg to mem spill, where we have a reserved input slot in cisc for memory edge >> - case 2: reg to stackSlot spill, where both mach and cisc have no memory operand. >> - other cases, with various register, stackSlot and memory inputs and outputs. We would have to find a general rule, and test it properly, which is not trivial because getting registers to spill is not easy to precisely provoke. >> 3. Have platform dependent asserts. But also this makes testing harder. >> >> For now I went with 1. as it is simple and as far as I can see correct. >> >> Running tests on x64 (should not fail). **I need someone to help me with testing x86 (32bit)**. I only verified the reported test failure with a 32bit build. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/opto/chaitin.cpp > > fix typo > > Co-authored-by: Tobias Hartmann Linux x86_32 fastdebug `tier1` and `tier2` pass. ------------- Marked as reviewed by shade (Reviewer). 
PR: https://git.openjdk.org/jdk19/pull/33 From epeter at openjdk.org Mon Jun 20 10:40:54 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 20 Jun 2022 10:40:54 GMT Subject: RFR: 8287801: Fix test-bugs related to stress flags In-Reply-To: References: Message-ID: On Mon, 20 Jun 2022 05:57:28 GMT, Tobias Hartmann wrote: >> I recently ran many tests with additional stress flags. While there were a few bugs I found, most of the issues were test bugs. >> >> I found a list of tests that are problematic with specific stress flags, I adjusted the tests so that they can now be run with the flag. Often I just fix the flag at the default value, so that setting it from the outside does not affect the test. >> >> Below I explain for each test how and why I adjusted the test. >> >> - test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyNoInitDeopt.java >> - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` >> - Disabling traps by setting limit of traps to zero, means some optimistic optimizations are not made, and can therefore not lead to deoptimization. The test expects deoptimization due to traps, so we need to have them on. >> - test/hotspot/jtreg/compiler/c2/cr7200264/TestDriver.java >> - used by TestSSE2IntVect.java and TestSSE4IntVect.java >> - test/hotspot/jtreg/compiler/c2/cr7200264/TestSSE2IntVect.java >> - Problem Flags: `-XX:StressLongCountedLoop=2000000` >> - Test checks the IR, and if we convert loops to long loops, some operations will not show up anymore, the test fails. Disable StressLongCountedLoop. >> - test/hotspot/jtreg/compiler/c2/cr7200264/TestSSE4IntVect.java >> - See TestSSE2IntVect.java >> - test/hotspot/jtreg/compiler/c2/irTests/blackhole/BlackholeStoreStoreEATest.java >> - Problem Flags: `-XX:-UseTLAB` >> - No thread local allocation (TLAB) means IR is changed. Test checks for MemBarStoreStore, which is missing without TLAB. Solution: always have TLAB on. 
>> - test/hotspot/jtreg/compiler/cha/AbstractRootMethod.java >> - Problem Flags: `-XX:+StressMethodHandleLinkerInlining` >> - Messes with recompilation, makes assert fail that expects recompilation. Must disable flag. >> - test/hotspot/jtreg/compiler/cha/DefaultRootMethod.java >> - see AbstractRootMethod.java >> - test/hotspot/jtreg/compiler/intrinsics/klass/CastNullCheckDroppingsTest.java >> - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` >> - Need traps, otherwise some optimistic optimisations are not made, and then they also are not trapped and deoptimized. >> - Problem Flags: `-XX:TypeProfileLevel=222` >> - Profiling also messes with optimizations / deoptimization. >> - Problem Flags: `-XX:+StressReflectiveCode` >> - Messes with types at allocation, which messes with optimizations. >> - Problem Flags: `-XX:-UncommonNullCast` >> - Is required for trapping in null checks. >> - Problem Flags: `-XX:+StressMethodHandleLinkerInlining` >> - Messes with inlining / optimization - turn it off. >> - test/hotspot/jtreg/compiler/jvmci/compilerToVM/IsMatureVsReprofileTest.java >> - Problem Flags: `-XX:Tier4BackEdgeThreshold=1 -Xbatch -XX:-TieredCompilation` >> - Lead to OSR compilation in loop calling `testMethod`, which is expected to be compiled. But with the OSR compilation, that function is inlined, and never compiled. Solution was to make sure we only compile `testMethod`. >> - test/hotspot/jtreg/compiler/jvmci/compilerToVM/ReprofileTest.java >> - Problem Flags: `-XX:TypeProfileLevel=222` >> - Changing profile flags messes with test, which assumes default behavior. >> - test/hotspot/jtreg/compiler/profiling/TestTypeProfiling.java >> - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` >> - Need traps to check for optimistic optimizations. 
>> - test/hotspot/jtreg/compiler/rangechecks/TestExplicitRangeChecks.java >> - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` >> - Need traps to check for optimistic optimizations. >> - test/hotspot/jtreg/compiler/rangechecks/TestLongRangeCheck.java >> - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` >> - Need traps to check for optimistic optimizations. >> - test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckSmearing.java >> - Problem Flags: `-XX:TieredStopAtLevel=3 -XX:+StressLoopInvariantCodeMotion` >> - Test expects to be run at compilation tier 4 / C2, so must fix it at that in requirement. >> - test/hotspot/jtreg/compiler/uncommontrap/Decompile.java >> - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0 -XX:PerBytecodeTrapLimit=0` >> - The tests if we trap and decompile after we call member functions of a class that we did not use before. If we disable traps, then internally it uses a virtual call, and no deoptimization is required - but the test expects trapping and deoptimization. Solution: set trap limits to default. >> - Problem Flags: `-XX:TypeProfileLevel=222` >> - Changing profiling behavior also messes with deoptimization - disable it. >> - test/hotspot/jtreg/compiler/uncommontrap/TestUnstableIfTrap.java >> - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` >> - Test expects traps, so we must ensure the limits are at default. > > test/hotspot/jtreg/compiler/cha/AbstractRootMethod.java line 41: > >> 39: * -Xbatch -Xmixed -XX:+WhiteBoxAPI >> 40: * -XX:-TieredCompilation >> 41: * -XX:-StressMethodHandleLinkerInlining > > Wouldn't this fail with a product VM build because the flag is debug only? 
I added `-XX:+IgnoreUnrecognizedVMOptions` to 3 files; in the other cases it was already there. With that option, a develop flag used on a product build is simply ignored. ------------- PR: https://git.openjdk.org/jdk/pull/9186 From epeter at openjdk.org Mon Jun 20 10:45:57 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 20 Jun 2022 10:45:57 GMT Subject: RFR: 8287801: Fix test-bugs related to stress flags [v2] In-Reply-To: References: Message-ID: > I recently ran many tests with additional stress flags. While there were a few bugs I found, most of the issues were test bugs. > > I found a list of tests that are problematic with specific stress flags, I adjusted the tests so that they can now be run with the flag. Often I just fix the flag at the default value, so that setting it from the outside does not affect the test. > > Below I explain for each test how and why I adjusted the test. > > - test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyNoInitDeopt.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Disabling traps by setting limit of traps to zero, means some optimistic optimizations are not made, and can therefore not lead to deoptimization. The test expects deoptimization due to traps, so we need to have them on. > - test/hotspot/jtreg/compiler/c2/cr7200264/TestDriver.java > - used by TestSSE2IntVect.java and TestSSE4IntVect.java > - test/hotspot/jtreg/compiler/c2/cr7200264/TestSSE2IntVect.java > - Problem Flags: `-XX:StressLongCountedLoop=2000000` > - Test checks the IR, and if we convert loops to long loops, some operations will not show up anymore, the test fails. Disable StressLongCountedLoop. > - test/hotspot/jtreg/compiler/c2/cr7200264/TestSSE4IntVect.java > - See TestSSE2IntVect.java > - test/hotspot/jtreg/compiler/c2/irTests/blackhole/BlackholeStoreStoreEATest.java > - Problem Flags: `-XX:-UseTLAB` > - No thread local allocation (TLAB) means IR is changed. 
Test checks for MemBarStoreStore, which is missing without TLAB. Solution: always have TLAB on. > - test/hotspot/jtreg/compiler/cha/AbstractRootMethod.java > - Problem Flags: `-XX:+StressMethodHandleLinkerInlining` > - Messes with recompilation, makes assert fail that expects recompilation. Must disable flag. > - test/hotspot/jtreg/compiler/cha/DefaultRootMethod.java > - see AbstractRootMethod.java > - test/hotspot/jtreg/compiler/intrinsics/klass/CastNullCheckDroppingsTest.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Need traps, otherwise some optimistic optimisations are not made, and then they also are not trapped and deoptimized. > - Problem Flags: `-XX:TypeProfileLevel=222` > - Profiling also messes with optimizations / deoptimization. > - Problem Flags: `-XX:+StressReflectiveCode` > - Messes with types at allocation, which messes with optimizations. > - Problem Flags: `-XX:-UncommonNullCast` > - Is required for trapping in null checks. > - Problem Flags: `-XX:+StressMethodHandleLinkerInlining` > - Messes with inlining / optimization - turn it off. > - test/hotspot/jtreg/compiler/jvmci/compilerToVM/IsMatureVsReprofileTest.java > - Problem Flags: `-XX:Tier4BackEdgeThreshold=1 -Xbatch -XX:-TieredCompilation` > - Lead to OSR compilation in loop calling `testMethod`, which is expected to be compiled. But with the OSR compilation, that function is inlined, and never compiled. Solution was to make sure we only compile `testMethod`. > - test/hotspot/jtreg/compiler/jvmci/compilerToVM/ReprofileTest.java > - Problem Flags: `-XX:TypeProfileLevel=222` > - Changing profile flags messes with test, which assumes default behavior. > - test/hotspot/jtreg/compiler/profiling/TestTypeProfiling.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Need traps to check for optimistic optimizations. 
> - test/hotspot/jtreg/compiler/rangechecks/TestExplicitRangeChecks.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Need traps to check for optimistic optimizations. > - test/hotspot/jtreg/compiler/rangechecks/TestLongRangeCheck.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Need traps to check for optimistic optimizations. > - test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckSmearing.java > - Problem Flags: `-XX:TieredStopAtLevel=3 -XX:+StressLoopInvariantCodeMotion` > - Test expects to be run at compilation tier 4 / C2, so must fix it at that in requirement. > - test/hotspot/jtreg/compiler/uncommontrap/Decompile.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0 -XX:PerBytecodeTrapLimit=0` > - The tests if we trap and decompile after we call member functions of a class that we did not use before. If we disable traps, then internally it uses a virtual call, and no deoptimization is required - but the test expects trapping and deoptimization. Solution: set trap limits to default. > - Problem Flags: `-XX:TypeProfileLevel=222` > - Changing profiling behavior also messes with deoptimization - disable it. > - test/hotspot/jtreg/compiler/uncommontrap/TestUnstableIfTrap.java > - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` > - Test expects traps, so we must ensure the limits are at default. 
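The approach described in the list above, fixing a flag at the value the test depends on so that stress flags supplied from outside cannot change the outcome, could look roughly like the following jtreg test. This is only a sketch, not code from the patch: the class name `PinnedFlagExample` and the helper `vmFlag` are hypothetical, the `@run` line reuses flags named in this thread, and it assumes that flags listed in `@run` take precedence over flags passed in by the test runner.

```java
/*
 * @test
 * @summary Hypothetical sketch: pin the flags a test relies on to fixed values.
 * @modules jdk.management
 * @run main/othervm -XX:+IgnoreUnrecognizedVMOptions
 *                   -XX:-StressMethodHandleLinkerInlining
 *                   -XX:+UseTLAB
 *                   PinnedFlagExample
 */
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class PinnedFlagExample {
    // Reads the current value of a VM flag through the HotSpot diagnostic MXBean.
    // (Hypothetical helper; only flags visible in the running build can be queried.)
    static String vmFlag(String name) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        // The test body would depend on TLAB being enabled, so check the pinned value first.
        String useTlab = vmFlag("UseTLAB");
        if (!"true".equals(useTlab)) {
            throw new RuntimeException("UseTLAB expected to be true, got " + useTlab);
        }
        System.out.println("UseTLAB=" + useTlab);
    }
}
```

`-XX:+IgnoreUnrecognizedVMOptions` is listed first so that the debug-only `StressMethodHandleLinkerInlining` flag does not make a product build refuse to start, which is the concern raised in the review.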
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: adding IgnoreUnrecognizedVMOptions to some test files, so product builds do not fail with debug flags ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9186/files - new: https://git.openjdk.org/jdk/pull/9186/files/1c1da404..6ee74ec3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9186&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9186&range=00-01 Stats: 5 lines in 3 files changed: 3 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9186.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9186/head:pull/9186 PR: https://git.openjdk.org/jdk/pull/9186 From thartmann at openjdk.org Mon Jun 20 12:56:57 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 20 Jun 2022 12:56:57 GMT Subject: RFR: 8287801: Fix test-bugs related to stress flags [v2] In-Reply-To: References: Message-ID: <1_I7cHMAX_nMrOFLqQKkdMCFYktYWt8SNH7D7Lc8ddk=.fa8bd652-1da1-4d92-89c1-622051c99c39@github.com> On Mon, 20 Jun 2022 10:45:57 GMT, Emanuel Peter wrote: >> I recently ran many tests with additional stress flags. While there were a few bugs I found, most of the issues were test bugs. >> >> I found a list of tests that are problematic with specific stress flags, I adjusted the tests so that they can now be run with the flag. Often I just fix the flag at the default value, so that setting it from the outside does not affect the test. >> >> Below I explain for each test how and why I adjusted the test. >> >> - test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyNoInitDeopt.java >> - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` >> - Disabling traps by setting limit of traps to zero, means some optimistic optimizations are not made, and can therefore not lead to deoptimization. The test expects deoptimization due to traps, so we need to have them on. 
>> - test/hotspot/jtreg/compiler/c2/cr7200264/TestDriver.java >> - used by TestSSE2IntVect.java and TestSSE4IntVect.java >> - test/hotspot/jtreg/compiler/c2/cr7200264/TestSSE2IntVect.java >> - Problem Flags: `-XX:StressLongCountedLoop=2000000` >> - Test checks the IR, and if we convert loops to long loops, some operations will not show up anymore, the test fails. Disable StressLongCountedLoop. >> - test/hotspot/jtreg/compiler/c2/cr7200264/TestSSE4IntVect.java >> - See TestSSE2IntVect.java >> - test/hotspot/jtreg/compiler/c2/irTests/blackhole/BlackholeStoreStoreEATest.java >> - Problem Flags: `-XX:-UseTLAB` >> - No thread local allocation (TLAB) means IR is changed. Test checks for MemBarStoreStore, which is missing without TLAB. Solution: always have TLAB on. >> - test/hotspot/jtreg/compiler/cha/AbstractRootMethod.java >> - Problem Flags: `-XX:+StressMethodHandleLinkerInlining` >> - Messes with recompilation, makes assert fail that expects recompilation. Must disable flag. >> - test/hotspot/jtreg/compiler/cha/DefaultRootMethod.java >> - see AbstractRootMethod.java >> - test/hotspot/jtreg/compiler/intrinsics/klass/CastNullCheckDroppingsTest.java >> - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` >> - Need traps, otherwise some optimistic optimisations are not made, and then they also are not trapped and deoptimized. >> - Problem Flags: `-XX:TypeProfileLevel=222` >> - Profiling also messes with optimizations / deoptimization. >> - Problem Flags: `-XX:+StressReflectiveCode` >> - Messes with types at allocation, which messes with optimizations. >> - Problem Flags: `-XX:-UncommonNullCast` >> - Is required for trapping in null checks. >> - Problem Flags: `-XX:+StressMethodHandleLinkerInlining` >> - Messes with inlining / optimization - turn it off. 
>> - test/hotspot/jtreg/compiler/jvmci/compilerToVM/IsMatureVsReprofileTest.java >> - Problem Flags: `-XX:Tier4BackEdgeThreshold=1 -Xbatch -XX:-TieredCompilation` >> - Lead to OSR compilation in loop calling `testMethod`, which is expected to be compiled. But with the OSR compilation, that function is inlined, and never compiled. Solution was to make sure we only compile `testMethod`. >> - test/hotspot/jtreg/compiler/jvmci/compilerToVM/ReprofileTest.java >> - Problem Flags: `-XX:TypeProfileLevel=222` >> - Changing profile flags messes with test, which assumes default behavior. >> - test/hotspot/jtreg/compiler/profiling/TestTypeProfiling.java >> - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` >> - Need traps to check for optimistic optimizations. >> - test/hotspot/jtreg/compiler/rangechecks/TestExplicitRangeChecks.java >> - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` >> - Need traps to check for optimistic optimizations. >> - test/hotspot/jtreg/compiler/rangechecks/TestLongRangeCheck.java >> - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` >> - Need traps to check for optimistic optimizations. >> - test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckSmearing.java >> - Problem Flags: `-XX:TieredStopAtLevel=3 -XX:+StressLoopInvariantCodeMotion` >> - Test expects to be run at compilation tier 4 / C2, so must fix it at that in requirement. >> - test/hotspot/jtreg/compiler/uncommontrap/Decompile.java >> - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0 -XX:PerBytecodeTrapLimit=0` >> - The tests if we trap and decompile after we call member functions of a class that we did not use before. If we disable traps, then internally it uses a virtual call, and no deoptimization is required - but the test expects trapping and deoptimization. 
Solution: set trap limits to default. >> - Problem Flags: `-XX:TypeProfileLevel=222` >> - Changing profiling behavior also messes with deoptimization - disable it. >> - test/hotspot/jtreg/compiler/uncommontrap/TestUnstableIfTrap.java >> - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0` >> - Test expects traps, so we must ensure the limits are at default. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > adding IgnoreUnrecognizedVMOptions to some test files, so product builds do not fail with debug flags Marked as reviewed by thartmann (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9186 From jbhateja at openjdk.org Mon Jun 20 13:43:02 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 20 Jun 2022 13:43:02 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v4] In-Reply-To: References: Message-ID: On Mon, 20 Jun 2022 08:07:58 GMT, Xiaohong Gong wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. 
Besides, the mask generation is loop invariant which could be hoisted outside of the loop. >> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. >> >> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. >> >> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: >> >> Benchmark size Gain >> Byte256Vector.ADDLanes 1024 0.999 >> Byte256Vector.ANDLanes 1024 1.065 >> Byte256Vector.MAXLanes 1024 1.064 >> Byte256Vector.MINLanes 1024 1.062 >> Byte256Vector.ORLanes 1024 1.072 >> Byte256Vector.XORLanes 1024 1.041 >> Short256Vector.ADDLanes 1024 1.017 >> Short256Vector.ANDLanes 1024 1.044 >> Short256Vector.MAXLanes 1024 1.049 >> Short256Vector.MINLanes 1024 1.049 >> Short256Vector.ORLanes 1024 1.089 >> Short256Vector.XORLanes 1024 1.047 >> Int256Vector.ADDLanes 1024 1.045 >> Int256Vector.ANDLanes 1024 1.078 >> Int256Vector.MAXLanes 1024 1.123 >> Int256Vector.MINLanes 1024 1.129 >> Int256Vector.ORLanes 1024 1.078 >> Int256Vector.XORLanes 1024 1.072 >> Long256Vector.ADDLanes 1024 1.059 >> Long256Vector.ANDLanes 1024 1.101 >> Long256Vector.MAXLanes 1024 1.079 >> Long256Vector.MINLanes 1024 1.099 >> Long256Vector.ORLanes 1024 1.098 >> Long256Vector.XORLanes 1024 1.110 >> Float256Vector.ADDLanes 1024 1.033 >> Float256Vector.MAXLanes 1024 1.156 >> Float256Vector.MINLanes 1024 1.151 >> 
Double256Vector.ADDLanes 1024 1.062 >> Double256Vector.MAXLanes 1024 1.145 >> Double256Vector.MINLanes 1024 1.140 >> >> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: >> >> sxtw x14, w14 >> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Fix the ci build issue Hi @XiaohongGong, thanks for your explanations; the common IR changes look good to me. ------------- Marked as reviewed by jbhateja (Committer). PR: https://git.openjdk.org/jdk/pull/9037 From jbhateja at openjdk.org Mon Jun 20 13:43:05 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 20 Jun 2022 13:43:05 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3] In-Reply-To: References: Message-ID: On Mon, 20 Jun 2022 01:15:35 GMT, Xiaohong Gong wrote: >> src/hotspot/share/opto/vectornode.cpp line 1669: >> >>> 1667: if (Matcher::vector_needs_partial_operations(this, vt)) { >>> 1668: return VectorNode::try_to_gen_masked_vector(phase, this, vt); >>> 1669: } >> >> This is a parent node of TrueCount/FirstTrue/LastTrue and MaskToLong which perform mask querying operation on concrete predicate operands, a transformation here looks redundant to me. > > The main reason to add the transformation here is: the FirstTrue needs the reference to the real vector length for SVE, that we need to generate a predicate when the vector length is smaller than the max vector size. Please check the changes of `partial_op_sve_needed` in aarch64_sve.ad. The real vector length can be obtained from the incoming input mask node. 
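To make the point about FirstTrue and the real vector length concrete, here is a small scalar model in Java. It is only an illustration under stated assumptions, not the SVE code generation and not the C2 IR: the class `PartialFirstTrueModel`, the `boolean[]` stand-in for a predicate register, and the lane counts (16 hardware lanes, 8 logical lanes, matching a 256-bit int species on 512-bit hardware) are all hypothetical. It relies on the documented `VectorMask.firstTrue()` contract of returning the vector length when no lane is set.

```java
public class PartialFirstTrueModel {
    // Scalar model of a mask query: returns the index of the first set lane
    // among the first `lanes` lanes, or `lanes` itself if none of them is set.
    static int firstTrue(boolean[] predicate, int lanes) {
        for (int i = 0; i < lanes; i++) {
            if (predicate[i]) {
                return i;
            }
        }
        return lanes;
    }

    public static void main(String[] args) {
        int hardwareLanes = 16; // assumed: 512-bit registers holding 32-bit ints
        int logicalLanes = 8;   // assumed: 256-bit int species, a partial operation

        boolean[] predicate = new boolean[hardwareLanes]; // no lane set

        // Scanning the whole hardware register gives the wrong "none set" answer
        // for the 8-lane species; bounding the scan by the real length fixes it.
        System.out.println(firstTrue(predicate, hardwareLanes)); // prints 16
        System.out.println(firstTrue(predicate, logicalLanes));  // prints 8

        predicate[3] = true;
        System.out.println(firstTrue(predicate, logicalLanes));  // prints 3
    }
}
```

In this model the result only differs when no lane is set, which is why the query needs to know the real vector length in some form; whether that length comes from a separately generated predicate or from the incoming mask node is the question being discussed in the review.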
-------------

PR: https://git.openjdk.org/jdk/pull/9037

From epeter at openjdk.org Mon Jun 20 14:32:16 2022
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 20 Jun 2022 14:32:16 GMT
Subject: RFR: 8287801: Fix test-bugs related to stress flags [v2]
In-Reply-To: <1_I7cHMAX_nMrOFLqQKkdMCFYktYWt8SNH7D7Lc8ddk=.fa8bd652-1da1-4d92-89c1-622051c99c39@github.com>
References: <1_I7cHMAX_nMrOFLqQKkdMCFYktYWt8SNH7D7Lc8ddk=.fa8bd652-1da1-4d92-89c1-622051c99c39@github.com>
Message-ID:

On Mon, 20 Jun 2022 12:53:38 GMT, Tobias Hartmann wrote:

>> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   adding IgnoreUnrecognizedVMOptions to some test files, so product builds do not fail with debug flags
>
> Marked as reviewed by thartmann (Reviewer).

Thanks @TobiHartmann and @chhagedorn for the help, feedback and reviews!

-------------

PR: https://git.openjdk.org/jdk/pull/9186

From epeter at openjdk.org Mon Jun 20 14:32:17 2022
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 20 Jun 2022 14:32:17 GMT
Subject: Integrated: 8287801: Fix test-bugs related to stress flags
In-Reply-To:
References:
Message-ID:

On Thu, 16 Jun 2022 14:44:19 GMT, Emanuel Peter wrote:

> I recently ran many tests with additional stress flags. While I found a few real bugs, most of the issues were test bugs.
>
> I found a list of tests that are problematic with specific stress flags, and I adjusted the tests so that they can now be run with those flags. Often I just fix the flag at its default value, so that setting it from the outside does not affect the test.
>
> Below I explain for each test how and why I adjusted the test.
>
> - test/hotspot/jtreg/compiler/arraycopy/TestArrayCopyNoInitDeopt.java
>   - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0`
>     - Disabling traps by setting limit of traps to zero, means some optimistic optimizations are not made, and can therefore not lead to deoptimization. The test expects deoptimization due to traps, so we need to have them on.
> - test/hotspot/jtreg/compiler/c2/cr7200264/TestDriver.java
>   - used by TestSSE2IntVect.java and TestSSE4IntVect.java
> - test/hotspot/jtreg/compiler/c2/cr7200264/TestSSE2IntVect.java
>   - Problem Flags: `-XX:StressLongCountedLoop=2000000`
>     - Test checks the IR, and if we convert loops to long loops, some operations will not show up anymore, the test fails. Disable StressLongCountedLoop.
> - test/hotspot/jtreg/compiler/c2/cr7200264/TestSSE4IntVect.java
>   - See TestSSE2IntVect.java
> - test/hotspot/jtreg/compiler/c2/irTests/blackhole/BlackholeStoreStoreEATest.java
>   - Problem Flags: `-XX:-UseTLAB`
>     - No thread local allocation (TLAB) means IR is changed. Test checks for MemBarStoreStore, which is missing without TLAB. Solution: always have TLAB on.
> - test/hotspot/jtreg/compiler/cha/AbstractRootMethod.java
>   - Problem Flags: `-XX:+StressMethodHandleLinkerInlining`
>     - Messes with recompilation, makes assert fail that expects recompilation. Must disable flag.
> - test/hotspot/jtreg/compiler/cha/DefaultRootMethod.java
>   - see AbstractRootMethod.java
> - test/hotspot/jtreg/compiler/intrinsics/klass/CastNullCheckDroppingsTest.java
>   - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0`
>     - Need traps, otherwise some optimistic optimisations are not made, and then they also are not trapped and deoptimized.
>   - Problem Flags: `-XX:TypeProfileLevel=222`
>     - Profiling also messes with optimizations / deoptimization.
>   - Problem Flags: `-XX:+StressReflectiveCode`
>     - Messes with types at allocation, which messes with optimizations.
>   - Problem Flags: `-XX:-UncommonNullCast`
>     - Is required for trapping in null checks.
>   - Problem Flags: `-XX:+StressMethodHandleLinkerInlining`
>     - Messes with inlining / optimization - turn it off.
> - test/hotspot/jtreg/compiler/jvmci/compilerToVM/IsMatureVsReprofileTest.java
>   - Problem Flags: `-XX:Tier4BackEdgeThreshold=1 -Xbatch -XX:-TieredCompilation`
>     - Lead to OSR compilation in loop calling `testMethod`, which is expected to be compiled. But with the OSR compilation, that function is inlined, and never compiled. Solution was to make sure we only compile `testMethod`.
> - test/hotspot/jtreg/compiler/jvmci/compilerToVM/ReprofileTest.java
>   - Problem Flags: `-XX:TypeProfileLevel=222`
>     - Changing profile flags messes with test, which assumes default behavior.
> - test/hotspot/jtreg/compiler/profiling/TestTypeProfiling.java
>   - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0`
>     - Need traps to check for optimistic optimizations.
> - test/hotspot/jtreg/compiler/rangechecks/TestExplicitRangeChecks.java
>   - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0`
>     - Need traps to check for optimistic optimizations.
> - test/hotspot/jtreg/compiler/rangechecks/TestLongRangeCheck.java
>   - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0`
>     - Need traps to check for optimistic optimizations.
> - test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckSmearing.java
>   - Problem Flags: `-XX:TieredStopAtLevel=3 -XX:+StressLoopInvariantCodeMotion`
>     - Test expects to be run at compilation tier 4 / C2, so must fix it at that in requirement.
> - test/hotspot/jtreg/compiler/uncommontrap/Decompile.java
>   - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0 -XX:PerBytecodeTrapLimit=0`
>     - Tests whether we trap and decompile after we call member functions of a class that we did not use before. If we disable traps, then internally it uses a virtual call, and no deoptimization is required - but the test expects trapping and deoptimization. Solution: set trap limits to default.
>   - Problem Flags: `-XX:TypeProfileLevel=222`
>     - Changing profiling behavior also messes with deoptimization - disable it.
> - test/hotspot/jtreg/compiler/uncommontrap/TestUnstableIfTrap.java
>   - Problem Flags: `-XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0`
>     - Test expects traps, so we must ensure the limits are at default.

This pull request has now been integrated.

Changeset: 302a6c06
Author:    Emanuel Peter
URL:       https://git.openjdk.org/jdk/commit/302a6c068dcbb176381b1535baf25547079c9b06
Stats:     30 lines in 16 files changed: 26 ins; 0 del; 4 mod

8287801: Fix test-bugs related to stress flags

Reviewed-by: chagedorn, thartmann

-------------

PR: https://git.openjdk.org/jdk/pull/9186

From xliu at openjdk.org Mon Jun 20 21:15:56 2022
From: xliu at openjdk.org (Xin Liu)
Date: Mon, 20 Jun 2022 21:15:56 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v12]
In-Reply-To:
References:
Message-ID:

On Mon, 20 Jun 2022 07:29:27 GMT, Tobias Hartmann wrote:

>> Xin Liu has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   monior change for code style.
>
> src/hotspot/share/opto/parse.hpp line 607:
>
>> 605:
>> 606: // Specialized uncommon_trap of unstable_if, we have 2 optimizations for them:
>> 607: //  1. suppress trivial Unstable_If traps
>
> Where is this done?

I have another patch based on this one. I will post it soon.
-------------

PR: https://git.openjdk.org/jdk/pull/8545

From xliu at openjdk.org Mon Jun 20 21:42:54 2022
From: xliu at openjdk.org (Xin Liu)
Date: Mon, 20 Jun 2022 21:42:54 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v12]
In-Reply-To:
References:
Message-ID:

On Mon, 20 Jun 2022 07:32:02 GMT, Tobias Hartmann wrote:

>> Xin Liu has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   monior change for code style.
>
> src/hotspot/share/opto/ifnode.cpp line 842:
>
>> 840:     if (!igvn->C->too_many_traps(dom_method, dom_bci, Deoptimization::Reason_unstable_fused_if) &&
>> 841:         !igvn->C->too_many_traps(dom_method, dom_bci, Deoptimization::Reason_range_check) &&
>> 842:         igvn->C->remove_unstable_if_trap(dom_unc)) {
>
> This should be moved to `IfNode::merge_uncommon_traps`.

I tried that in the first place. It turns out that the situation is complex. The following 3 functions take place in order.

has_only_uncommon_traps() => fold_compares_helper() => merge_uncommon_traps()

`IfNode::has_only_uncommon_traps` is the last stop before modifying code. If we moved the predicate "remove_unstable_if_trap(dom_unc)" to `merge_uncommon_traps()`, it would be too late when it returns true. We would have to roll back the code change in `fold_compares_helper`.

-------------

PR: https://git.openjdk.org/jdk/pull/8545

From xliu at openjdk.org Mon Jun 20 22:26:42 2022
From: xliu at openjdk.org (Xin Liu)
Date: Mon, 20 Jun 2022 22:26:42 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v12]
In-Reply-To:
References:
Message-ID: <8IQeJswqjwNU_raoWW4Q0gfOCGMWaCNYsm48BXe2OaI=.0cb1ba57-2433-473c-b86f-dd4450dcf307@github.com>

On Mon, 20 Jun 2022 07:24:51 GMT, Tobias Hartmann wrote:

>> Xin Liu has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   monior change for code style.
>
> src/hotspot/share/opto/compile.cpp line 1929:
>
>> 1927:     int next_bci = trap->next_bci();
>> 1928:
>> 1929:     if (next_bci != -1 && !trap->modified()) {
>
> How can it be already modified? We are only processing each trap once, right?

Correct. There is only one pass right now, so it's redundant. I discussed this with Vladimir Kozlov. Technically, we can run it after "inline_incrementally()" as well. I think we can add it when we prove this optimization is stable. "!trap->modified()" can dedup for that.

> src/hotspot/share/opto/parse.hpp line 609:
>
>> 607: //  1. suppress trivial Unstable_If traps
>> 608: //  2. use next_bci of _path to update live locals.
>> 609: class UnstableIfTrap {
>
> What about moving this information into `CallStaticJavaNode`?

I think CallStaticJavaNode is popular. uncommon_trap/unstable_if is just a special case. That's why I factored it out and use a dedicated class for it.

-------------

PR: https://git.openjdk.org/jdk/pull/8545

From xliu at openjdk.org Mon Jun 20 22:56:50 2022
From: xliu at openjdk.org (Xin Liu)
Date: Mon, 20 Jun 2022 22:56:50 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v12]
In-Reply-To:
References:
Message-ID:

On Mon, 20 Jun 2022 07:11:13 GMT, Tobias Hartmann wrote:

>> Xin Liu has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   monior change for code style.
>
> src/hotspot/share/opto/parse.hpp line 643:
>
>> 641:   // if _path has only one predecessor, it is trivial if this block is small(1~2 bytecodes)
>> 642:   // or if _path has more than one predecessor and has been parsed, _unc does not mask out any real code.
>> 643:   bool is_trivial() const {
>
> But these properties are not checked by the method, right?
>
> Also, the code is only used in debug, should it be guarded?

> Looks good overall. Some comments/questions:
>
> * Why can't we remove traps that have been modified?
>

In previous revision, I did remove them.
Vladimir discovered a special case in tier2. C2 postponed fold-compares until IGVN2. Even though this corner case has been solved in JDK-8287840, I think it's a good idea to leave it as a fallback.

> * I'm wondering how useful `Compile::print_statistics()` really is. Is it worth extending it? Is anyone using it?
>

Fair enough. I used them to see how many unstable_if traps are 'trivial'. I think I can remove them now.

> * Do you need to check for unstable if traps in `Node::destruct`?

Thanks for the heads-up. Technically speaking, yes. In reality, I don't think we call "Node::destruct" for an uncommon_trap. I will patch it up.

-------------

PR: https://git.openjdk.org/jdk/pull/8545

From duke at openjdk.org Mon Jun 20 23:22:00 2022
From: duke at openjdk.org (Yi-Fan Tsai)
Date: Mon, 20 Jun 2022 23:22:00 GMT
Subject: RFR: 8263377: Store method handle linkers in the 'non-nmethods' heap [v2]
In-Reply-To: <9C9P4NROYxVuWTJejFnYwQOGPovUstzWACIboIQWTDw=.2977b16b-c175-4774-97af-60071b805f46@github.com>
References: <9C9P4NROYxVuWTJejFnYwQOGPovUstzWACIboIQWTDw=.2977b16b-c175-4774-97af-60071b805f46@github.com>
Message-ID:

On Sat, 4 Jun 2022 02:02:33 GMT, Dean Long wrote:

>> Yi-Fan Tsai has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Remove dead codes
>>
>>   remove unused argument of NativeJump::check_verified_entry_alignment
>>   remove unused argument of NativeJumip::patch_verified_entry
>>   remove dead codes in SharedRuntime::generate_method_handle_intrinsic_wrapper
>
> src/hotspot/share/ci/ciMethod.cpp line 1146:
>
>> 1144:   CodeBlob* code = get_Method()->code();
>> 1145:   if (code != NULL && code->is_compiled()) {
>> 1146:     code->as_compiled_method()->log_identity(log);
>
> Doesn't this change the log output?

Yes, the output for MH intrinsics will look like the cases where no code is set. The removed parts are the compile ID, compiler, and compile level.
They are all constants for MH intrinsics, not providing much information.

-------------

PR: https://git.openjdk.org/jdk/pull/8760

From iveresov at openjdk.org Tue Jun 21 02:24:50 2022
From: iveresov at openjdk.org (Igor Veresov)
Date: Tue, 21 Jun 2022 02:24:50 GMT
Subject: RFR: 8280320: C2: Loop opts are missing during OSR compilation
In-Reply-To:
References:
Message-ID:

On Fri, 17 Jun 2022 21:29:41 GMT, Vladimir Ivanov wrote:

> After [JDK-8272330](https://bugs.openjdk.org/browse/JDK-8272330), OSR compilations may completely miss loop optimizations pass due to misleading profiling data. The cleanup changed how profile counts are scaled and it had surprising effect on OSR compilations.
>
> For a long-running loop it's common to have an MDO allocated during the first invocation while running in the loop. Also, OSR compilation may be scheduled while running the very first method invocation. In such case, `MethodData::invocation_counter() == 0` while `MethodData::backedge_counter() > 0`. Before JDK-8272330 went in, `ciMethod::scale_count()` took into account both `invocation_counter()` and `backedge_counter()`. Now `MethodData::invocation_counter()` is taken by `ciMethod::scale_count()` as is and it forces all counts to be unconditionally scaled to `1`.
>
> It misleads `IdealLoopTree::beautify_loops()` to believe there are no hot
> backedges in the loop being compiled and `IdealLoopTree::split_outer_loop()`
> doesn't kick in thus effectively blocking any further loop optimizations.
>
> Proposed fix bumps `MethodData::invocation_counter()` from `0` to `1` and
> enables `ciMethod::scale_count()` to report sane numbers.
>
> Testing:
>   - hs-tier1 - hs-tier4

Makes sense. Thanks for fixing this. Could you please add a comment explaining that this is a situation happening with OSR?

-------------

Marked as reviewed by iveresov (Reviewer).

PR: https://git.openjdk.org/jdk19/pull/38

From duke at openjdk.org Tue Jun 21 05:43:43 2022
From: duke at openjdk.org (Yi-Fan Tsai)
Date: Tue, 21 Jun 2022 05:43:43 GMT
Subject: RFR: 8263377: Store method handle linkers in the 'non-nmethods' heap [v3]
In-Reply-To:
References:
Message-ID:

> 8263377: Store method handle linkers in the 'non-nmethods' heap

Yi-Fan Tsai has updated the pull request incrementally with one additional commit since the last revision:

  Post dynamic_code_generate event when MH intrinsic generated

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/8760/files
  - new: https://git.openjdk.org/jdk/pull/8760/files/00c99435..e3e9f979

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=8760&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=8760&range=01-02

  Stats: 10 lines in 3 files changed: 10 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/8760.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/8760/head:pull/8760

PR: https://git.openjdk.org/jdk/pull/8760

From xliu at openjdk.org Tue Jun 21 08:04:55 2022
From: xliu at openjdk.org (Xin Liu)
Date: Tue, 21 Jun 2022 08:04:55 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v13]
In-Reply-To:
References:
Message-ID:

> I found that some phi nodes are kept alive only because their uses are uncommon_trap nodes. In the worst case, this hinders boxing object elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows the C2 parser to collect liveness based on the next bci for unstable_if traps. In most cases, the next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps.
>
> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal.
I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of the credit goes to the original author. I found that scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in the original test even if it can't be removed by the later inliner. I tweaked the profiling data of Integer.valueOf() to hinder scalar replacement.
>
> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost the same as JDK-8261137. Besides runtime, the codecache utilization reduces from 1648 bytes to 1192 bytes, or 27.6%
>
> Before:
>
>     Benchmark               Mode  Cnt   Score   Error  Units
>     MyBenchmark.testMethod  avgt   10  32.776 ± 0.075  us/op
>
>     Compiled method (c2)     281  636       4       MyBenchmark::testMethod (50 bytes)
>      total in heap  [0x00007fa1e49ab510,0x00007fa1e49abb80] = 1648
>      relocation     [0x00007fa1e49ab670,0x00007fa1e49ab6b0] = 64
>      main code      [0x00007fa1e49ab6c0,0x00007fa1e49ab940] = 640
>      stub code      [0x00007fa1e49ab940,0x00007fa1e49ab968] = 40
>      oops           [0x00007fa1e49ab968,0x00007fa1e49ab978] = 16
>      metadata       [0x00007fa1e49ab978,0x00007fa1e49ab990] = 24
>      scopes data    [0x00007fa1e49ab990,0x00007fa1e49aba60] = 208
>      scopes pcs     [0x00007fa1e49aba60,0x00007fa1e49abb30] = 208
>      dependencies   [0x00007fa1e49abb30,0x00007fa1e49abb38] = 8
>      handler table  [0x00007fa1e49abb38,0x00007fa1e49abb68] = 48
>      nul chk table  [0x00007fa1e49abb68,0x00007fa1e49abb80] = 24
>
> After:
>
>     Benchmark               Mode  Cnt  Score   Error  Units
>     MyBenchmark.testMethod  avgt   10  3.656 ± 0.006  us/op
>
>     Compiled method (c2)     288  633       4       MyBenchmark::testMethod (50 bytes)
>      total in heap  [0x00007f35189ab010,0x00007f35189ab4b8] = 1192
>      relocation     [0x00007f35189ab170,0x00007f35189ab1a0] = 48
>      main code      [0x00007f35189ab1a0,0x00007f35189ab360] = 448
>      stub code      [0x00007f35189ab360,0x00007f35189ab388] = 40
>      oops           [0x00007f35189ab388,0x00007f35189ab390] = 8
>      metadata       [0x00007f35189ab390,0x00007f35189ab398] = 8
>      scopes data    [0x00007f35189ab398,0x00007f35189ab408] = 112
>      scopes pcs     [0x00007f35189ab408,0x00007f35189ab488] = 128
>      dependencies   [0x00007f35189ab488,0x00007f35189ab490] = 8
>      handler table  [0x00007f35189ab490,0x00007f35189ab4a8] = 24
>      nul chk table  [0x00007f35189ab4a8,0x00007f35189ab4b8] = 16
>
> Testing
> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet.

Xin Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 29 additional commits since the last revision:

 - update per reviewer's feedback. also changed the option from AggressiveLivenessForUnstableIf to OptimizeUnstableIf.
 - Merge branch 'master' into JDK-8286104
 - monior change for code style.
 - Bail out if fold-compares sees that a unstable_if trap has modified. Also add a regression test
 - Merge branch 'master' into JDK-8286104
 - Remame all methods to _unstable_if_trap(s) and group them.
 - move preprocess() after remove Useless.
 - Refactor per reviewer's feedback.
 - Remove useless flag. if jdwp is on, liveness_at_bci() marks all local variables live.
 - support option AggressiveLivessForUnstableIf
 - ...
and 19 more: https://git.openjdk.org/jdk/compare/f99fcdc7...e5c8e559

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/8545/files
  - new: https://git.openjdk.org/jdk/pull/8545/files/81a8ccf9..e5c8e559

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=8545&range=12
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=8545&range=11-12

  Stats: 36588 lines in 1098 files changed: 27001 ins; 5420 del; 4167 mod
  Patch: https://git.openjdk.org/jdk/pull/8545.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/8545/head:pull/8545

PR: https://git.openjdk.org/jdk/pull/8545

From xliu at openjdk.org Tue Jun 21 08:04:58 2022
From: xliu at openjdk.org (Xin Liu)
Date: Tue, 21 Jun 2022 08:04:58 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v12]
In-Reply-To:
References:
Message-ID: <2UEiCky8fn9NCs8KjUKo6tWKZGaiPdrOgjFKa_GVrwI=.427fd7cf-f482-42e3-b5c5-ed6399798117@github.com>

On Mon, 20 Jun 2022 07:13:15 GMT, Tobias Hartmann wrote:

>> Xin Liu has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   monior change for code style.
>
> src/hotspot/share/opto/parse.hpp line 636:
>
>> 634:   }
>> 635:
>> 636:   Parse::Block* path() const {
>
> This method is not used.

removed.

-------------

PR: https://git.openjdk.org/jdk/pull/8545

From xgong at openjdk.org Tue Jun 21 08:48:38 2022
From: xgong at openjdk.org (Xiaohong Gong)
Date: Tue, 21 Jun 2022 08:48:38 GMT
Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v5]
In-Reply-To:
References:
Message-ID:

> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct.
>
> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops.
>
> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop.
>
> Here is an example for vector load and add reduction inside a loop:
>
>     ptrue   p0.s, vl8             ; mask generation
>     ld1w    {z16.s}, p0/z, [x14]  ; load vector
>
>     ptrue   p0.s, vl8             ; mask generation
>     uaddv   d17, p0, z16.s        ; add reduction
>     smov    x14, v17.s[0]
>
> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop.
>
> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out.
>
> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system:
>
>     Benchmark                  size   Gain
>     Byte256Vector.ADDLanes     1024   0.999
>     Byte256Vector.ANDLanes     1024   1.065
>     Byte256Vector.MAXLanes     1024   1.064
>     Byte256Vector.MINLanes     1024   1.062
>     Byte256Vector.ORLanes      1024   1.072
>     Byte256Vector.XORLanes     1024   1.041
>     Short256Vector.ADDLanes    1024   1.017
>     Short256Vector.ANDLanes    1024   1.044
>     Short256Vector.MAXLanes    1024   1.049
>     Short256Vector.MINLanes    1024   1.049
>     Short256Vector.ORLanes     1024   1.089
>     Short256Vector.XORLanes    1024   1.047
>     Int256Vector.ADDLanes      1024   1.045
>     Int256Vector.ANDLanes      1024   1.078
>     Int256Vector.MAXLanes      1024   1.123
>     Int256Vector.MINLanes      1024   1.129
>     Int256Vector.ORLanes       1024   1.078
>     Int256Vector.XORLanes      1024   1.072
>     Long256Vector.ADDLanes     1024   1.059
>     Long256Vector.ANDLanes     1024   1.101
>     Long256Vector.MAXLanes     1024   1.079
>     Long256Vector.MINLanes     1024   1.099
>     Long256Vector.ORLanes      1024   1.098
>     Long256Vector.XORLanes     1024   1.110
>     Float256Vector.ADDLanes    1024   1.033
>     Float256Vector.MAXLanes    1024   1.156
>     Float256Vector.MINLanes    1024   1.151
>     Double256Vector.ADDLanes   1024   1.062
>     Double256Vector.MAXLanes   1024   1.145
>     Double256Vector.MINLanes   1024   1.140
>
> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below:
>
>     sxtw    x14, w14
>     whilelo p0.s, xzr, x14  =>  whilelo p0.s, wzr, w14

Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains six commits:

 - Merge branch 'jdk:master' into JDK-8286941
 - Fix the ci build issue
 - Address review comments, revert changes for gatherL/scatterL rules
 - Merge branch 'jdk:master' into JDK-8286941
 - Revert transformation from MaskAll to VectorMaskGen, address review comments
 - 8286941: Add mask IR for partial vector operations for ARM SVE

-------------

Changes: https://git.openjdk.org/jdk/pull/9037/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9037&range=04
  Stats: 2029 lines in 19 files changed: 784 ins; 826 del; 419 mod
  Patch: https://git.openjdk.org/jdk/pull/9037.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9037/head:pull/9037

PR: https://git.openjdk.org/jdk/pull/9037

From rrich at openjdk.org Tue Jun 21 12:15:36 2022
From: rrich at openjdk.org (Richard Reingruber)
Date: Tue, 21 Jun 2022 12:15:36 GMT
Subject: RFR: 8288781: C1: LIR_OpVisitState::maxNumberOfOperands too small
Message-ID:

Increment `LIR_OpVisitState::maxNumberOfOperands` by 1 to allow C1 compilation of a method that receives 21 parameters in registers instead of crashing.

Add regression test. The regression test crashes on ppc because there all parameters (8 integer + 13 float = 21) can be passed in registers.

The fix passed our CI testing. This includes most JCK and JTREG test, also in Xcomp mode, on the standard platforms and also on Linux/PPC64le.
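
As a rough illustration of the scenario described above, a method with 8 integer and 13 float parameters could look like the sketch below. This is not the actual `TestManyMethodParameters.java` from the patch; the class name `ManyParametersSketch` and the method `sum` are hypothetical, and whether all 21 arguments really end up in registers depends on the platform calling convention (it is assumed to be the case on PPC64 only because the message above says so).

```java
// Hypothetical sketch, not the regression test from the patch.
class ManyParametersSketch {

    // 8 integer + 13 float parameters = 21 parameters in total.
    static float sum(int i0, int i1, int i2, int i3, int i4, int i5, int i6, int i7,
                     float f0, float f1, float f2, float f3, float f4, float f5, float f6,
                     float f7, float f8, float f9, float f10, float f11, float f12) {
        return i0 + i1 + i2 + i3 + i4 + i5 + i6 + i7
             + f0 + f1 + f2 + f3 + f4 + f5 + f6 + f7 + f8 + f9 + f10 + f11 + f12;
    }

    public static void main(String[] args) {
        float result = 0;
        // Call the method often enough that a JIT compiler is likely to compile it.
        for (int n = 0; n < 20_000; n++) {
            result = sum(1, 1, 1, 1, 1, 1, 1, 1,
                         1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f, 1f);
        }
        System.out.println(result);  // prints 21.0
    }
}
```

Running such a class with `-Xcomp`, or with `-XX:TieredStopAtLevel=1` to stay in C1, is one way the compilation path in question might be exercised; the flags used by the real test may differ.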

-------------

Commit messages:
 - Increment LIR_OpVisitState::maxNumberOfOperands
 - Regression test TestManyMethodParameters.java

Changes: https://git.openjdk.org/jdk19/pull/51/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=51&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8288781
  Stats: 58 lines in 2 files changed: 57 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk19/pull/51.diff
  Fetch: git fetch https://git.openjdk.org/jdk19 pull/51/head:pull/51

PR: https://git.openjdk.org/jdk19/pull/51

From shade at openjdk.org Tue Jun 21 14:55:47 2022
From: shade at openjdk.org (Aleksey Shipilev)
Date: Tue, 21 Jun 2022 14:55:47 GMT
Subject: RFR: 8288781: C1: LIR_OpVisitState::maxNumberOfOperands too small
In-Reply-To:
References:
Message-ID:

On Tue, 21 Jun 2022 09:13:51 GMT, Richard Reingruber wrote:

> Increment `LIR_OpVisitState::maxNumberOfOperands` by 1 to allow C1 compilation of a method that receives 21 parameters in registers instead of crashing.
>
> Add regression test. The regression test crashes on ppc because there all parameters (8 integer + 13 float = 21) can be passed in registers.
>
> The fix passed our CI testing. This includes most JCK and JTREG test, also in Xcomp mode, on the standard platforms and also on Linux/PPC64le.

Looks fine. Last time this was bumped in [JDK-8004051](https://bugs.openjdk.org/browse/JDK-8004051), and there seem to be no recorded regressions for that bump.

-------------

Marked as reviewed by shade (Reviewer).

PR: https://git.openjdk.org/jdk19/pull/51

From jbhateja at openjdk.org Tue Jun 21 15:18:54 2022
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Tue, 21 Jun 2022 15:18:54 GMT
Subject: RFR: 8288467: remove memory_operand assert for spilled instructions [v2]
In-Reply-To:
References:
Message-ID:

On Mon, 20 Jun 2022 08:30:26 GMT, Emanuel Peter wrote:

>> In [JDK-8282555](https://bugs.openjdk.org/browse/JDK-8282555) I added this assert, because on x64 this seems to always hold.
But it turns out there are instructions on x86 (32bit) that violate this assumption.
>>
>> **Why it holds on x64**
>> It seems we only ever do one read or one write per instruction. Instructions with multiple memory operands are extremely rare.
>> 1. we spill from register to memory: we land in the if case, where the cisc node has an additional input slot for the memory edge.
>> 2. we spill from register to stackSlot: no additional input slot is reserved, we land in else case and add an additional precedence edge.
>>
>> **Why it is violated on x86 (32bit)**
>> We have additional cases that land in the else case. For example spilling `src1` from `addFPR24_reg_mem` to `addFPR24_mem_cisc`.
>> https://github.com/openjdk/jdk19/blob/53bf1bfdabb79b37afedd09051d057f9eea620f2/src/hotspot/cpu/x86/x86_32.ad#L10325-L10327
>> https://github.com/openjdk/jdk19/blob/53bf1bfdabb79b37afedd09051d057f9eea620f2/src/hotspot/cpu/x86/x86_32.ad#L10368-L10370
>> We land in the else case, because both have 2 inputs, thus `oper_input_base() == 2`.
>> And both have memory operands, so the assert must fail.
>>
>> **Solutions**
>> 1. Remove the Assert, as it is incorrect.
>> 2. Extend the assert to be correct.
>>    - case 1: reg to mem spill, where we have a reserved input slot in cisc for memory edge
>>    - case 2: reg to stackSlot spill, where both mach and cisc have no memory operand.
>>    - other cases, with various register, stackSlot and memory inputs and outputs. We would have to find a general rule, and test it properly, which is not trivial because getting registers to spill is not easy to precisely provoke.
>> 3. Have platform dependent asserts. But also this makes testing harder.
>>
>> For now I went with 1. as it is simple and as far as I can see correct.
>>
>> Running tests on x64 (should not fail). **I need someone to help me with testing x86 (32bit)**. I only verified the reported test failure with a 32bit build.
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>
>   Update src/hotspot/share/opto/chaitin.cpp
>
>   fix typo
>
>   Co-authored-by: Tobias Hartmann

Marked as reviewed by jbhateja (Committer).

Hi @eme64,
Thanks for pointing this out. Original bug was related to incorrect scheduling of CISC node with just the stack operand which was circumvented by adding a precedence edge from incoming spill copy node. In general there is always a 1-1 correspondence b/w memory_operand index of CISC instruction and cisc_operand index of non-CISC counterpart.

As you mentioned there are not many cases where an instruction accept more than one memory operand.

Following patterns in x86_32.ad (which are relevant to SSE targets) are the only occurrences accepting more than one memory operand.

    instruct addFPR24_mem_cisc(stackSlotF dst, memory src1, memory src2) %{
    instruct addFPR24_mem_mem(stackSlotF dst, memory src1, memory src2) %{
    instruct mulFPR24_mem_mem(stackSlotF dst, memory src1, memory src2) %{

And for each of these a constant -1 value is returned as memory_operand currently.

    const MachOper* addFPR24_mem_ciscNode::memory_operand() const { return (MachOper*)-1; }
    const MachOper* addFPR24_mem_cisc_0Node::memory_operand() const { return (MachOper*)-1; }
    const MachOper* addFPR24_mem_memNode::memory_operand() const { return (MachOper*)-1; }
    const MachOper* mulFPR24_mem_memNode::memory_operand() const { return (MachOper*)-1; }

Post matcher operand array is sacrosanct, only problem here is that oper_input_base() for **non-cisc counterparts** are 2 since first input is memory edge. Due to this it enters the control flow having assertion check.

Your fix to remove assertion, looks safe to me.
Best Regards, Jatin ------------- PR: https://git.openjdk.org/jdk19/pull/33 From epeter at openjdk.org Tue Jun 21 15:22:06 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 21 Jun 2022 15:22:06 GMT Subject: RFR: 8288467: remove memory_operand assert for spilled instructions [v2] In-Reply-To: References: Message-ID: On Tue, 21 Jun 2022 15:15:51 GMT, Jatin Bhateja wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Update src/hotspot/share/opto/chaitin.cpp >> >> fix typo >> >> Co-authored-by: Tobias Hartmann > > Hi @eme64, > Thanks for pointing this out. Original bug was related to incorrect scheduling of CISC node with just the stack operand which was circumvented by adding a precedence edge from incoming spill copy node. In general there is always a 1-1 correspondence b/w memory_operand index of CISC instruction and cisc_operand index of non-CISC counterpart. > > As you mentioned there are not many cases where an instruction accept more than one memory operand. > > Following patterns in x86_32.ad (which are relevant to SSE targets) are the only occurrences accepting more than one memory operand. > > instruct addFPR24_mem_cisc(stackSlotF dst, memory src1, memory src2) %{ > instruct addFPR24_mem_mem(stackSlotF dst, memory src1, memory src2) %{ > instruct mulFPR24_mem_mem(stackSlotF dst, memory src1, memory src2) %{ > > And for each of these a constant -1 value is returned as memory_operand currently. 
> > const MachOper* addFPR24_mem_ciscNode::memory_operand() const { return (MachOper*)-1; } > const MachOper* addFPR24_mem_cisc_0Node::memory_operand() const { return (MachOper*)-1; } > const MachOper* addFPR24_mem_memNode::memory_operand() const { return (MachOper*)-1; } > const MachOper* mulFPR24_mem_memNode::memory_operand() const { return (MachOper*)-1; } > > Post matcher operand array is sacrosanct, only problem here is that oper_input_base() for **non-cisc counterparts** are 2 since first input is memory edge. Due to this it enters the control flow having assertion check. > > Your fix to remove assertion, looks safe to me. > > Best Regards, > Jatin Thanks @jatin-bhateja @shipilev @TobiHartmann for your reviews, and help with clarifying, verifying and even testing. ------------- PR: https://git.openjdk.org/jdk19/pull/33 From epeter at openjdk.org Tue Jun 21 15:24:58 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 21 Jun 2022 15:24:58 GMT Subject: Integrated: 8288467: remove memory_operand assert for spilled instructions In-Reply-To: References: Message-ID: On Fri, 17 Jun 2022 09:32:40 GMT, Emanuel Peter wrote: > In [JDK-8282555](https://bugs.openjdk.org/browse/JDK-8282555) I added this assert, because on x64 this seems to always hold. But it turns out there are instructions on x86 (32bit) that violate this assumption. > > **Why it holds on x64** > It seems we only ever do one read or one write per instruction. Instructions with multiple memory operands are extremely rare. > 1. we spill from register to memory: we land in the if case, where the cisc node has an additional input slot for the memory edge. > 2. we spill from register to stackSlot: no additional input slot is reserved, we land in else case and add an additional precedence edge. > > **Why it is violated on x86 (32bit)** > We have additional cases that land in the else case. For example spilling `src1` from `addFPR24_reg_mem` to `addFPR24_mem_cisc`. 
> https://github.com/openjdk/jdk19/blob/53bf1bfdabb79b37afedd09051d057f9eea620f2/src/hotspot/cpu/x86/x86_32.ad#L10325-L10327 https://github.com/openjdk/jdk19/blob/53bf1bfdabb79b37afedd09051d057f9eea620f2/src/hotspot/cpu/x86/x86_32.ad#L10368-L10370 > We land in the else case, because both have 2 inputs, thus `oper_input_base() == 2`. > And both have memory operands, so the assert must fail. > > **Solutions** > 1. Remove the Assert, as it is incorrect. > 2. Extend the assert to be correct. > - case 1: reg to mem spill, where we have a reserved input slot in cisc for memory edge > - case 2: reg to stackSlot spill, where both mach and cisc have no memory operand. > - other cases, with various register, stackSlot and memory inputs and outputs. We would have to find a general rule, and test it properly, which is not trivial because getting registers to spill is not easy to precisely provoke. > 3. Have platform dependent asserts. But also this makes testing harder. > > For now I went with 1. as it is simple and as far as I can see correct. > > Running tests on x64 (should not fail). **I need someone to help me with testing x86 (32bit)**. I only verified the reported test failure with a 32bit build. This pull request has now been integrated.
Changeset: af051391 Author: Emanuel Peter URL: https://git.openjdk.org/jdk19/commit/af05139133530871c88991aa0340205cfc44972a Stats: 7 lines in 1 file changed: 5 ins; 0 del; 2 mod 8288467: remove memory_operand assert for spilled instructions Reviewed-by: thartmann, shade, jbhateja ------------- PR: https://git.openjdk.org/jdk19/pull/33 From duke at openjdk.org Tue Jun 21 18:39:32 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Tue, 21 Jun 2022 18:39:32 GMT Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls In-Reply-To: References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> Message-ID: On Fri, 10 Jun 2022 11:36:58 GMT, Evgeny Astigeevich wrote: > GHA testing is not clean. > > I looked through changes and they seem logically correct. Need more testing. I will wait when GHA is clean. Vladimir(@vnkozlov), Have you got testing results? ------------- PR: https://git.openjdk.org/jdk/pull/8816 From mdoerr at openjdk.org Tue Jun 21 21:25:19 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 21 Jun 2022 21:25:19 GMT Subject: RFR: 8288781: C1: LIR_OpVisitState::maxNumberOfOperands too small In-Reply-To: References: Message-ID: On Tue, 21 Jun 2022 09:13:51 GMT, Richard Reingruber wrote: > Increment `LIR_OpVisitState::maxNumberOfOperands` by 1 to allow C1 compilation of a method that receives 21 parameters in registers instead of crashing. > > Add regression test. The regression test crashes on ppc because there all parameters (8 integer + 13 float = 21) can be passed in registers. > > The fix passed our CI testing. This includes most JCK and JTREG test, also in Xcomp mode, on the standard platforms and also on Linux/PPC64le. Thanks for fixing it and providing a test. LGTM. ------------- Marked as reviewed by mdoerr (Reviewer). 
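The 8288781 description above mentions a method that receives 21 parameters in registers (8 integer + 13 float on ppc). A minimal Java sketch of what such a method could look like follows. This is not the regression test from the pull request; the class name, method name, argument values and warm-up count are assumptions for illustration only, and whether all arguments really end up in registers depends on the platform's calling convention.

```java
public class ManyRegisterArgsSketch {
    // Hypothetical method with 8 integer and 13 floating-point parameters (21 in total).
    static double sum(int i1, int i2, int i3, int i4, int i5, int i6, int i7, int i8,
                      double d1, double d2, double d3, double d4, double d5, double d6,
                      double d7, double d8, double d9, double d10, double d11,
                      double d12, double d13) {
        int ints = i1 + i2 + i3 + i4 + i5 + i6 + i7 + i8;
        double doubles = d1 + d2 + d3 + d4 + d5 + d6 + d7 + d8 + d9 + d10 + d11 + d12 + d13;
        return ints + doubles;
    }

    public static void main(String[] args) {
        double last = 0;
        // Call the method repeatedly so that a JIT compiler is likely to compile it.
        // The iteration count is an arbitrary assumption.
        for (int i = 0; i < 20_000; i++) {
            last = sum(1, 2, 3, 4, 5, 6, 7, 8,
                       1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0);
        }
        System.out.println(last);
    }
}
```

Running a sketch like this with C1 only (for example with `-XX:TieredStopAtLevel=1`) would be one way to exercise the C1 path, though it is not guaranteed to reproduce the original crash.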
PR: https://git.openjdk.org/jdk19/pull/51 From duke at openjdk.org Wed Jun 22 03:01:36 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Wed, 22 Jun 2022 03:01:36 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long [v3] In-Reply-To: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: > Hi, > > This patch implements intrinsics for `Integer/Long::compareUnsigned` using the same approach as the JVM does for long and floating-point comparisons. This allows efficient and reliable usage of unsigned comparison in Java, which is a basic operation and is important for range checks such as discussed in #8620 . > > Thank you very much. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: add comparison for direct value of compare ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9068/files - new: https://git.openjdk.org/jdk/pull/9068/files/b5627135..0ab881ac Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9068&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9068&range=01-02 Stats: 18 lines in 2 files changed: 16 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9068.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9068/head:pull/9068 PR: https://git.openjdk.org/jdk/pull/9068 From duke at openjdk.org Wed Jun 22 03:01:37 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Wed, 22 Jun 2022 03:01:37 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long In-Reply-To: References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: <46lhnsDuITWou5doLCpqWXKDsJzHVaptYMjXfjxUICk=.041b084c-b2ef-4108-a5ac-8e86ec64e819@github.com> On Fri, 10 Jun 2022 00:05:53 GMT, Sandhya Viswanathan wrote: >> I 
have added a benchmark for the intrinsic. The result is as follows, thanks a lot: >> >> Before After >> Benchmark (size) Mode Cnt Score Error Score Error Units >> Integers.compareUnsigned 500 avgt 15 0.527 ± 0.002 0.498 ± 0.011 us/op >> Longs.compareUnsigned 500 avgt 15 0.677 ± 0.014 0.561 ± 0.006 us/op > > @merykitty Could you please also add the micro benchmark where compareUnsigned result is stored directly in an integer and show the performance of that? Thanks @sviswa7 for the suggestion, the results of getting the value of `compareUnsigned` directly are as follows: Before After Benchmark (size) Mode Cnt Score Error Score Error Units Integers.compareUnsignedDirect 500 avgt 15 0.639 ± 0.022 0.626 ± 0.002 us/op Longs.compareUnsignedDirect 500 avgt 15 0.672 ± 0.011 0.609 ± 0.004 us/op ------------- PR: https://git.openjdk.org/jdk/pull/9068 From dlong at openjdk.org Wed Jun 22 03:40:41 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 22 Jun 2022 03:40:41 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v2] In-Reply-To: References: Message-ID: <5kyEjwQ1_IG5g8EQzAYl5ft4U1_oVcqylEWIe99i5cM=.c8b003aa-b81c-4eae-a206-af9861c75b3d@github.com> > The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out.
Dean Long has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/codegen/ShiftByZero.java Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk19/pull/40/files - new: https://git.openjdk.org/jdk19/pull/40/files/d0c2bf33..e432a308 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=40&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=40&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk19/pull/40.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/40/head:pull/40 PR: https://git.openjdk.org/jdk19/pull/40 From dlong at openjdk.org Wed Jun 22 03:40:41 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 22 Jun 2022 03:40:41 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v2] In-Reply-To: References: Message-ID: On Mon, 20 Jun 2022 05:49:25 GMT, Tobias Hartmann wrote: > What instruction will the zero-shift be matched with then? The shift instruction that takes a vector register shift count rather than an immediate. > test/hotspot/jtreg/compiler/codegen/ShiftByZero.java line 29: > >> 27: * @summary Test shift by 0 >> 28: * @library /test/lib >> 29: * @run main compiler.codegen.ShiftByZero > > I don't think these two lines are needed. Thanks for catching it! 
------------- PR: https://git.openjdk.org/jdk19/pull/40 From dlong at openjdk.org Wed Jun 22 03:42:45 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 22 Jun 2022 03:42:45 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding In-Reply-To: References: Message-ID: On Mon, 20 Jun 2022 06:28:54 GMT, Hao Sun wrote: > Instead of introducing immI_positive, I wonder if we can generate orr dst src for zero shift count, just as the SVE part does I considered using left-shift by 0, but I don't think it's worth it to optimize this case in the back-end. If we really want to optimize shift by 0, I think it should be done in the front-end. ------------- PR: https://git.openjdk.org/jdk19/pull/40 From dlong at openjdk.org Wed Jun 22 03:46:41 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 22 Jun 2022 03:46:41 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v3] In-Reply-To: References: Message-ID: > The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out. 
Dean Long has updated the pull request incrementally with one additional commit since the last revision: remove unneeded lines ------------- Changes: - all: https://git.openjdk.org/jdk19/pull/40/files - new: https://git.openjdk.org/jdk19/pull/40/files/e432a308..0f927ce7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=40&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=40&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk19/pull/40.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/40/head:pull/40 PR: https://git.openjdk.org/jdk19/pull/40 From thartmann at openjdk.org Wed Jun 22 05:50:53 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 22 Jun 2022 05:50:53 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v3] In-Reply-To: References: Message-ID: <9wwxdaEbv_VTaKmP57E97iOT1k_XAA8YoTC_ZZh9pMY=.d901ca6f-f9c1-46f2-ba19-ea6eb91ab243@github.com> On Wed, 22 Jun 2022 03:46:41 GMT, Dean Long wrote: >> The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > remove unneeded lines Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk19/pull/40 From thartmann at openjdk.org Wed Jun 22 05:57:01 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 22 Jun 2022 05:57:01 GMT Subject: RFR: 8288781: C1: LIR_OpVisitState::maxNumberOfOperands too small In-Reply-To: References: Message-ID: On Tue, 21 Jun 2022 09:13:51 GMT, Richard Reingruber wrote: > Increment `LIR_OpVisitState::maxNumberOfOperands` by 1 to allow C1 compilation of a method that receives 21 parameters in registers instead of crashing. > > Add regression test. The regression test crashes on ppc because there all parameters (8 integer + 13 float = 21) can be passed in registers. > > The fix passed our CI testing. This includes most JCK and JTREG test, also in Xcomp mode, on the standard platforms and also on Linux/PPC64le. Looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk19/pull/51 From haosun at openjdk.org Wed Jun 22 06:17:04 2022 From: haosun at openjdk.org (Hao Sun) Date: Wed, 22 Jun 2022 06:17:04 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding In-Reply-To: References: Message-ID: On Wed, 22 Jun 2022 03:40:14 GMT, Dean Long wrote: > > Instead of introducing immI_positive, I wonder if we can generate orr dst src for zero shift count, just as the SVE part does > > I considered using left-shift by 0, but I don't think it's worth it to optimize this case in the back-end. If we really want to optimize shift by 0, I think it should be done in the front-end. Okay. I think we may want to make the same update to the corresponding SVE part, e.g. https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L3662, as it's better to align with the NEON and SVE implementations.
It may deserve one separate patch, not in this bugfix one. ------------- PR: https://git.openjdk.org/jdk19/pull/40 From roland at openjdk.org Wed Jun 22 12:23:20 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 22 Jun 2022 12:23:20 GMT Subject: RFR: 8288968: C2: remove use of covariant returns in type.[ch]pp Message-ID: This removes some use of covariant returns that I added with 8275201 (C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses). ------------- Commit messages: - covariant returns Changes: https://git.openjdk.org/jdk/pull/9237/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9237&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288968 Stats: 77 lines in 4 files changed: 0 ins; 0 del; 77 mod Patch: https://git.openjdk.org/jdk/pull/9237.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9237/head:pull/9237 PR: https://git.openjdk.org/jdk/pull/9237 From duke at openjdk.org Wed Jun 22 13:00:53 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Wed, 22 Jun 2022 13:00:53 GMT Subject: RFR: 8286314: Trampoline not created for far runtime targets outside small CodeCache In-Reply-To: References: Message-ID: <1PUn88OKIt5c2AqFwnerbQt3djKc_mpW3znzaEnWbEQ=.2b8f458a-0a63-48cd-9822-85ab64c5ba66@github.com> On Wed, 22 Jun 2022 11:50:36 GMT, Evgeny Astigeevich wrote: > `relocInfo::runtime_call_type` calls can have targets inside or outside CodeCache. If offsets to the targets are not in range, trampoline calls must be used. Currently trampolines for calls are generated based on the size of CodeCache and the maximum possible branch range. If the size of CodeCache is smaller than the range, no trampoline is generated. This works well if a target is inside CodeCache. It does not work if a target is outside CodeCache and CodeCache size is smaller than the maximum possible branch range. > > ### Solution > Runtime call sites are always in CodeCache. 
In case of a target outside small CodeCache, we can find the start of the longest possible branch from CodeCache. Then we check with `reachable_from_branch_at` whether a target is reachable. If not, a trampoline is needed. > > ### Testing > It is tested with the release and fastdebug builds. Release builds have the maximum possible branch range 128M. Fastdebug builds have it 2M for the purpose of testing trampolines. > Results: > - `gtest`: Passed > - `tier1`: Passed Hi Andrew(@theRealAph), Could you please review the fix? Thank you, Evgeny ------------- PR: https://git.openjdk.org/jdk/pull/9235 From aph at openjdk.org Wed Jun 22 13:28:39 2022 From: aph at openjdk.org (Andrew Haley) Date: Wed, 22 Jun 2022 13:28:39 GMT Subject: RFR: 8286314: Trampoline not created for far runtime targets outside small CodeCache In-Reply-To: References: Message-ID: On Wed, 22 Jun 2022 11:50:36 GMT, Evgeny Astigeevich wrote: > `relocInfo::runtime_call_type` calls can have targets inside or outside CodeCache. If offsets to the targets are not in range, trampoline calls must be used. Currently trampolines for calls are generated based on the size of CodeCache and the maximum possible branch range. If the size of CodeCache is smaller than the range, no trampoline is generated. This works well if a target is inside CodeCache. It does not work if a target is outside CodeCache and CodeCache size is smaller than the maximum possible branch range. > > ### Solution > Runtime call sites are always in CodeCache. In case of a target outside small CodeCache, we can find the start of the longest possible branch from CodeCache. Then we check with `reachable_from_branch_at` whether a target is reachable. If not, a trampoline is needed. > > ### Testing > It is tested with the release and fastdebug builds. Release builds have the maximum possible branch range 128M. Fastdebug builds have it 2M for the purpose of testing trampolines. > Results: > - `gtest`: Passed > - `tier1`: Passed Looks good. 
Nice commenting too. I guess we'll need backports. Has this bug ever materialized on old releases? ------------- Marked as reviewed by aph (Reviewer). PR: https://git.openjdk.org/jdk/pull/9235 From phh at openjdk.org Wed Jun 22 17:09:54 2022 From: phh at openjdk.org (Paul Hohensee) Date: Wed, 22 Jun 2022 17:09:54 GMT Subject: RFR: 8286314: Trampoline not created for far runtime targets outside small CodeCache In-Reply-To: References: Message-ID: On Wed, 22 Jun 2022 11:50:36 GMT, Evgeny Astigeevich wrote: > `relocInfo::runtime_call_type` calls can have targets inside or outside CodeCache. If offsets to the targets are not in range, trampoline calls must be used. Currently trampolines for calls are generated based on the size of CodeCache and the maximum possible branch range. If the size of CodeCache is smaller than the range, no trampoline is generated. This works well if a target is inside CodeCache. It does not work if a target is outside CodeCache and CodeCache size is smaller than the maximum possible branch range. > > ### Solution > Runtime call sites are always in CodeCache. In case of a target outside small CodeCache, we can find the start of the longest possible branch from CodeCache. Then we check with `reachable_from_branch_at` whether a target is reachable. If not, a trampoline is needed. > > ### Testing > It is tested with the release and fastdebug builds. Release builds have the maximum possible branch range 128M. Fastdebug builds have it 2M for the purpose of testing trampolines. > Results: > - `gtest`: Passed > - `tier1`: Passed Lgtm. MacOS debug build failure looks unrelated. ------------- Marked as reviewed by phh (Reviewer). 
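The reachability argument in the 8286314 description can be sketched as follows. This is a simplified model, not HotSpot's C++ code: the class, the method `needsTrampoline`, the address values and the symmetric range `[-range, range)` are assumptions for illustration; only the name `reachable_from_branch_at` and the 128M figure come from the description above.

```java
public class TrampolineSketch {
    static final long M = 1L << 20;

    // Simplified stand-in for reachable_from_branch_at: the signed offset
    // from the branch to the target must fit in [-range, range).
    static boolean reachableFromBranchAt(long branch, long target, long range) {
        long offset = target - branch;
        return offset >= -range && offset < range;
    }

    // A call site may sit anywhere in [codeCacheLow, codeCacheHigh).
    // The worst case is the code cache address farthest from the target,
    // i.e. the start of the longest possible branch.
    static boolean needsTrampoline(long codeCacheLow, long codeCacheHigh,
                                   long target, long range) {
        long lastAddress = codeCacheHigh - 1;
        long farthest = Math.abs(target - codeCacheLow) >= Math.abs(target - lastAddress)
                ? codeCacheLow
                : lastAddress;
        return !reachableFromBranchAt(farthest, target, range);
    }

    public static void main(String[] args) {
        long low = 1024 * M;        // hypothetical code cache start
        long high = low + 64 * M;   // small code cache: 64M, below the 128M range
        long range = 128 * M;

        // Target inside the small code cache: always reachable.
        System.out.println(needsTrampoline(low, high, low + 32 * M, range));
        // Runtime target far outside the code cache: a trampoline is needed
        // even though the code cache itself is smaller than the branch range.
        System.out.println(needsTrampoline(low, high, low + 200 * M, range));
    }
}
```

The second call shows the case the fix addresses: deciding from the code cache size alone would skip the trampoline, while checking the farthest possible call site against the target does not.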
PR: https://git.openjdk.org/jdk/pull/9235 From dlong at openjdk.org Wed Jun 22 19:02:08 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 22 Jun 2022 19:02:08 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v3] In-Reply-To: References: Message-ID: <4gnqPEmqpNoBqscnpLnmqV33ebzZ_M7NySROApbArkA=.d7980db4-485e-4ba7-9d90-606b120c9d46@github.com> On Wed, 22 Jun 2022 03:46:41 GMT, Dean Long wrote: >> The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > remove unneeded lines Thanks Tobias. ------------- PR: https://git.openjdk.org/jdk19/pull/40 From dlong at openjdk.org Wed Jun 22 19:02:09 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 22 Jun 2022 19:02:09 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding In-Reply-To: References: Message-ID: On Wed, 22 Jun 2022 06:13:21 GMT, Hao Sun wrote: >>> Instead of introducing immI_positive, I wonder if we can generate orr dst src for zero shift count, just as the SVE part does >> >> I considered using left-shift by 0, but I don't think it's worth it to optimize this case in the back-end. If we really want to optimize shift by 0, I think it should be done in the front-end. > >> > Instead of introducing immI_positive, I wonder if we can generate orr dst src for zero shift count, just as the SVE part does >> >> I considered using left-shift by 0, but I don't think it's worth it to optimize this case in the back-end. If we really want to optimize shift by 0, I think it should be done in the front-end. > > Okay. > > I think we may want to make the same update to the corresponding SVE part, e.g. 
https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L3662, as it's better to align with the NEON and SVE implementations. > > It may deserve one separate patch, not in this bugfix one. @shqking Yes, that's a good idea for a separate RFE targeted for jdk20. ------------- PR: https://git.openjdk.org/jdk19/pull/40 From haosun at openjdk.org Thu Jun 23 00:31:46 2022 From: haosun at openjdk.org (Hao Sun) Date: Thu, 23 Jun 2022 00:31:46 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v3] In-Reply-To: References: Message-ID: On Wed, 22 Jun 2022 03:46:41 GMT, Dean Long wrote: >> The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > remove unneeded lines test/hotspot/jtreg/compiler/codegen/ShiftByZero.java line 42: > 40: public static void bMeth() { > 41: int shift = i32[0]; > 42: for (int i8 = 279; i8 > 1; --i8) { Why one loop is used here? I thought low 6 bits of shift can turn to be zeros in one single iteration. Is there anything I missed? Thanks.
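The arithmetic behind this question can be illustrated with a small Java sketch. It is not the `ShiftByZero.java` test itself; the class name, the helper `shiftLeftBySix` and the starting value are hypothetical. It only shows how repeated `shift <<= 6` behaves on an `int` and why the resulting value acts as a shift count of 0.

```java
public class ShiftCountSketch {
    // Hypothetical helper: applies "shift <<= 6" the given number of times.
    static int shiftLeftBySix(int shift, int iterations) {
        for (int i = 0; i < iterations; i++) {
            shift <<= 6;
        }
        return shift;
    }

    public static void main(String[] args) {
        int start = 0x12345; // arbitrary non-zero starting value (an assumption)

        // After a single iteration the low 6 bits are already zero.
        int once = shiftLeftBySix(start, 1);
        System.out.println((once & 0x3f) == 0);

        // After six iterations 36 bits have been shifted out of a 32-bit int,
        // so the whole value is zero; 278 matches the quoted loop's trip count.
        System.out.println(shiftLeftBySix(start, 6) == 0);
        System.out.println(shiftLeftBySix(start, 278) == 0);

        // Java masks the shift count (low 6 bits for long), so shifting by
        // "once" is a shift by 0 and leaves the value unchanged.
        long value = 0x0123456789abcdefL;
        System.out.println((value >> once) == value);
    }
}
```

So a single iteration is enough for the count to be effectively zero at run time; the sketch says nothing about when the compiler can prove that, which is the point of the loop in the actual test.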
------------- PR: https://git.openjdk.org/jdk19/pull/40 From dlong at openjdk.org Thu Jun 23 03:28:55 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 23 Jun 2022 03:28:55 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v3] In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 00:27:45 GMT, Hao Sun wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> remove unneeded lines > > test/hotspot/jtreg/compiler/codegen/ShiftByZero.java line 42: > >> 40: public static void bMeth() { >> 41: int shift = i32[0]; >> 42: for (int i8 = 279; i8 > 1; --i8) { > > Why one loop is used here? I thought low 6 bits of shift can turn to be zeros in one single iteration. Is there anything I missed? Thanks. This is to confuse the optimizer, so that "shift" is optimized to 0 only after loop vectorization. Otherwise, the back-end may never see such a value. ------------- PR: https://git.openjdk.org/jdk19/pull/40 From haosun at openjdk.org Thu Jun 23 03:38:41 2022 From: haosun at openjdk.org (Hao Sun) Date: Thu, 23 Jun 2022 03:38:41 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v3] In-Reply-To: References: Message-ID: On Wed, 22 Jun 2022 03:46:41 GMT, Dean Long wrote: >> The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > remove unneeded lines LGTM (I'm not a Reviewer). ------------- Marked as reviewed by haosun (Author). 
PR: https://git.openjdk.org/jdk19/pull/40 From haosun at openjdk.org Thu Jun 23 03:38:41 2022 From: haosun at openjdk.org (Hao Sun) Date: Thu, 23 Jun 2022 03:38:41 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v3] In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 03:24:56 GMT, Dean Long wrote: >> test/hotspot/jtreg/compiler/codegen/ShiftByZero.java line 42: >> >>> 40: public static void bMeth() { >>> 41: int shift = i32[0]; >>> 42: for (int i8 = 279; i8 > 1; --i8) { >> >> Why one loop is used here? I thought low 6 bits of shift can turn to be zeros in one single iteration. Is there anything I missed? Thanks. > > This is to confuse the optimizer, so that "shift" is optimized to 0 only after loop vectorization. Otherwise, the back-end may never see such a value. I see. Thanks. ------------- PR: https://git.openjdk.org/jdk19/pull/40 From rrich at openjdk.org Thu Jun 23 05:45:43 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 23 Jun 2022 05:45:43 GMT Subject: RFR: 8288781: C1: LIR_OpVisitState::maxNumberOfOperands too small In-Reply-To: References: Message-ID: On Tue, 21 Jun 2022 09:13:51 GMT, Richard Reingruber wrote: > Increment `LIR_OpVisitState::maxNumberOfOperands` by 1 to allow C1 compilation of a method that receives 21 parameters in registers instead of crashing. > > Add regression test. The regression test crashes on ppc because there all parameters (8 integer + 13 float = 21) can be passed in registers. > > The fix passed our CI testing. This includes most JCK and JTREG test, also in Xcomp mode, on the standard platforms and also on Linux/PPC64le. Thanks for the review! 
------------- PR: https://git.openjdk.org/jdk19/pull/51 From rrich at openjdk.org Thu Jun 23 05:45:44 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 23 Jun 2022 05:45:44 GMT Subject: Integrated: 8288781: C1: LIR_OpVisitState::maxNumberOfOperands too small In-Reply-To: References: Message-ID: On Tue, 21 Jun 2022 09:13:51 GMT, Richard Reingruber wrote: > Increment `LIR_OpVisitState::maxNumberOfOperands` by 1 to allow C1 compilation of a method that receives 21 parameters in registers instead of crashing. > > Add regression test. The regression test crashes on ppc because there all parameters (8 integer + 13 float = 21) can be passed in registers. > > The fix passed our CI testing. This includes most JCK and JTREG test, also in Xcomp mode, on the standard platforms and also on Linux/PPC64le. This pull request has now been integrated. Changeset: 3f5e48a4 Author: Richard Reingruber URL: https://git.openjdk.org/jdk19/commit/3f5e48a44ee77d07dea3d2c4ae52aaf19b8dc7cb Stats: 58 lines in 2 files changed: 57 ins; 0 del; 1 mod 8288781: C1: LIR_OpVisitState::maxNumberOfOperands too small Reviewed-by: shade, mdoerr, thartmann ------------- PR: https://git.openjdk.org/jdk19/pull/51 From xliu at openjdk.org Thu Jun 23 07:02:54 2022 From: xliu at openjdk.org (Xin Liu) Date: Thu, 23 Jun 2022 07:02:54 GMT Subject: RFR: 8288968: C2: remove use of covariant returns in type.[ch]pp In-Reply-To: References: Message-ID: On Wed, 22 Jun 2022 12:15:16 GMT, Roland Westrelin wrote: > This removes some use of covariant returns that I added with 8275201 > (C2: hide klass() accessor from TypeOopPtr and typeKlassPtr > subclasses). your patch looks good to me. I am not a reviewer. Why don't we just change the hotspot codestyle instead? This patch shows that the feature is useful in type.hpp and it does simplify code. ------------- Marked as reviewed by xliu (Committer). 
PR: https://git.openjdk.org/jdk/pull/9237 From xliu at openjdk.org Thu Jun 23 07:07:42 2022 From: xliu at openjdk.org (Xin Liu) Date: Thu, 23 Jun 2022 07:07:42 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v12] In-Reply-To: References: Message-ID: On Mon, 20 Jun 2022 21:12:39 GMT, Xin Liu wrote: >> src/hotspot/share/opto/parse.hpp line 607: >> >>> 605: >>> 606: // Specialized uncommon_trap of unstable_if, we have 2 optimizations for them: >>> 607: // 1. suppress trivial Unstable_If traps >> >> Where is this done? > > I have another patch based on this one. I will post it soon. here is the second patch. https://github.com/openjdk/jdk/pull/9255/commits/edb294c2b2ae4e8a5c7f6a1f9c73955af3d8b81c ------------- PR: https://git.openjdk.org/jdk/pull/8545 From chagedorn at openjdk.org Thu Jun 23 07:20:55 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 23 Jun 2022 07:20:55 GMT Subject: RFR: 8288968: C2: remove use of covariant returns in type.[ch]pp In-Reply-To: References: Message-ID: On Wed, 22 Jun 2022 12:15:16 GMT, Roland Westrelin wrote: > This removes some use of covariant returns that I added with 8275201 > (C2: hide klass() accessor from TypeOopPtr and typeKlassPtr > subclasses). Looks good! > Why don't we just change the hotspot codestyle instead? This patch shows that the feature is useful in type.hpp and it does simplify code. I was wondering the same thing. The code becomes much cleaner if we would allow this. ------------- Marked as reviewed by chagedorn (Reviewer).
PR: https://git.openjdk.org/jdk/pull/9237 From xliu at openjdk.org Thu Jun 23 07:21:58 2022 From: xliu at openjdk.org (Xin Liu) Date: Thu, 23 Jun 2022 07:21:58 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v12] In-Reply-To: References: Message-ID: On Mon, 20 Jun 2022 07:29:27 GMT, Tobias Hartmann wrote: >> Xin Liu has updated the pull request incrementally with one additional commit since the last revision: >> >> monior change for code style. > > src/hotspot/share/opto/parse.hpp line 607: > >> 605: >> 606: // Specialized uncommon_trap of unstable_if, we have 2 optimizations for them: >> 607: // 1. suppress trivial Unstable_If traps > > Where is this done? Hi @TobiHartmann, could you take a look at the new revision? The 2nd PR is based on the first one: https://github.com/openjdk/jdk/pull/9255 ------------- PR: https://git.openjdk.org/jdk/pull/8545 From aph at openjdk.org Thu Jun 23 07:49:53 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 23 Jun 2022 07:49:53 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v3] In-Reply-To: References: Message-ID: On Wed, 22 Jun 2022 03:46:41 GMT, Dean Long wrote: >> The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > remove unneeded lines test/hotspot/jtreg/compiler/codegen/ShiftByZero.java line 43: > 41: int shift = i32[0]; > 42: for (int i8 = 279; i8 > 1; --i8) { > 43: shift <<= 6; Suggestion: shift <<= 6; >> This is to confuse the optimizer, so that "shift" is >> optimized to 0 only after loop vectorization.
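For illustration only (a minimal sketch, not part of the actual ShiftByZero.java test): the loop quoted above shifts `shift` left by 6 on each of its 278 iterations, so after six iterations every original bit of the 32-bit `int` has been shifted out and the value is 0, whatever `i32[0]` held. The class and method names below are hypothetical.

```java
class ShiftToZeroSketch {
    // Mirrors the loop shape in the quoted test: i8 runs from 279 down to 2,
    // i.e. 278 iterations, each shifting the int left by 6 bits.
    static int shiftLikeTest(int seed) {
        int shift = seed;
        for (int i8 = 279; i8 > 1; --i8) {
            shift <<= 6;
        }
        return shift;
    }

    // Counts how many 6-bit left shifts are needed before the value becomes 0.
    static int iterationsUntilZero(int seed) {
        int shift = seed;
        int n = 0;
        while (shift != 0) {
            shift <<= 6;
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(shiftLikeTest(-1));
        System.out.println(shiftLikeTest(12345));
        System.out.println(iterationsUntilZero(-1));
    }
}
```

The result is a compile-time-knowable 0, but only for an optimizer that reasons about the whole loop, which is presumably why the shift count is folded late.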
------------- PR: https://git.openjdk.org/jdk19/pull/40 From dlong at openjdk.org Thu Jun 23 08:09:36 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 23 Jun 2022 08:09:36 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v4] In-Reply-To: References: Message-ID: > The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out. Dean Long has updated the pull request incrementally with two additional commits since the last revision: - fix comment - Update test/hotspot/jtreg/compiler/codegen/ShiftByZero.java Co-authored-by: Andrew Haley ------------- Changes: - all: https://git.openjdk.org/jdk19/pull/40/files - new: https://git.openjdk.org/jdk19/pull/40/files/0f927ce7..583c54c0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=40&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=40&range=02-03 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk19/pull/40.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/40/head:pull/40 PR: https://git.openjdk.org/jdk19/pull/40 From ysuenaga at openjdk.org Thu Jun 23 08:11:59 2022 From: ysuenaga at openjdk.org (Yasumasa Suenaga) Date: Thu, 23 Jun 2022 08:11:59 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries [v4] In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 01:10:12 GMT, Yuta Sato wrote: >> When failing to load hsdis(Hot Spot Disassembler) library (because there is no library or hsdis.so is old and so on), >> there is no warning message (only can see info level messages if put -Xlog:os=info). >> This should show a warning message to tell the user that you failed to load libraries for hsdis. >> So I put a warning message to notify this. >> >> e.g. 
>> ` > > Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: > > change warning message Looks good. I will sponsor you if you need. ------------- Marked as reviewed by ysuenaga (Reviewer). PR: https://git.openjdk.org/jdk/pull/8782 From duke at openjdk.org Thu Jun 23 09:17:52 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Thu, 23 Jun 2022 09:17:52 GMT Subject: RFR: 8286314: Trampoline not created for far runtime targets outside small CodeCache In-Reply-To: References: Message-ID: On Wed, 22 Jun 2022 13:25:19 GMT, Andrew Haley wrote: > Looks good. Nice commenting too. Thank you for review and feedback. > I guess we'll need backports. Has this bug ever materialized on old releases? I have not found any bug reports. I guess most of applications run with 240M CodeCache. Applications at risk are those which turn off tiered compilation. By default they get 48M CodeCache. They might have been lucky the whole code is within 128M range. ------------- PR: https://git.openjdk.org/jdk/pull/9235 From duke at openjdk.org Thu Jun 23 09:17:53 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Thu, 23 Jun 2022 09:17:53 GMT Subject: RFR: 8286314: Trampoline not created for far runtime targets outside small CodeCache In-Reply-To: References: Message-ID: On Wed, 22 Jun 2022 17:05:37 GMT, Paul Hohensee wrote: > Lgtm. 
Thank you ------------- PR: https://git.openjdk.org/jdk/pull/9235 From bulasevich at openjdk.org Thu Jun 23 09:29:18 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 23 Jun 2022 09:29:18 GMT Subject: RFR: 8289044: ARM32: missing LIR_Assembler::cmove metadata type support Message-ID: Fixing ARM32 jtreg fails: - compiler/floatingpoint/TestFloatJNIArgs.java - runtime/logging/MonitorMismatchTest.java.MonitorMismatchTest # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (c1_LIRAssembler_arm.cpp:1482), pid=21156, tid=21171 # Error: ShouldNotReachHere() ------------- Commit messages: - 8289044: ARM32: missing LIR_Assembler::cmove metadata type support Changes: https://git.openjdk.org/jdk/pull/9258/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9258&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289044 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9258.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9258/head:pull/9258 PR: https://git.openjdk.org/jdk/pull/9258 From aph at openjdk.org Thu Jun 23 09:34:53 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 23 Jun 2022 09:34:53 GMT Subject: RFR: 8286314: Trampoline not created for far runtime targets outside small CodeCache In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 09:13:44 GMT, Evgeny Astigeevich wrote: > > I guess we'll need backports. Has this bug ever materialized on old releases? > > I have not found any bug reports. I guess most of applications run with 240M CodeCache. Applications at risk are those which turn off tiered compilation. By default they get 48M CodeCache. They might have been lucky the whole code is within 128M range. OK, so I'd probably not backport, then. I know it feels a bit weird to leave a known bug, but perhaps it's better to let sleeping dogs lie. What do you think? 
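To make the range argument in this thread concrete, here is a rough sketch in Java (not the HotSpot code; the class, the method names and the addresses are made up for illustration). It assumes a call site can reach targets within plus or minus 128M, and that for a target outside the CodeCache it is enough to check the two ends of the CodeCache, because the offset changes monotonically with the call-site address.

```java
class BranchRangeSketch {
    static final long M = 1024L * 1024L;
    // Assumed direct-branch range for the sketch (128M in either direction).
    static final long BRANCH_RANGE = 128 * M;

    // True if a branch located at 'branch' can reach 'target' directly.
    static boolean reachable(long branch, long target) {
        long offset = target - branch;
        return offset >= -BRANCH_RANGE && offset < BRANCH_RANGE;
    }

    // A target needs a trampoline if some call site in [ccLow, ccHigh)
    // could be out of range. Checking both ends covers every site between.
    static boolean needsTrampoline(long ccLow, long ccHigh, long target) {
        return !(reachable(ccLow, target) && reachable(ccHigh - 1, target));
    }

    public static void main(String[] args) {
        long ccLow = 0x40000000L;       // hypothetical CodeCache start
        long ccHigh = ccLow + 48 * M;   // 48M CodeCache, as without tiered compilation
        // A runtime target 100M above the start is still reachable from everywhere.
        System.out.println(needsTrampoline(ccLow, ccHigh, ccLow + 100 * M));
        // A runtime target 200M above the start is out of range from the low end.
        System.out.println(needsTrampoline(ccLow, ccHigh, ccLow + 200 * M));
    }
}
```

Under these assumptions, a 48M CodeCache only gets away without trampolines when the runtime target happens to sit close enough, which matches the "lucky" case described above.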
------------- PR: https://git.openjdk.org/jdk/pull/9235 From duke at openjdk.org Thu Jun 23 09:34:55 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Thu, 23 Jun 2022 09:34:55 GMT Subject: Integrated: 8286314: Trampoline not created for far runtime targets outside small CodeCache In-Reply-To: References: Message-ID: On Wed, 22 Jun 2022 11:50:36 GMT, Evgeny Astigeevich wrote: > `relocInfo::runtime_call_type` calls can have targets inside or outside CodeCache. If offsets to the targets are not in range, trampoline calls must be used. Currently trampolines for calls are generated based on the size of CodeCache and the maximum possible branch range. If the size of CodeCache is smaller than the range, no trampoline is generated. This works well if a target is inside CodeCache. It does not work if a target is outside CodeCache and CodeCache size is smaller than the maximum possible branch range. > > ### Solution > Runtime call sites are always in CodeCache. In case of a target outside small CodeCache, we can find the start of the longest possible branch from CodeCache. Then we check with `reachable_from_branch_at` whether a target is reachable. If not, a trampoline is needed. > > ### Testing > It is tested with the release and fastdebug builds. Release builds have the maximum possible branch range 128M. Fastdebug builds have it 2M for the purpose of testing trampolines. > Results: > - `gtest`: Passed > - `tier1`: Passed This pull request has now been integrated. 
Changeset: bf0623b1 Author: Evgeny Astigeevich Committer: Andrew Haley URL: https://git.openjdk.org/jdk/commit/bf0623b11fd95f09fe953822af71d965bdab8d0f Stats: 17 lines in 1 file changed: 15 ins; 0 del; 2 mod 8286314: Trampoline not created for far runtime targets outside small CodeCache Reviewed-by: aph, phh ------------- PR: https://git.openjdk.org/jdk/pull/9235 From chagedorn at openjdk.org Thu Jun 23 09:44:21 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 23 Jun 2022 09:44:21 GMT Subject: RFR: 8288669: compiler/vectorapi/VectorFPtoIntCastTest.java still fails with "IRViolationException: There were one or multiple IR rule failures." Message-ID: IR matching fails when run with `-XX:-TieredCompilation` because we don't have enough profiling information. There is an RFE ([JDK-8276547](https://bugs.openjdk.org/browse/JDK-8276547)) that should change the fixed default warm-up into a compilation policy based warm-up to prevent such test failures in the future. The fix for this test is to just increase the default warm-up for now. Testing: - tier1-4 flags for VectorFPtoIntCastTest.java - repeated runs with tier3 flags (includes `-XX:-TieredCompilation`) Thanks, Christian ------------- Commit messages: - 8288669: compiler/vectorapi/VectorFPtoIntCastTest.java still fails with "IRViolationException: There were one or multiple IR rule failures." 
Changes: https://git.openjdk.org/jdk/pull/9259/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9259&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288669 Stats: 11 lines in 1 file changed: 3 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/9259.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9259/head:pull/9259 PR: https://git.openjdk.org/jdk/pull/9259 From thartmann at openjdk.org Thu Jun 23 10:19:56 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 23 Jun 2022 10:19:56 GMT Subject: RFR: 8288669: compiler/vectorapi/VectorFPtoIntCastTest.java still fails with "IRViolationException: There were one or multiple IR rule failures." In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 09:36:41 GMT, Christian Hagedorn wrote: > IR matching fails when run with `-XX:-TieredCompilation` because we don't have enough profiling information. There is an RFE ([JDK-8276547](https://bugs.openjdk.org/browse/JDK-8276547)) that should change the fixed default warm-up into a compilation policy based warm-up to prevent such test failures in the future. The fix for this test is to just increase the default warm-up for now. > > Testing: > - tier1-4 flags for VectorFPtoIntCastTest.java > - repeated runs with tier3 flags (includes `-XX:-TieredCompilation`) > > Thanks, > Christian Looks good and trivial. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9259 From duke at openjdk.org Thu Jun 23 10:26:54 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Thu, 23 Jun 2022 10:26:54 GMT Subject: RFR: 8286314: Trampoline not created for far runtime targets outside small CodeCache In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 09:31:13 GMT, Andrew Haley wrote: > OK, so I'd probably not backport, then. I know it feels a bit weird to leave a known bug, but perhaps it's better to let sleeping dogs lie. What do you think? We have `guarantee` which will catch the bug. 
However, ASLR might make it difficult to reproduce the bug. I think it is not critical for 8 and 11. We are working to make JVM performance with 240M CodeCache similar to the performance with smaller CodeCache. This will decrease the chances that people will use a small CodeCache. We will be backporting the enhancement to 17. It looks to me like the risks are mitigated in old releases. There is no need for backports. ------------- PR: https://git.openjdk.org/jdk/pull/9235 From rkennke at openjdk.org Thu Jun 23 11:56:17 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 23 Jun 2022 11:56:17 GMT Subject: RFR: 8287227: Shenandoah: A couple of virtual thread tests failed with iu mode even without Loom enabled. [v2] In-Reply-To: References: Message-ID: On Fri, 3 Jun 2022 11:59:14 GMT, Roland Westrelin wrote: >> With JDK-8277654, the load barrier slow path call doesn't produce raw >> memory anymore but the IU barrier call still does. I propose removing >> raw memory for that call too which also causes the assert that fails >> to be removed. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - new fix > - Merge branch 'master' into JDK-8287227 > - Revert "fix" > > This reverts commit aa6f80a7883ee7032f81dbffac5d0257491d7118. > - fix Looks good to me! Thank you! ------------- Marked as reviewed by rkennke (Reviewer). PR: https://git.openjdk.org/jdk/pull/8958 From thartmann at openjdk.org Thu Jun 23 12:44:13 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 23 Jun 2022 12:44:13 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v13] In-Reply-To: References: Message-ID: On Tue, 21 Jun 2022 08:04:55 GMT, Xin Liu wrote: >> I found that some phi nodes are useful because their uses are uncommon_trap nodes.
In the worst case, it would hinder boxing object elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows the C2 parser to collect liveness based on the next bci for unstable_if traps. In most cases, the next bci is closer to exits, so fewer locals are live. It helps to reduce the number of inputs of unstable_if traps. >> >> This is not a REDO of Optimization of Box nodes in uncommon_trap (JDK-8261137). The two patches are orthogonal. I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of the credit goes to the original author. I found that scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in the original test even if it can't be removed by the later inliner. I tweaked the profiling data of Integer.valueOf() to hinder scalar replacement. >> >> This patch can cover the problem discovered by JDK-8261137. I ran the microbenchmark and got a 9x speedup on x86_64. It's almost the same as JDK-8261137. Besides runtime, the codecache utilization is reduced from 1648 bytes to 1192 bytes, or 27.6% >> >> Before: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 32.776 ±
0.075 us/op >> >> Compiled method (c2) 281 636 4 MyBenchmark::testMethod (50 bytes) >> total in heap [0x00007fa1e49ab510,0x00007fa1e49abb80] = 1648 >> relocation [0x00007fa1e49ab670,0x00007fa1e49ab6b0] = 64 >> main code [0x00007fa1e49ab6c0,0x00007fa1e49ab940] = 640 >> stub code [0x00007fa1e49ab940,0x00007fa1e49ab968] = 40 >> oops [0x00007fa1e49ab968,0x00007fa1e49ab978] = 16 >> metadata [0x00007fa1e49ab978,0x00007fa1e49ab990] = 24 >> scopes data [0x00007fa1e49ab990,0x00007fa1e49aba60] = 208 >> scopes pcs [0x00007fa1e49aba60,0x00007fa1e49abb30] = 208 >> dependencies [0x00007fa1e49abb30,0x00007fa1e49abb38] = 8 >> handler table [0x00007fa1e49abb38,0x00007fa1e49abb68] = 48 >> nul chk table [0x00007fa1e49abb68,0x00007fa1e49abb80] = 24 >> >> After: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op >> >> Compiled method (c2) 288 633 4 MyBenchmark::testMethod (50 bytes) >> total in heap [0x00007f35189ab010,0x00007f35189ab4b8] = 1192 >> relocation [0x00007f35189ab170,0x00007f35189ab1a0] = 48 >> main code [0x00007f35189ab1a0,0x00007f35189ab360] = 448 >> stub code [0x00007f35189ab360,0x00007f35189ab388] = 40 >> oops [0x00007f35189ab388,0x00007f35189ab390] = 8 >> metadata [0x00007f35189ab390,0x00007f35189ab398] = 8 >> scopes data [0x00007f35189ab398,0x00007f35189ab408] = 112 >> scopes pcs [0x00007f35189ab408,0x00007f35189ab488] = 128 >> dependencies [0x00007f35189ab488,0x00007f35189ab490] = 8 >> handler table [0x00007f35189ab490,0x00007f35189ab4a8] = 24 >> nul chk table [0x00007f35189ab4a8,0x00007f35189ab4b8] = 16 >> ``` >> >> Testing >> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. > > Xin Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 29 additional commits since the last revision: > > - update per reviewer's feedback.
> > also changed the option from AggressiveLivenessForUnstableIf to > OptimizeUnstableIf. > - Merge branch 'master' into JDK-8286104 > - monior change for code style. > - Bail out if fold-compares sees that a unstable_if trap has modified. > > Also add a regression test > - Merge branch 'master' into JDK-8286104 > - Remame all methods to _unstable_if_trap(s) and group them. > - move preprocess() after remove Useless. > - Refactor per reviewer's feedback. > - Remove useless flag. if jdwp is on, liveness_at_bci() marks all local > variables live. > - support option AggressiveLivessForUnstableIf > - ... and 19 more: https://git.openjdk.org/jdk/compare/0eb61fef...e5c8e559 Thanks for the clarifications. You added the same test twice with a slightly different name. src/hotspot/share/opto/compile.cpp line 1954: > 1952: } > 1953: > 1954: // keep the mondified for late query Suggestion: // keep the modified trap for late query src/hotspot/share/opto/parse.hpp line 613: > 611: // Parse::_blocks outlive Parse object itself. > 612: // They are reclaimed by ResourceMark in CompileBroker::invoke_compiler_on_method(). > 613: Parse::Block* const _path; // the pruned path Do we really need to keep track of the entire Block? Looks like we could just save next_bci as int. 
test/hotspot/jtreg/compiler/c2/irTests/TestOptimizeForUnstableIf.java line 32: > 30: * @test > 31: * @bug 8286104 > 32: * @summary Test C2 uses aggressive liveness to get rid of the boxing object which is Suggestion: * @summary Test that C2 uses aggressive liveness to get rid of the boxing object which is test/hotspot/jtreg/compiler/c2/irTests/TestOptimizeUnstableIf.java line 32: > 30: * @test > 31: * @bug 8286104 > 32: * @summary Test C2 uses aggressive liveness to get rid of the boxing object which is Suggestion: * @summary Test that C2 uses aggressive liveness to get rid of the boxing object which is ------------- PR: https://git.openjdk.org/jdk/pull/8545 From aph at openjdk.org Thu Jun 23 15:00:19 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 23 Jun 2022 15:00:19 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler Message-ID: All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers. Here's an example of what was happening: ` rax->encoding();` Where rax is defined as `(Register *)0`. This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl.
typedef const RegisterImpl* Register; extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1]; inline constexpr Register RegisterImpl::first() { return all_Registers + 1; } inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; } constexpr Register rax = as_Register(0); ------------- Commit messages: - Compiles - Compiles - Compiles Changes: https://git.openjdk.org/jdk/pull/9261/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289046 Stats: 78 lines in 5 files changed: 29 ins; 9 del; 40 mod Patch: https://git.openjdk.org/jdk/pull/9261.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9261/head:pull/9261 PR: https://git.openjdk.org/jdk/pull/9261 From epeter at openjdk.org Thu Jun 23 15:00:22 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 23 Jun 2022 15:00:22 GMT Subject: RFR: 8288897: Clean up node dump code Message-ID: I recently did some work in the area of `Node::dump` and `Node::find`, see [JDK-8287647](https://bugs.openjdk.org/browse/JDK-8287647) and [JDK-8283775](https://bugs.openjdk.org/browse/JDK-8283775). This change set cleans up the surrounding code and tries to reduce code duplication. Things I did: - remove Node::related. It was added 7 years ago, with [JDK-8004073](https://bugs.openjdk.org/browse/JDK-8004073). However, it was not extended to many nodes, and hence it is incomplete, and nobody I know seems to use it. - refactor `dump(int)` to use `dump_bfs` (reduce code duplication). - redefine categories in `dump_bfs`, focusing on output types. Mixed type is now also control if it has control output, and memory if it has memory output, etc. Plus, a node is also in the control category if it `is_CFG`. This makes `dump_bfs` much more usable, to traverse control and memory flow.
- Other small cleanups, like replacing rarely used dump functions with dump, removing dead code, making some functions private - Adding `call from debugger` comments to VM functions that are useful in a debugger ------------- Commit messages: - cleanup, move debug functions to cpp to prevent inlining, add comment for debugger functions - make dump_bfs const, change datastructures, change some signatures to const - refactor dump to use dump_bfs, redefine categories through output types - 8288897: Clean up dump code for nodes Changes: https://git.openjdk.org/jdk/pull/9234/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9234&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288897 Stats: 649 lines in 17 files changed: 40 ins; 525 del; 84 mod Patch: https://git.openjdk.org/jdk/pull/9234.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9234/head:pull/9234 PR: https://git.openjdk.org/jdk/pull/9234 From shade at openjdk.org Thu Jun 23 15:20:49 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 23 Jun 2022 15:20:49 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 14:52:54 GMT, Andrew Haley wrote: > All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers. > > Here's an example of what was happening: > > ` rax->encoding();` > > Where rax is defined as `(Register *)0`. > > This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl.
> > > typedef const RegisterImpl* Register; > extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1]; > inline constexpr Register RegisterImpl::first() { return all_Registers + 1; } > inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; } > constexpr Register rax = as_Register(0); Cursory review: src/hotspot/cpu/x86/globalDefinitions_x86.hpp line 78: > 76: #endif > 77: > 78: #define USE_POINTERS_TO_REGISTER_IMPL_ARRAY So, what's the use for this symbol? I see AArch64 code conditionalizes definition macros, but this patch does not have such conditionalization. Should it? src/hotspot/cpu/x86/register_x86.hpp line 70: > 68: // accessors > 69: int raw_encoding() const { return this - first(); } > 70: int encoding() const { assert(is_valid(), "invalid register"); return (this - first()); } Suggestion: int encoding() const { assert(is_valid(), "invalid register"); return raw_encoding(); } src/hotspot/cpu/x86/register_x86.hpp line 76: > 74: }; > 75: > 76: // The implementation of integer registers for the ia32 architecture This comment seems unrelated, copy-paste error? src/hotspot/cpu/x86/register_x86.hpp line 130: > 128: // accessors > 129: int raw_encoding() const { return this - first(); } > 130: int encoding() const { assert(is_valid(), "invalid register"); return this - first(); } Suggestion: int encoding() const { assert(is_valid(), "invalid register"); return raw_encoding(); } src/hotspot/cpu/x86/register_x86.hpp line 171: > 169: // accessors > 170: int raw_encoding() const { return this - first(); } > 171: int encoding() const { assert(is_valid(), "invalid register (%d)", (int)raw_encoding() ); return raw_encoding(); } In the other cases, we don't do this kind of printing assert. Stick to one style? I think it is fine to have a shorter assert, because `!is_valid()` basically implies `noreg`? ------------- Changes requested by shade (Reviewer).
PR: https://git.openjdk.org/jdk/pull/9261 From duke at openjdk.org Thu Jun 23 15:35:53 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Thu, 23 Jun 2022 15:35:53 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 14:52:54 GMT, Andrew Haley wrote: > All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers. > > Here's an example of what was happening: > > ` rax->encoding();` > > Where rax is defined as `(Register *)0`. > > This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl. > > > typedef const RegisterImpl* Register; > extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1]; > inline constexpr Register RegisterImpl::first() { return all_Registers + 1; } > inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; } > constexpr Register rax = as_Register(0); I believe there are some compiler directives somewhere to silence the compiler about `nullptr` dereferences; should we delete those as well? ------------- PR: https://git.openjdk.org/jdk/pull/9261 From dcubed at openjdk.org Thu Jun 23 20:10:36 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Thu, 23 Jun 2022 20:10:36 GMT Subject: RFR: 8288669: compiler/vectorapi/VectorFPtoIntCastTest.java still fails with "IRViolationException: There were one or multiple IR rule failures." In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 09:36:41 GMT, Christian Hagedorn wrote: > IR matching fails when run with `-XX:-TieredCompilation` because we don't have enough profiling information.
There is an RFE ([JDK-8276547](https://bugs.openjdk.org/browse/JDK-8276547)) that should change the fixed default warm-up into a compilation policy based warm-up to prevent such test failures in the future. The fix for this test is to just increase the default warm-up for now. > > Testing: > - tier1-4 flags for VectorFPtoIntCastTest.java > - repeated runs with tier3 flags (includes `-XX:-TieredCompilation`) > > Thanks, > Christian Thumbs up. I also agree that this is a trivial fix. ------------- Marked as reviewed by dcubed (Reviewer). PR: https://git.openjdk.org/jdk/pull/9259 From duke at openjdk.org Thu Jun 23 22:38:06 2022 From: duke at openjdk.org (=?UTF-8?B?VG9tw6HFoQ==?= Zezula) Date: Thu, 23 Jun 2022 22:38:06 GMT Subject: RFR: JDK-8288121: [JVMCI] Re-export the TerminatingThreadLocal functionality to the graal compiler. Message-ID: JVMCI compilers need to release resources tied to a thread-local variable when the associated thread is exiting. The JDK internally uses the jdk.internal.misc.TerminatingThreadLocal class for this purpose. This pull request re-exports the TerminatingThreadLocal functionality to JVMCI compilers. ------------- Commit messages: - JDK-8288121: Re-export the jdk.internal.misc.TerminatingThreadLocal functionality to the jvmci compiler. 
Changes: https://git.openjdk.org/jdk/pull/9107/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9107&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288121 Stats: 32 lines in 1 file changed: 32 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9107.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9107/head:pull/9107 PR: https://git.openjdk.org/jdk/pull/9107 From xliu at openjdk.org Thu Jun 23 22:43:09 2022 From: xliu at openjdk.org (Xin Liu) Date: Thu, 23 Jun 2022 22:43:09 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v13] In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 12:31:23 GMT, Tobias Hartmann wrote: >> Xin Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 29 additional commits since the last revision: >> >> - update per reviewer's feedback. >> >> also changed the option from AggressiveLivenessForUnstableIf to >> OptimizeUnstableIf. >> - Merge branch 'master' into JDK-8286104 >> - monior change for code style. >> - Bail out if fold-compares sees that a unstable_if trap has modified. >> >> Also add a regression test >> - Merge branch 'master' into JDK-8286104 >> - Remame all methods to _unstable_if_trap(s) and group them. >> - move preprocess() after remove Useless. >> - Refactor per reviewer's feedback. >> - Remove useless flag. if jdwp is on, liveness_at_bci() marks all local >> variables live. >> - support option AggressiveLivessForUnstableIf >> - ... and 19 more: https://git.openjdk.org/jdk/compare/500b937a...e5c8e559 > > src/hotspot/share/opto/parse.hpp line 613: > >> 611: // Parse::_blocks outlive Parse object itself. >> 612: // They are reclaimed by ResourceMark in CompileBroker::invoke_compiler_on_method(). >> 613: Parse::Block* const _path; // the pruned path > > Do we really need to keep track of the entire Block? 
Looks like we could just save next_bci as int. Good point. I used path() in the second patch. Now I store those traps in their basic blocks, so I don't need to remember them. ------------- PR: https://git.openjdk.org/jdk/pull/8545 From xliu at openjdk.org Thu Jun 23 23:08:20 2022 From: xliu at openjdk.org (Xin Liu) Date: Thu, 23 Jun 2022 23:08:20 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v14] In-Reply-To: References: Message-ID: <_6iPSDvWGj8uGcVNGdwhRBa23bCVOVaMsUhY0crvxYM=.112ba1de-6a1a-417c-8446-3413a6ab8157@github.com> > I found that some phi nodes are useful because their uses are uncommon_trap nodes. In the worst case, it would hinder boxing object elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows the C2 parser to collect liveness based on the next bci for unstable_if traps. In most cases, the next bci is closer to exits, so fewer locals are live. It helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap (JDK-8261137). The two patches are orthogonal. I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of the credit goes to the original author. I found that scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in the original test even if it can't be removed by the later inliner. I tweaked the profiling data of Integer.valueOf() to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbenchmark and got a 9x speedup on x86_64. It's almost the same as JDK-8261137.
Besides runtime, the codecache utilization reduces from 1648 bytes to 1192 bytes, or 27.6% > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op > > Compiled method (c2) 281 636 4 MyBenchmark::testMethod (50 bytes) > total in heap [0x00007fa1e49ab510,0x00007fa1e49abb80] = 1648 > relocation [0x00007fa1e49ab670,0x00007fa1e49ab6b0] = 64 > main code [0x00007fa1e49ab6c0,0x00007fa1e49ab940] = 640 > stub code [0x00007fa1e49ab940,0x00007fa1e49ab968] = 40 > oops [0x00007fa1e49ab968,0x00007fa1e49ab978] = 16 > metadata [0x00007fa1e49ab978,0x00007fa1e49ab990] = 24 > scopes data [0x00007fa1e49ab990,0x00007fa1e49aba60] = 208 > scopes pcs [0x00007fa1e49aba60,0x00007fa1e49abb30] = 208 > dependencies [0x00007fa1e49abb30,0x00007fa1e49abb38] = 8 > handler table [0x00007fa1e49abb38,0x00007fa1e49abb68] = 48 > nul chk table [0x00007fa1e49abb68,0x00007fa1e49abb80] = 24 > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op > > Compiled method (c2) 288 633 4 MyBenchmark::testMethod (50 bytes) > total in heap [0x00007f35189ab010,0x00007f35189ab4b8] = 1192 > relocation [0x00007f35189ab170,0x00007f35189ab1a0] = 48 > main code [0x00007f35189ab1a0,0x00007f35189ab360] = 448 > stub code [0x00007f35189ab360,0x00007f35189ab388] = 40 > oops [0x00007f35189ab388,0x00007f35189ab390] = 8 > metadata [0x00007f35189ab390,0x00007f35189ab398] = 8 > scopes data [0x00007f35189ab398,0x00007f35189ab408] = 112 > scopes pcs [0x00007f35189ab408,0x00007f35189ab488] = 128 > dependencies [0x00007f35189ab488,0x00007f35189ab490] = 8 > handler table [0x00007f35189ab490,0x00007f35189ab4a8] = 24 > nul chk table [0x00007f35189ab4a8,0x00007f35189ab4b8] = 16 > ``` > > Testing > I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. Xin Liu has updated the pull request incrementally with one additional commit since the last revision: remove _path from UnstableIfTrap.
remember _next_bci(int) is enough. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/8545/files - new: https://git.openjdk.org/jdk/pull/8545/files/e5c8e559..49bcc410 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=8545&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=8545&range=12-13 Stats: 83 lines in 4 files changed: 2 ins; 76 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/8545.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8545/head:pull/8545 PR: https://git.openjdk.org/jdk/pull/8545 From duke at openjdk.org Fri Jun 24 00:16:14 2022 From: duke at openjdk.org (Yi-Fan Tsai) Date: Fri, 24 Jun 2022 00:16:14 GMT Subject: RFR: 8289071: Compute stub sizes outside of locks Message-ID: 8289071: Compute stub sizes outside of locks ------------- Commit messages: - 8289071: Compute stub sizes outside of locks Changes: https://git.openjdk.org/jdk/pull/9266/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9266&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289071 Stats: 25 lines in 2 files changed: 8 ins; 7 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/9266.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9266/head:pull/9266 PR: https://git.openjdk.org/jdk/pull/9266 From haosun at openjdk.org Fri Jun 24 02:02:00 2022 From: haosun at openjdk.org (Hao Sun) Date: Fri, 24 Jun 2022 02:02:00 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v4] In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 08:09:36 GMT, Dean Long wrote: >> The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out. 
> > Dean Long has updated the pull request incrementally with two additional commits since the last revision: > > - fix comment > - Update test/hotspot/jtreg/compiler/codegen/ShiftByZero.java > > Co-authored-by: Andrew Haley test/hotspot/jtreg/compiler/codegen/ShiftByZero.java line 43: > 41: int shift = i32[0]; > 42: // This loop is to confuse the optimizer, so that "shift" is > 43: // optimized to 0 only after loop vectorization. It seems that we should remove the trailing whitespace, as suggested by `jcheck`. Suggestion: // optimized to 0 only after loop vectorization. ------------- PR: https://git.openjdk.org/jdk19/pull/40 From duke at openjdk.org Fri Jun 24 02:27:39 2022 From: duke at openjdk.org (Yuta Sato) Date: Fri, 24 Jun 2022 02:27:39 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries [v5] In-Reply-To: References: Message-ID: > When failing to load hsdis(Hot Spot Disassembler) library (because there is no library or hsdis.so is old and so on), > there is no warning message (only can see info level messages if put -Xlog:os=info). > This should show a warning message to tell the user that you failed to load libraries for hsdis. > So I put a warning message to notify this. > > e.g. 
> ` Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: Update full name ------------- Changes: - all: https://git.openjdk.org/jdk/pull/8782/files - new: https://git.openjdk.org/jdk/pull/8782/files/58724253..cc6512c0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=8782&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=8782&range=03-04 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/8782.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8782/head:pull/8782 PR: https://git.openjdk.org/jdk/pull/8782 From duke at openjdk.org Fri Jun 24 02:37:36 2022 From: duke at openjdk.org (Yuta Sato) Date: Fri, 24 Jun 2022 02:37:36 GMT Subject: RFR: 8286990: Add compiler name to warning messages in Compiler Directive [v3] In-Reply-To: References: Message-ID: <1Tv6-knpm8vnHCpznErIOJ_AKOfbK67sWAUq6xdKltw=.9dff2833-5387-43d2-9081-96d0959ebfe5@github.com> On Tue, 31 May 2022 03:22:46 GMT, Yuta Sato wrote: >> When using Compiler Directive such as `java -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile= ` , >> it shows totally the same message for c1 and c2 compiler and the user would be confused about >> which compiler is affected by this message. >> This should show messages with their compiler name so that the user knows which compiler shows this message. >> >> My change result would be like the below. 
>> >> >> OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output >> OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output >> >> -> >> >> OpenJDK 64-Bit Server VM warning: c1: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output >> OpenJDK 64-Bit Server VM warning: c2: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > > Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: > > add const to method Thank you for reviewing and sponsoring !! ------------- PR: https://git.openjdk.org/jdk/pull/8591 From duke at openjdk.org Fri Jun 24 06:17:03 2022 From: duke at openjdk.org (Yuta Sato) Date: Fri, 24 Jun 2022 06:17:03 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries [v5] In-Reply-To: References: Message-ID: On Thu, 19 May 2022 19:29:28 GMT, Vladimir Kozlov wrote: >> Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: >> >> Update full name > > Can you put the warning into `dll_load()`? > We already print messages there with `-XX:+Vebose` (unfortunately it is available only in debug VM). > Actually consider replacing print statements and `Verbose` check there with UL. @vnkozlov If you have some time, could you review again this? As JohnTortugo also said, I believe `load library` is the best place to show the error message. ------------- PR: https://git.openjdk.org/jdk/pull/8782 From chagedorn at openjdk.org Fri Jun 24 07:29:56 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 24 Jun 2022 07:29:56 GMT Subject: RFR: 8288669: compiler/vectorapi/VectorFPtoIntCastTest.java still fails with "IRViolationException: There were one or multiple IR rule failures." 
In-Reply-To: References: Message-ID: <-J02WBqw_G3bpF4XwWFe_hV5DX2XqdtZhO900Qj6GLA=.4afcfd7a-ca4d-414d-be83-fd395636580f@github.com> On Thu, 23 Jun 2022 09:36:41 GMT, Christian Hagedorn wrote: > IR matching fails when run with `-XX:-TieredCompilation` because we don't have enough profiling information. There is an RFE ([JDK-8276547](https://bugs.openjdk.org/browse/JDK-8276547)) that should change the fixed default warm-up into a compilation policy based warm-up to prevent such test failures in the future. The fix for this test is to just increase the default warm-up for now. > > Testing: > - tier1-4 flags for VectorFPtoIntCastTest.java > - repeated runs with tier3 flags (includes `-XX:-TieredCompilation`) > > Thanks, > Christian Thanks Tobias and Daniel for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/9259 From chagedorn at openjdk.org Fri Jun 24 07:31:43 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 24 Jun 2022 07:31:43 GMT Subject: Integrated: 8288669: compiler/vectorapi/VectorFPtoIntCastTest.java still fails with "IRViolationException: There were one or multiple IR rule failures." In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 09:36:41 GMT, Christian Hagedorn wrote: > IR matching fails when run with `-XX:-TieredCompilation` because we don't have enough profiling information. There is an RFE ([JDK-8276547](https://bugs.openjdk.org/browse/JDK-8276547)) that should change the fixed default warm-up into a compilation policy based warm-up to prevent such test failures in the future. The fix for this test is to just increase the default warm-up for now. > > Testing: > - tier1-4 flags for VectorFPtoIntCastTest.java > - repeated runs with tier3 flags (includes `-XX:-TieredCompilation`) > > Thanks, > Christian This pull request has now been integrated. 
Changeset: 17aacde5 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/17aacde50fb971bc686825772e29f6bfecadabda Stats: 11 lines in 1 file changed: 3 ins; 0 del; 8 mod 8288669: compiler/vectorapi/VectorFPtoIntCastTest.java still fails with "IRViolationException: There were one or multiple IR rule failures." Reviewed-by: thartmann, dcubed ------------- PR: https://git.openjdk.org/jdk/pull/9259 From dlong at openjdk.org Fri Jun 24 07:34:57 2022 From: dlong at openjdk.org (Dean Long) Date: Fri, 24 Jun 2022 07:34:57 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v5] In-Reply-To: References: Message-ID: > The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out. Dean Long has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/codegen/ShiftByZero.java Co-authored-by: Hao Sun ------------- Changes: - all: https://git.openjdk.org/jdk19/pull/40/files - new: https://git.openjdk.org/jdk19/pull/40/files/583c54c0..f4761fbf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=40&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=40&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk19/pull/40.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/40/head:pull/40 PR: https://git.openjdk.org/jdk19/pull/40 From thartmann at openjdk.org Fri Jun 24 07:40:00 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 24 Jun 2022 07:40:00 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v14] In-Reply-To: <_6iPSDvWGj8uGcVNGdwhRBa23bCVOVaMsUhY0crvxYM=.112ba1de-6a1a-417c-8446-3413a6ab8157@github.com> References: 
<_6iPSDvWGj8uGcVNGdwhRBa23bCVOVaMsUhY0crvxYM=.112ba1de-6a1a-417c-8446-3413a6ab8157@github.com> Message-ID: <-O3tPwzkyhgV5VdSHpatagIaoWYwN3Vs6xOtMbB5O9I=.de0bf65a-9b2c-486b-bb1c-d1a8b0220ebb@github.com> On Thu, 23 Jun 2022 23:08:20 GMT, Xin Liu wrote: >> I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. >> >> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. >> >> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. Besides runtime, the codecache utilization reduces from 1648 bytes to 1192 bytes, or 27.6% >> >> Before: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 32.776 ±
0.075 us/op >> >> Compiled method (c2) 281 636 4 MyBenchmark::testMethod (50 bytes) >> total in heap [0x00007fa1e49ab510,0x00007fa1e49abb80] = 1648 >> relocation [0x00007fa1e49ab670,0x00007fa1e49ab6b0] = 64 >> main code [0x00007fa1e49ab6c0,0x00007fa1e49ab940] = 640 >> stub code [0x00007fa1e49ab940,0x00007fa1e49ab968] = 40 >> oops [0x00007fa1e49ab968,0x00007fa1e49ab978] = 16 >> metadata [0x00007fa1e49ab978,0x00007fa1e49ab990] = 24 >> scopes data [0x00007fa1e49ab990,0x00007fa1e49aba60] = 208 >> scopes pcs [0x00007fa1e49aba60,0x00007fa1e49abb30] = 208 >> dependencies [0x00007fa1e49abb30,0x00007fa1e49abb38] = 8 >> handler table [0x00007fa1e49abb38,0x00007fa1e49abb68] = 48 >> nul chk table [0x00007fa1e49abb68,0x00007fa1e49abb80] = 24 >> >> After: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op >> >> Compiled method (c2) 288 633 4 MyBenchmark::testMethod (50 bytes) >> total in heap [0x00007f35189ab010,0x00007f35189ab4b8] = 1192 >> relocation [0x00007f35189ab170,0x00007f35189ab1a0] = 48 >> main code [0x00007f35189ab1a0,0x00007f35189ab360] = 448 >> stub code [0x00007f35189ab360,0x00007f35189ab388] = 40 >> oops [0x00007f35189ab388,0x00007f35189ab390] = 8 >> metadata [0x00007f35189ab390,0x00007f35189ab398] = 8 >> scopes data [0x00007f35189ab398,0x00007f35189ab408] = 112 >> scopes pcs [0x00007f35189ab408,0x00007f35189ab488] = 128 >> dependencies [0x00007f35189ab488,0x00007f35189ab490] = 8 >> handler table [0x00007f35189ab490,0x00007f35189ab4a8] = 24 >> nul chk table [0x00007f35189ab4a8,0x00007f35189ab4b8] = 16 >> ``` >> >> Testing >> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. > > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > remove _path from UnstableIfTrap. remember _next_bci(int) is enough. Looks good! Let me re-submit testing. I'll report back once it passed.
------------- PR: https://git.openjdk.org/jdk/pull/8545 From aph at openjdk.org Fri Jun 24 08:05:59 2022 From: aph at openjdk.org (Andrew Haley) Date: Fri, 24 Jun 2022 08:05:59 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v2] In-Reply-To: References: Message-ID: <_SU_km9apP4JGGdPbD988Qn_RuvTnIdPUrX2gGbgrb8=.e4cffd1b-016d-4824-9fac-fd3ba53afa47@github.com> > All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers. > > Here's an example of what was happening: > > ` rax->encoding();` > > Where rax is defined as `(Register *)0`. > > This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl. > > > typedef const RegisterImpl* Register; > extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1] ; > inline constexpr Register RegisterImpl::first() { return all_Registers + 1; }; > inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; } > constexpr Register rax = as_Register(0); Andrew Haley has updated the pull request incrementally with two additional commits since the last revision: - Update src/hotspot/cpu/x86/register_x86.hpp Co-authored-by: Aleksey Shipilëv - Update src/hotspot/cpu/x86/register_x86.hpp Co-authored-by: Aleksey Shipilëv ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9261/files - new: https://git.openjdk.org/jdk/pull/9261/files/718e210f..36ba30bc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9261.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9261/head:pull/9261 PR: https://git.openjdk.org/jdk/pull/9261 From aph at openjdk.org Fri Jun 24 08:06:00
2022 From: aph at openjdk.org (Andrew Haley) Date: Fri, 24 Jun 2022 08:06:00 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 15:32:21 GMT, Quan Anh Mai wrote: > I believe there are some compiler directives somewhere to silence the compiler about `nullptr` dereferences, should we delete those also? Not yet. I'm going through the VM to find null pointer dereferences. Once I've got them all, then I'll delete the warning. ------------- PR: https://git.openjdk.org/jdk/pull/9261 From aph at openjdk.org Fri Jun 24 08:06:05 2022 From: aph at openjdk.org (Andrew Haley) Date: Fri, 24 Jun 2022 08:06:05 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v2] In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 15:17:22 GMT, Aleksey Shipilev wrote: >> Andrew Haley has updated the pull request incrementally with two additional commits since the last revision: >> >> - Update src/hotspot/cpu/x86/register_x86.hpp >> >> Co-authored-by: Aleksey Shipilëv >> - Update src/hotspot/cpu/x86/register_x86.hpp >> >> Co-authored-by: Aleksey Shipilëv > > src/hotspot/cpu/x86/globalDefinitions_x86.hpp line 78: > >> 76: #endif >> 77: >> 78: #define USE_POINTERS_TO_REGISTER_IMPL_ARRAY > > So, what's the use for this symbol? I see AArch64 code conditionalize definition macros, but this patch does not have such conditionalization. Should it? It means we pick up the correct definitions of `CONSTANT_REGISTER_DECLARATION` et al. > src/hotspot/cpu/x86/register_x86.hpp line 171: > >> 169: // accessors >> 170: int raw_encoding() const { return this - first(); } >> 171: int encoding() const { assert(is_valid(), "invalid register (%d)", (int)raw_encoding() ); return raw_encoding(); } > > In the other cases, we don't do this kind of printing assert. Stick to one style? I think it is fine to have a shorter assert, because `!is_valid()` basically implies `noreg`? OK.
------------- PR: https://git.openjdk.org/jdk/pull/9261 From shade at openjdk.org Fri Jun 24 08:12:00 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 24 Jun 2022 08:12:00 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v2] In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 07:57:15 GMT, Andrew Haley wrote: >> src/hotspot/cpu/x86/globalDefinitions_x86.hpp line 78: >> >>> 76: #endif >>> 77: >>> 78: #define USE_POINTERS_TO_REGISTER_IMPL_ARRAY >> >> So, what's the use for this symbol? I see AArch64 code conditionalize definition macros, but this patch does not have such conditionalization. Should it? > > It means we pick up the correct definitions of `CONSTANT_REGISTER_DECLARATION` et al. Ah, those definitions are in shared code! Nevermind then. ------------- PR: https://git.openjdk.org/jdk/pull/9261 From roland at openjdk.org Fri Jun 24 08:24:35 2022 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 24 Jun 2022 08:24:35 GMT Subject: RFR: 8288968: C2: remove use of covariant returns in type.[ch]pp In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 07:17:18 GMT, Christian Hagedorn wrote: > > Why don't we just change the hotspot codestyle instead? This patch shows that the feature is useful in type.hpp and it does simplify code. > > I was wondering the same thing. The code becomes much cleaner if we would allow this. I'll propose a change to the coding style. ------------- PR: https://git.openjdk.org/jdk/pull/9237 From chagedorn at openjdk.org Fri Jun 24 09:01:53 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 24 Jun 2022 09:01:53 GMT Subject: RFR: 8288968: C2: remove use of covariant returns in type.[ch]pp In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 08:21:18 GMT, Roland Westrelin wrote: > > > Why don't we just change the hotspot codestyle instead? This patch shows that the feature is useful in type.hpp and it does simplify code. > > > > > > I was wondering the same thing. 
The code becomes much cleaner if we would allow this. > > I'll propose a change to the coding style. Sounds good, thanks for doing that! ------------- PR: https://git.openjdk.org/jdk/pull/9237 From chagedorn at openjdk.org Fri Jun 24 09:04:26 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 24 Jun 2022 09:04:26 GMT Subject: RFR: 8288683: C2: And node gets wrong type due to not adding it back to the worklist in CCP Message-ID: <2AOZ4GZDzfj8-MJD_pKJA0ZjnWqqRJCsN5ZCm4O2384=.0a3cc07e-148a-4eab-a2d3-5da04146ba1f@github.com> [JDK-8277850](https://bugs.openjdk.org/browse/JDK-8277850) added some new optimizations in `AndI/L::Value()` to optimize patterns similar to `(v << 2) & 3` which can directly be replaced by zero if the mask and the shifted value are bitwise disjoint. To do that, we look at the type of the shift value of the `LShift` input (right-hand side input): https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/mulnode.cpp#L1752-L1765 The optimization as such works fine but there is a problem in CCP.
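To make the disjointness argument concrete, here is a minimal Java sketch. The class and method names (`DisjointMaskSketch`, `maskIsDisjoint`, `shiftAndMask`) are made up for illustration and are not part of the patch or of C2; the helper only mimics, at the source level, the kind of check described above for a known shift amount.

```java
public class DisjointMaskSketch {
    // Hypothetical helper (not C2 code): true when `mask` only selects bits
    // that a left shift by `shift` is guaranteed to have cleared.
    static boolean maskIsDisjoint(int shift, int mask) {
        int effectiveShift = shift & 31;               // Java uses only the low 5 bits of an int shift count
        int zeroedLowBits = (1 << effectiveShift) - 1; // bits that are always zero after the shift
        return (mask & ~zeroedLowBits) == 0;
    }

    // The pattern from the description: the low two bits of (v << 2) are zero,
    // so masking with 3 can never keep a set bit.
    static int shiftAndMask(int v) {
        return (v << 2) & 3;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int v : samples) {
            if (shiftAndMask(v) != 0) {
                throw new AssertionError("unexpected non-zero result for " + v);
            }
        }
        System.out.println("maskIsDisjoint(2, 3) = " + maskIsDisjoint(2, 3));
        System.out.println("maskIsDisjoint(2, 4) = " + maskIsDisjoint(2, 4));
    }
}
```

Note that the conclusion depends on the shift amount that was assumed: Java uses only the low five bits of an `int` shift count, so the same mask need not be disjoint for a different shift amount.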
After calling `Value()` for a node in CCP, we generally only add the direct users of it back to the worklist if the type changed: https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/phaseX.cpp#L1812-L1814 We special-case some nodes where we need to add additional nodes (grandchildren or even further down) back to the worklist to not miss updating them, for example, the `Phis` when the use is a `Region`: https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/phaseX.cpp#L1789-L1796 However, we fail to special-case an `LShift` use if the shift value (right-hand side input of the `LShift`) changed. We should add all `AndI/L` nodes back to the worklist to account for the `AndI/L::Value()` optimization. Not doing so can result in a wrong execution as shown with the testcase. We have the following nodes: ![Screenshot from 2022-06-24 10-28-41](https://user-images.githubusercontent.com/17833009/175496296-4280e26b-6f2f-4ddc-b164-b9e887a5d437.png) The `LShiftI` node gets `int` as type (i.e. bottom) and is not put back on the worklist again since the type cannot improve anymore. Afterwards, we process the `AndI` node and call `AndI::Value()`. At this point, the phi node still has the temporary type `int:62`. We apply the optimization that the shifted number and the mask are bitwise disjoint and we set the type of the `AndI` node to `int:0`.
When later reapplying `Phi::Value()` for the phi node, we correct the type to `int:62..69` and try to push the `LShiftI` node use back to the worklist. Since its type is `int`, we do not add it again. At this point, `AndI` is not on the CCP worklist anymore and neither will we push the `AndI` node to it again. We miss to reapply `AndI::Value()` and correct the now wrong `Value()` optimization. We keep `int:0` as type and replace the `AndI` node by constant zero - leading to a wrong execution. Special casing `LShift` -> `AndNodes` in CCP fixes the problem to make sure we reapply `AndI/L::Value()` again. I've applied some more refactorings and comment improvements but since this fix is for JDK 19, I've decided to separate them into an RFE ([JDK-8289051](https://bugs.openjdk.org/browse/JDK-8289051)) to reduce the risk. At some point, we should add some additional verification to find missed `Value()` calls in CCP to avoid similar problems in the future (see [JDK-8257197](https://bugs.openjdk.org/browse/JDK-8257197)). Thanks, Christian ------------- Commit messages: - 8288683: C2: And node gets wrong type due to not adding it back to the worklist in CCP Changes: https://git.openjdk.org/jdk19/pull/65/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=65&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288683 Stats: 110 lines in 3 files changed: 110 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk19/pull/65.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/65/head:pull/65 PR: https://git.openjdk.org/jdk19/pull/65 From tholenstein at openjdk.org Fri Jun 24 09:04:29 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 24 Jun 2022 09:04:29 GMT Subject: RFR: JDK-8288750: IGV: Improve Shortcuts Message-ID: *Improvement of keyboard shortcuts in IGV under macOS:*. Certain keyboard/mouse shortcuts do not work under macOS. E.g. `Ctrl + left-click` to select multiple nodes. 
The reason is that this keyboard shortcut is hardwired as a right-click under macOS and cannot be easily changed in the operating system. In general, the macOS user manual recommends using "Command/Meta" as a modifier key instead of "Control." *Fixed focus of the Graph Tab:* In IGV, shortcuts are linked to a component. Components are for example a Graph Tab, "Outline", "Filters", "Bytecode", "Control Flow" and "Properties". Shortcuts only work if the linked component is in focus. The focus can be changed with `Ctrl + TAB` or by clicking into the TAB component. The Graph Tab did not get the focus back when the user clicked on it. This needed to be fixed. *Fixing QuickSearch:* Netbeans' QuickSearchAction is a global component of which only one common instance exists. IGV used a workaround to repaint the search bar in a new graphics tab. On macOS, the search bar doubled in size with each new Graph Tab. In addition, keyboard shortcuts for the search bar did not work. This issue was fixed by adding the search bar whenever the tab gained focus, and removing it (by default) when a new tab gained focus. This way, no workaround is required, and the size and ability to use a keyboard shortcut are fixed. *Adding new actions to expand/shrink the difference selection:* The user can expand/reduce the difference selection by moving the beginning/end of the selection with the mouse. ![diff](https://user-images.githubusercontent.com/71546117/175498713-df3c76e8-9945-4e1c-8cab-36d9d4ee64c1.png) This is something many users didn't know. Therefore two new buttons should make it clearer for the user that this functionality exists.
![actions](https://user-images.githubusercontent.com/71546117/175498464-9e88e7d8-36df-4506-a801-a8d102d6bc4a.png) By adding these buttons we can now also add keyboard shortcuts to expand/reduce the difference selection. **Fixed shortcuts for:** - Add a single node in the graph to selection (`Ctrl/Cmd + left-click`) - Add multiple nodes in the graph to selection (`Ctrl/Cmd + left-click-drag`) - Zoom in and out (`Ctrl/Cmd + mouse-wheel`) **Added new shortcuts for:** - Search (`Ctrl/Cmd - I` and `Ctrl/Cmd - F`) - Undo (`Ctrl/Cmd - Z`) - Redo (`Ctrl/Cmd - Y` and `Ctrl/Cmd - Shift - Z`) - Show Next Graph (`Ctrl/Cmd - RIGHT`) - Expand the difference selection (`Ctrl/Cmd - UP` and `Ctrl/Cmd - Shift - RIGHT`) - Reduce the difference selection (`Ctrl/Cmd - DOWN` and `Ctrl/Cmd - Shift - LEFT`) - Show Previous Graph (`Ctrl/Cmd - LEFT`) - Show satellite view (`Hold S`) ------------- Commit messages: - change order of Actions - improve shortcuts - JDK-8288750: IGV: Improve Shortcuts Changes: https://git.openjdk.org/jdk/pull/9260/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288750 Stats: 420 lines in 23 files changed: 202 ins; 175 del; 43 mod Patch: https://git.openjdk.org/jdk/pull/9260.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9260/head:pull/9260 PR: https://git.openjdk.org/jdk/pull/9260 From shade at openjdk.org Fri Jun 24 09:21:52 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 24 Jun 2022 09:21:52 GMT Subject: RFR: 8287227: Shenandoah: A couple of virtual thread tests failed with iu mode even without Loom enabled.
[v2] In-Reply-To: References: Message-ID: On Fri, 3 Jun 2022 11:59:14 GMT, Roland Westrelin wrote: >> With JDK-8277654, the load barrier slow path call doesn't produce raw >> memory anymore but the IU barrier call still does. I propose removing >> raw memory for that call too which also causes the assert that fails >> to be removed. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - new fix > - Merge branch 'master' into JDK-8287227 > - Revert "fix" > > This reverts commit aa6f80a7883ee7032f81dbffac5d0257491d7118. > - fix Looks good. I tested `jdk_loom hotspot_loom` with Linux x86_64 fastdebug Shenandoah, including IU mode, and they pass. Apart from the known `Skynet` failure that is tracked separately. Also ran Linux x86_64 fastdebug `tier1` with Shenandoah, with some unrelated failures. src/hotspot/share/gc/shenandoah/c2/shenandoahSupport.cpp line 2678: > 2676: void MemoryGraphFixer::record_new_ctrl(Node* ctrl, Node* new_ctrl, Node* mem, Node* mem_for_ctrl) { > 2677: if (mem_for_ctrl != mem) { > 2678: if (new_ctrl != ctrl) { This two branches look collapsible. ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.org/jdk/pull/8958 From tholenstein at openjdk.org Fri Jun 24 09:37:06 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 24 Jun 2022 09:37:06 GMT Subject: RFR: JDK-8287094: IGV: show node input numbers in edge tooltips Message-ID: For nodes with many inputs, such as safepoints, it is difficult and error-prone to figure out the exact input number of a given incoming edge. 
Extend the Edge Tooltips to include the input number of the destination node: **Before** `91 Addl -> 92 SafePoint` **Now** `91 Addl -> 92 SafePoint [NR]` ![edge](https://user-images.githubusercontent.com/71546117/175506945-6f5137d2-7647-4acb-a135-8fcb719df3e6.png) ------------- Commit messages: - JDK-8287094: IGV: show node input numbers in edge tooltips Changes: https://git.openjdk.org/jdk/pull/9273/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9273&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287094 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9273.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9273/head:pull/9273 PR: https://git.openjdk.org/jdk/pull/9273 From shade at openjdk.org Fri Jun 24 09:39:54 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 24 Jun 2022 09:39:54 GMT Subject: RFR: 8289044: ARM32: missing LIR_Assembler::cmove metadata type support In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 09:20:43 GMT, Boris Ulasevich wrote: > Fixing ARM32 jtreg fails: > - compiler/floatingpoint/TestFloatJNIArgs.java > - runtime/logging/MonitorMismatchTest.java.MonitorMismatchTest > > > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (c1_LIRAssembler_arm.cpp:1482), pid=21156, tid=21171 > # Error: ShouldNotReachHere() Looks fine. But since the triggering change (JDK-8288303) was delivered in JDK 19, this PR should be against JDK 19.
------------- PR: https://git.openjdk.org/jdk/pull/9258 From rpressler at openjdk.org Fri Jun 24 09:42:06 2022 From: rpressler at openjdk.org (Ron Pressler) Date: Fri, 24 Jun 2022 09:42:06 GMT Subject: RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing Message-ID: Please review the following bug fix: `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. This change does three things: 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear `enterSpecial`'s resolved callsite to `Continuation.enter`. 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. 3. In interp_only_mode, the c2i stub will not patch the callsite. This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and to support ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 Passes tiers 1-4 and Loom tiers 1-5.
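The three steps above can be pictured with a small toy model. The C++ sketch below is not HotSpot code: `ToyThread`, `ToyMethod`, `ToyCallSite` and all three helper functions are hypothetical names invented for illustration. Only the decision logic (clear the call site on entering interp_only_mode, prefer the c2i entry, and skip patching while in interp_only_mode) is taken from the description.

```cpp
#include <cstdint>

// Toy model only: none of these types or functions exist in HotSpot.
using Address = std::uintptr_t;

struct ToyMethod {
    Address c2i_entry;            // compiled-to-interpreter adapter
    Address verified_code_entry;  // compiled entry, 0 if the method is not compiled
};

struct ToyThread {
    bool interp_only_mode = false;
};

struct ToyCallSite {
    Address destination = 0;      // 0 means "not resolved yet"
};

// Step 1: entering interp_only_mode clears the already resolved call site.
inline void enter_interp_only_mode(ToyThread& thread, ToyCallSite& site) {
    thread.interp_only_mode = true;
    site.destination = 0;
}

// Step 2: in interp_only_mode, resolution yields the c2i entry
// instead of the compiled entry point.
inline Address resolve_static_call(const ToyThread& thread, const ToyMethod& callee) {
    if (thread.interp_only_mode || callee.verified_code_entry == 0) {
        return callee.c2i_entry;
    }
    return callee.verified_code_entry;
}

// Step 3: a thread in interp_only_mode never patches the call site,
// so its choice of the c2i entry does not leak to other threads.
inline Address call_through(const ToyThread& thread, const ToyMethod& callee,
                            ToyCallSite& site) {
    if (site.destination != 0) {
        return site.destination;  // already patched, possibly by another thread
    }
    Address target = resolve_static_call(thread, callee);
    if (!thread.interp_only_mode) {
        site.destination = target;
    }
    return target;
}
```

In this model the limitation mentioned above is visible as well: if a thread that is not in interp_only_mode calls through the shared site after it was cleared, it patches the site again, and a later call from the interp_only_mode thread takes the patched (compiled) destination.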
------------- Commit messages: - fix - Remove outdated comment - Unexclude test Changes: https://git.openjdk.org/jdk19/pull/66/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=66&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288949 Stats: 55 lines in 9 files changed: 48 ins; 5 del; 2 mod Patch: https://git.openjdk.org/jdk19/pull/66.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/66/head:pull/66 PR: https://git.openjdk.org/jdk19/pull/66 From tholenstein at openjdk.org Fri Jun 24 10:03:55 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 24 Jun 2022 10:03:55 GMT Subject: RFR: JDK-8288750: IGV: Improve Shortcuts [v2] In-Reply-To: References: Message-ID: <-tTw3c0KkIYG5cdqmocxcax1aU62VOSacHbZQ6KgKbk=.adadf2bb-840a-486f-8c8b-19a08912855d@github.com> > *Improvement of keyboard shortcuts in IGV under macOS:*. > Certain keyboard/mouse shortcuts do not work under macOS. E.g. `Ctrl + left-click` to select multiple nodes. The reason is that this keyboard shortcut is hardwired as a right-click under macOS and cannot be easily changed in the operating system. In general, the macOS user manual recommends using "Command/Meta" as a modifier key instead of "Control." > > *Fixed focus of the Graph Tab:*. > In IGV, shortcuts are linked to a component. Components are for example a Graph Tab, "Outline", "Filters", "Bytecode", "Control Flow" and "Properties". Shortcuts only work if the linked component is in focus. The focus can be changed with `Ctrl + TAB` or by clicking into the TAB component. The Graph Tab did not get the focus back when the user clicked on it. This needed to be fixed. > > *Fixing QuickSearch:* > Netbeans' QuickSearchAction is a global component of which only one common instance exists. IGV used a workaround to repaint the search bar in a new graphics tab. On macOS, the search bar doubled in size with each new Graph Tab. In addition, keyboard shortcuts for the search bar did not work. 
This issue was fixed by adding the search bar whenever the tab gained focus, and removing it (by default) when a new tab gained focus. This way, no workaround is required, and the size and ability to use a keyboard shortcut are fixed. > > *Adding new actions to expand/shrink the difference selection:*. > The user can expand/reduce the difference selection by moving the beginning/end of the selection with the mouse. > ![diff](https://user-images.githubusercontent.com/71546117/175498713-df3c76e8-9945-4e1c-8cab-36d9d4ee64c1.png) > This is something many users didn't know. Therefore two new buttons should make it more clear for the user that this functionality exists. > ![actions](https://user-images.githubusercontent.com/71546117/175498464-9e88e7d8-36df-4506-a801-a8d102d6bc4a.png) > By adding these buttons we can now also add keyboard shortcuts to expand/reduce the difference selection.
> > **Fixed shortcuts for:** > - Add a single node in the graph to the selection (`Ctrl/Cmd + left-click`) > - Add multiple nodes in the graph to the selection (`Ctrl/Cmd + left-click-drag`) > - Zoom in and out (`Ctrl/Cmd + mouse-wheel`) > > **Added new shortcuts for:** > - Search (`Ctrl/Cmd - I` and `Ctrl/Cmd - F`) > - Undo (`Ctrl/Cmd - Z`) > - Redo (`Ctrl/Cmd - Y` and `Ctrl/Cmd - Shift - Z`) > - Show Next Graph (`Ctrl/Cmd - RIGHT`) > - Expand the difference selection (`Ctrl/Cmd - UP` and `Ctrl/Cmd - Shift - RIGHT`) > - Reduce the difference selection (`Ctrl/Cmd - DOWN` and `Ctrl/Cmd - Shift - LEFT`) > - Show Previous Graph (`Ctrl/Cmd - LEFT`) > - Show satellite view (`Hold S`) Tobias Holenstein has updated the pull request incrementally with two additional commits since the last revision: - code style v2 - code style ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9260/files - new: https://git.openjdk.org/jdk/pull/9260/files/92b714ae..d263328b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=00-01 Stats: 21 lines in 3 files changed: 3 ins; 5 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/9260.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9260/head:pull/9260 PR: https://git.openjdk.org/jdk/pull/9260 From chagedorn at openjdk.org Fri Jun 24 10:05:03 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 24 Jun 2022 10:05:03 GMT Subject: RFR: 8288897: Clean up node dump code In-Reply-To: References: Message-ID: On Wed, 22 Jun 2022 11:38:15 GMT, Emanuel Peter wrote: > I recently did some work in the area of `Node::dump` and `Node::find`, see [JDK-8287647](https://bugs.openjdk.org/browse/JDK-8287647) and [JDK-8283775](https://bugs.openjdk.org/browse/JDK-8283775). > > This change set cleans up the surrounding code and tries to reduce code duplication. > > Things I did: > - remove Node::related.
It was added 7 years ago, with [JDK-8004073](https://bugs.openjdk.org/browse/JDK-8004073). However, it was not extended to many nodes, and hence it is incomplete, and nobody I know seems to use it. > - refactor `dump(int)` to use `dump_bfs` (reduce code duplication). > - redefine categories in `dump_bfs`, focusing on output types. Mixed type is now also control if it has control output, and memory if it has memory output, etc. Plus, a node is also in the control category if it `is_CFG`. This makes `dump_bfs` much more usable, to traverse control and memory flow. > - Other small cleanups, like replacing rarely used dump functions with dump, removing dead code, and making some functions private > - Adding `call from debugger` comment to VM functions that are useful in the debugger Otherwise, nice cleanup! I think it's the right thing to remove unused and unmaintained `dump` methods and reduce code duplication. Have you checked that the printed node order with `dump(X)` is the same as before? I'm not sure if that is a strong requirement. I'm just thinking about `PrintIdeal` with which we do: https://github.com/openjdk/jdk/blob/17aacde50fb971bc686825772e29f6bfecadabda/src/hotspot/share/opto/compile.cpp#L554 Some tools/scripts might depend on the previous order of `dump(X)`. But I'm currently not aware of any such order-dependent processing. For the IR framework, the node order does not matter and if I see that correctly, the dump of an individual node is the same as before. So, it should be fine. src/hotspot/share/opto/node.cpp line 1658: > 1656: } > 1657: > 1658: void find_node_by_dump(Node* start, const char* pattern) { Since we now only dump nodes, how about renaming this method to `dump_nodes_by_dump()` and only keep `find_node(s?)_by_dump()` to call it from the debugger? Same for the other changed `find*()` methods.
src/hotspot/share/opto/node.cpp line 2205: > 2203: } > 2204: > 2205: bool PrintBFS::filter_category(const Node* n, Filter& filter) { Maybe you could add a method comment that you are not filtering on the category for `Mixed` but actually look at the outputs of it and also consider `is_CFG()`. src/hotspot/share/opto/node.cpp line 2220: > 2218: } > 2219: if (filter._other && t->has_category(Type::Category::Other)) { > 2220: return true; Just a suggestion: To make it clear that you are only special casing `Mixed` you could leave the `switch` statement and only do the additional checks for `Mixed`. Since this category check is specific to the filtering of `dump_bfs()` and not something you normally perform on a type, I suggest to move this function to the `Filter` class (if that's possible). This would also require to change the implementation of `has_category()` - if it's too complicated, just leave it as it is. It's fine like that. src/hotspot/share/opto/node.cpp line 2655: > 2653: // call from debugger: dump Node's inputs (or outputs if d negative) > 2654: void Node::dump(int d) const { > 2655: dump_bfs(abs(d), nullptr, (d>0) ? "+$" : "-$"); Suggestion: dump_bfs(abs(d), nullptr, (d > 0) ? "+$" : "-$"); src/hotspot/share/opto/node.cpp line 2661: > 2659: // call from debugger: dump Node's control inputs (or outputs if d negative) > 2660: void Node::dump_ctrl(int d) const { > 2661: dump_bfs(abs(d), nullptr, (d>0) ? "+$c" : "-$c"); Suggestion: dump_bfs(abs(d), nullptr, (d > 0) ? "+$c" : "-$c"); ------------- Changes requested by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9234 From tholenstein at openjdk.org Fri Jun 24 10:08:49 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 24 Jun 2022 10:08:49 GMT Subject: RFR: JDK-8288750: IGV: Improve Shortcuts [v3] In-Reply-To: References: Message-ID: > *Improvement of keyboard shortcuts in IGV under macOS:*. > Certain keyboard/mouse shortcuts do not work under macOS. E.g. 
`Ctrl + left-click` to select multiple nodes. The reason is that this keyboard shortcut is hardwired as a right-click under macOS and cannot be easily changed in the operating system. In general, the macOS user manual recommends using "Command/Meta" as a modifier key instead of "Control." > > *Fixed focus of the Graph Tab:*. > In IGV, shortcuts are linked to a component. Components are for example a Graph Tab, "Outline", "Filters", "Bytecode", "Control Flow" and "Properties". Shortcuts only work if the linked component is in focus. The focus can be changed with `Ctrl + TAB` or by clicking into the TAB component. The Graph Tab did not get the focus back when the user clicked on it. This needed to be fixed. > > *Fixing QuickSearch:* > Netbeans' QuickSearchAction is a global component of which only one common instance exists. IGV used a workaround to repaint the search bar in a new graphics tab. On macOS, the search bar doubled in size with each new Graph Tab. In addition, keyboard shortcuts for the search bar did not work. This issue was fixed by adding the search bar whenever the tab gained focus, and removing it (by default) when a new tab gained focus. This way, no workaround is required, and the size and ability to use a keyboard shortcut are fixed. > > *Adding new actions to expand/shrink the difference selection:*. > The user can expand/reduce the difference selection by moving the beginning/end of the selection with the mouse. > ![diff](https://user-images.githubusercontent.com/71546117/175498713-df3c76e8-9945-4e1c-8cab-36d9d4ee64c1.png) > This is something many users didn't know. Therefore two new buttons should make it more clear for the user that this functionality exists.
> ![actions](https://user-images.githubusercontent.com/71546117/175498464-9e88e7d8-36df-4506-a801-a8d102d6bc4a.png) > By adding these buttons we can now also add keyboard shortcuts to expand/reduce the difference selection. > > **Fixed shortcuts for:** > - Add a single node in the graph to the selection (`Ctrl/Cmd + left-click`) > - Add multiple nodes in the graph to the selection (`Ctrl/Cmd + left-click-drag`) > - Zoom in and out (`Ctrl/Cmd + mouse-wheel`) > > **Added new shortcuts for:** > - Search (`Ctrl/Cmd - I` and `Ctrl/Cmd - F`) > - Undo (`Ctrl/Cmd - Z`) > - Redo (`Ctrl/Cmd - Y` and `Ctrl/Cmd - Shift - Z`) > - Show Next Graph (`Ctrl/Cmd - RIGHT`) > - Expand the difference selection (`Ctrl/Cmd - UP` and `Ctrl/Cmd - Shift - RIGHT`) > - Reduce the difference selection (`Ctrl/Cmd - DOWN` and `Ctrl/Cmd - Shift - LEFT`) > - Show Previous Graph (`Ctrl/Cmd - LEFT`) > - Show satellite view (`Hold S`) Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: code style v3 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9260/files - new: https://git.openjdk.org/jdk/pull/9260/files/d263328b..78d5fc85 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=01-02 Stats: 50 lines in 10 files changed: 0 ins; 50 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9260.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9260/head:pull/9260 PR: https://git.openjdk.org/jdk/pull/9260 From duke at openjdk.org Fri Jun 24 10:20:52 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Fri, 24 Jun 2022 10:20:52 GMT Subject: RFR: 8289071: Compute stub sizes outside of locks In-Reply-To: References: Message-ID: 
<-s3vMpZQzsdT4tCfQ7yINmHOya9QI7gy8dA_KdvqFE8=.734ce8cf-29b2-4fec-8b77-46dcaead8403@github.com> On Fri, 24 Jun 2022 00:05:25 GMT, Yi-Fan Tsai wrote: > 8289071: Compute stub sizes outside of locks src/hotspot/share/code/nmethod.cpp line 473: > 471: // create nmethod > 472: nmethod* nm = NULL; > 473: int native_nmethod_size = CodeBlob::allocation_size(code_buffer, sizeof(nmethod)); This change and another one below are not calculations of stub sizes. You need to either create another PR for them or update JDK-8289071 to include these cases. ------------- PR: https://git.openjdk.org/jdk/pull/9266 From tholenstein at openjdk.org Fri Jun 24 10:27:40 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 24 Jun 2022 10:27:40 GMT Subject: RFR: JDK-8287094: IGV: show node input numbers in edge tooltips [v2] In-Reply-To: References: Message-ID: > For nodes with many inputs, such as safepoints, it is difficult and error-prone to figure out the exact input number of a given incoming edge. > > Extend the Edge Tooltips to include the input number of the destination node: > **Before** `91 Addl -> 92 SafePoint` > **Now** `91 Addl -> 92 SafePoint [NR]` > ![edge](https://user-images.githubusercontent.com/71546117/175506945-6f5137d2-7647-4acb-a135-8fcb719df3e6.png) Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: CustomSelectAction added ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9273/files - new: https://git.openjdk.org/jdk/pull/9273/files/b3df3367..7bc37589 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9273&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9273&range=00-01 Stats: 67 lines in 1 file changed: 67 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9273.diff Fetch: git fetch https://git.openjdk.org/jdk 
pull/9273/head:pull/9273 PR: https://git.openjdk.org/jdk/pull/9273 From duke at openjdk.org Fri Jun 24 10:27:47 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Fri, 24 Jun 2022 10:27:47 GMT Subject: RFR: 8289071: Compute stub sizes outside of locks In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 00:05:25 GMT, Yi-Fan Tsai wrote: > 8289071: Compute stub sizes outside of locks Could you please add the description of your changes? It would be very useful if the description briefly gives: - What the problem is. - What the solution is. - How you tested them. No need to copy all details from JDK-8289071, just enough to understand the context of the changes. ------------- PR: https://git.openjdk.org/jdk/pull/9266 From tholenstein at openjdk.org Fri Jun 24 10:33:38 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 24 Jun 2022 10:33:38 GMT Subject: RFR: JDK-8288750: IGV: Improve Shortcuts [v4] In-Reply-To: References: Message-ID: <-dVBSDDY7DAzlqXdt_7oXrQ80pR2LaIg2jl7a1KyGGU=.17e233d4-af55-4cb0-a489-fa54f19cd68c@github.com> > *Improvement of keyboard shortcuts in IGV under macOS:*. > Certain keyboard/mouse shortcuts do not work under macOS. E.g. `Ctrl + left-click` to select multiple nodes. The reason is that this keyboard shortcut is hardwired as a right-click under macOS and cannot be easily changed in the operating system. In general, the macOS user manual recommends using "Command/Meta" as a modifier key instead of "Control." > > *Fixed focus of the Graph Tab:*. > In IGV, shortcuts are linked to a component. Components are for example a Graph Tab, "Outline", "Filters", "Bytecode", "Control Flow" and "Properties". Shortcuts only work if the linked component is in focus. The focus can be changed with `Ctrl + TAB` or by clicking into the TAB component. The Graph Tab did not get the focus back when the user clicked on it. This needed to be fixed. 
> > *Fixing QuickSearch:* > Netbeans' QuickSearchAction is a global component of which only one common instance exists. IGV used a workaround to repaint the search bar in a new graphics tab. On macOS, the search bar doubled in size with each new Graph Tab. In addition, keyboard shortcuts for the search bar did not work. This issue was fixed by adding the search bar whenever the tab gained focus, and removing it (by default) when a new tab gained focus. This way, no workaround is required, and the size and ability to use a keyboard shortcut are fixed. > > *Adding new actions to expand/shrink the difference selection:*. > The user can expand/reduce the difference selection by moving the beginning/end of the selection with the mouse. > ![diff](https://user-images.githubusercontent.com/71546117/175498713-df3c76e8-9945-4e1c-8cab-36d9d4ee64c1.png) > This is something many users didn't know. Therefore two new buttons should make it more clear for the user that this functionality exists. > ![actions](https://user-images.githubusercontent.com/71546117/175498464-9e88e7d8-36df-4506-a801-a8d102d6bc4a.png) > By adding these buttons we can now also add keyboard shortcuts to expand/reduce the difference selection.
> > **Fixed shortcuts for:** > - Add a single node in the graph to the selection (`Ctrl/Cmd + left-click`) > - Add multiple nodes in the graph to the selection (`Ctrl/Cmd + left-click-drag`) > - Zoom in and out (`Ctrl/Cmd + mouse-wheel`) > > **Added new shortcuts for:** > - Search (`Ctrl/Cmd - I` and `Ctrl/Cmd - F`) > - Undo (`Ctrl/Cmd - Z`) > - Redo (`Ctrl/Cmd - Y` and `Ctrl/Cmd - Shift - Z`) > - Show Next Graph (`Ctrl/Cmd - RIGHT`) > - Expand the difference selection (`Ctrl/Cmd - UP` and `Ctrl/Cmd - Shift - RIGHT`) > - Reduce the difference selection (`Ctrl/Cmd - DOWN` and `Ctrl/Cmd - Shift - LEFT`) > - Show Previous Graph (`Ctrl/Cmd - LEFT`) > - Show satellite view (`Hold S`) Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: added missing CustomSelectAction.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9260/files - new: https://git.openjdk.org/jdk/pull/9260/files/78d5fc85..3559c392 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=02-03 Stats: 67 lines in 1 file changed: 67 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9260.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9260/head:pull/9260 PR: https://git.openjdk.org/jdk/pull/9260 From tholenstein at openjdk.org Fri Jun 24 10:40:31 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 24 Jun 2022 10:40:31 GMT Subject: RFR: JDK-8287094: IGV: show node input numbers in edge tooltips [v3] In-Reply-To: References: Message-ID: <9aid5UrN-ca7cOUt4dO4B3Q_A9n5iRk73FoCHIBDObA=.101909f3-59d6-48cf-a3ac-0c8703e3748c@github.com> > For nodes with many inputs, such as safepoints, it is difficult and error-prone to figure out the exact input number of a given incoming edge.
> > Extend the Edge Tooltips to include the input number of the destination node: > **Before** `91 Addl -> 92 SafePoint` > **Now** `91 Addl -> 92 SafePoint [NR]` > ![edge](https://user-images.githubusercontent.com/71546117/175506945-6f5137d2-7647-4acb-a135-8fcb719df3e6.png) Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: Revert "CustomSelectAction added" This reverts commit 7bc3758905b50df39c188e1e2d90e222839ddedf. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9273/files - new: https://git.openjdk.org/jdk/pull/9273/files/7bc37589..2b9545d8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9273&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9273&range=01-02 Stats: 67 lines in 1 file changed: 0 ins; 67 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9273.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9273/head:pull/9273 PR: https://git.openjdk.org/jdk/pull/9273 From tholenstein at openjdk.org Fri Jun 24 10:44:15 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Fri, 24 Jun 2022 10:44:15 GMT Subject: RFR: JDK-8288750: IGV: Improve Shortcuts [v5] In-Reply-To: References: Message-ID: > *Improvement of keyboard shortcuts in IGV under macOS:*. > Certain keyboard/mouse shortcuts do not work under macOS. E.g. `Ctrl + left-click` to select multiple nodes. The reason is that this keyboard shortcut is hardwired as a right-click under macOS and cannot be easily changed in the operating system. In general, the macOS user manual recommends using "Command/Meta" as a modifier key instead of "Control." > > *Fixed focus of the Graph Tab:*. > In IGV, shortcuts are linked to a component. Components are for example a Graph Tab, "Outline", "Filters", "Bytecode", "Control Flow" and "Properties".
Shortcuts only work if the linked component is in focus. The focus can be changed with `Ctrl + TAB` or by clicking into the TAB component. The Graph Tab did not get the focus back when the user clicked on it. This needed to be fixed. > > *Fixing QuickSearch:* > Netbeans' QuickSearchAction is a global component of which only one common instance exists. IGV used a workaround to repaint the search bar in a new graphics tab. On macOS, the search bar doubled in size with each new Graph Tab. In addition, keyboard shortcuts for the search bar did not work. This issue was fixed by adding the search bar whenever the tab gained focus, and removing it (by default) when a new tab gained focus. This way, no workaround is required, and the size and ability to use a keyboard shortcut are fixed. > > *Adding new actions to expand/shrink the difference selection:*. > The user can expand/reduce the difference selection by moving the beginning/end of the selection with the mouse. > ![diff](https://user-images.githubusercontent.com/71546117/175498713-df3c76e8-9945-4e1c-8cab-36d9d4ee64c1.png) > This is something many users didn't know. Therefore two new buttons should make it more clear for the user that this functionality exists. > ![actions](https://user-images.githubusercontent.com/71546117/175498464-9e88e7d8-36df-4506-a801-a8d102d6bc4a.png) > By adding these buttons we can now also add keyboard shortcuts to expand/reduce the difference selection.
> > **Fixed shortcuts for:** > - Add a single node in the graph to the selection (`Ctrl/Cmd + left-click`) > - Add multiple nodes in the graph to the selection (`Ctrl/Cmd + left-click-drag`) > - Zoom in and out (`Ctrl/Cmd + mouse-wheel`) > > **Added new shortcuts for:** > - Search (`Ctrl/Cmd - I` and `Ctrl/Cmd - F`) > - Undo (`Ctrl/Cmd - Z`) > - Redo (`Ctrl/Cmd - Y` and `Ctrl/Cmd - Shift - Z`) > - Show Next Graph (`Ctrl/Cmd - RIGHT`) > - Expand the difference selection (`Ctrl/Cmd - UP` and `Ctrl/Cmd - Shift - RIGHT`) > - Reduce the difference selection (`Ctrl/Cmd - DOWN` and `Ctrl/Cmd - Shift - LEFT`) > - Show Previous Graph (`Ctrl/Cmd - LEFT`) > - Show satellite view (`Hold S`) Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: added missing files ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9260/files - new: https://git.openjdk.org/jdk/pull/9260/files/3559c392..42653123 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=03-04 Stats: 203 lines in 2 files changed: 203 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9260.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9260/head:pull/9260 PR: https://git.openjdk.org/jdk/pull/9260 From bulasevich at openjdk.org Fri Jun 24 10:56:24 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 24 Jun 2022 10:56:24 GMT Subject: RFR: 8289044: ARM32: missing LIR_Assembler::cmove metadata type support Message-ID: Fixing ARM32 jtreg failures: - compiler/floatingpoint/TestFloatJNIArgs.java - runtime/logging/MonitorMismatchTest.java.MonitorMismatchTest # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (c1_LIRAssembler_arm.cpp:1482), pid=21156, tid=21171 # Error: ShouldNotReachHere() ------------- Commit messages: - 8289044: ARM32: missing LIR_Assembler::cmove metadata type support Changes: 
https://git.openjdk.org/jdk19/pull/67/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=67&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289044 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk19/pull/67.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/67/head:pull/67 PR: https://git.openjdk.org/jdk19/pull/67 From bulasevich at openjdk.org Fri Jun 24 10:57:42 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 24 Jun 2022 10:57:42 GMT Subject: RFR: 8289044: ARM32: missing LIR_Assembler::cmove metadata type support In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 09:36:07 GMT, Aleksey Shipilev wrote: > Looks fine. But since the triggering change (JDK-8288303) was delivered in JDK 19, this PR should be against JDK 19. You are right. Will the change be automatically pushed to the openjdk/jdk repo? Should I close this PR? Here is the PR against JDK 19: https://github.com/openjdk/jdk19/pull/67 ------------- PR: https://git.openjdk.org/jdk/pull/9258 From shade at openjdk.org Fri Jun 24 12:05:58 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 24 Jun 2022 12:05:58 GMT Subject: RFR: 8289044: ARM32: missing LIR_Assembler::cmove metadata type support In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 10:46:59 GMT, Boris Ulasevich wrote: > Fixing ARM32 jtreg fails: > - compiler/floatingpoint/TestFloatJNIArgs.java > - runtime/logging/MonitorMismatchTest.java.MonitorMismatchTest > > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (c1_LIRAssembler_arm.cpp:1482), pid=21156, tid=21171 > # Error: ShouldNotReachHere() Looks fine. ------------- Marked as reviewed by shade (Reviewer).
PR: https://git.openjdk.org/jdk19/pull/67 From shade at openjdk.org Fri Jun 24 12:06:06 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 24 Jun 2022 12:06:06 GMT Subject: RFR: 8289044: ARM32: missing LIR_Assembler::cmove metadata type support In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 10:54:06 GMT, Boris Ulasevich wrote: > You are right. Will the change be automatically pushed to the openjdk/jdk repo? Should I close this PR? Yes and yes. ------------- PR: https://git.openjdk.org/jdk/pull/9258 From bulasevich at openjdk.org Fri Jun 24 12:10:58 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 24 Jun 2022 12:10:58 GMT Subject: Withdrawn: 8289044: ARM32: missing LIR_Assembler::cmove metadata type support In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 09:20:43 GMT, Boris Ulasevich wrote: > Fixing ARM32 jtreg fails: > - compiler/floatingpoint/TestFloatJNIArgs.java > - runtime/logging/MonitorMismatchTest.java.MonitorMismatchTest > > > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (c1_LIRAssembler_arm.cpp:1482), pid=21156, tid=21171 > # Error: ShouldNotReachHere() This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/9258 From bulasevich at openjdk.org Fri Jun 24 12:11:47 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 24 Jun 2022 12:11:47 GMT Subject: RFR: 8289044: ARM32: missing LIR_Assembler::cmove metadata type support In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 12:02:38 GMT, Aleksey Shipilev wrote: > Looks fine. Thank you! 
------------- PR: https://git.openjdk.org/jdk19/pull/67 From chagedorn at openjdk.org Fri Jun 24 12:21:01 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 24 Jun 2022 12:21:01 GMT Subject: RFR: JDK-8288750: IGV: Improve Shortcuts [v5] In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 10:44:15 GMT, Tobias Holenstein wrote: >> *Improvement of keyboard shortcuts in IGV under macOS:*. >> Certain keyboard/mouse shortcuts do not work under macOS. E.g. `Ctrl + left-click` to select multiple nodes. The reason is that this keyboard shortcut is hardwired as a right-click under macOS and cannot be easily changed in the operating system. In general, the macOS user manual recommends using "Command/Meta" as a modifier key instead of "Control." >> >> *Fixed focus of the Graph Tab:*. >> In IGV, shortcuts are linked to a component. Components are for example a Graph Tab, "Outline", "Filters", "Bytecode", "Control Flow" and "Properties". Shortcuts only work if the linked component is in focus. The focus can be changed with `Ctrl + TAB` or by clicking into the TAB component. The Graph Tab did not get the focus back when the user clicked on it. This needed to be fixed. >> >> *Fixing QuickSearch:* >> Netbeans' QuickSearchAction is a global component of which only one common instance exists. IGV used a workaround to repaint the search bar in a new graphics tab. On macOS, the search bar doubled in size with each new Graph Tab. In addition, keyboard shortcuts for the search bar did not work. This issue was fixed by adding the search bar whenever the tab gained focus, and removing it (by default) when a new tab gained focus. This way, no workaround is required, and the size and ability to use a keyboard shortcut are fixed. >> >> *Adding new actions to expand/shrink the difference selection:*. >> The user can expand/reduce the difference selection by moving the beginning/end of the selection with the mouse. 
>> ![diff](https://user-images.githubusercontent.com/71546117/175498713-df3c76e8-9945-4e1c-8cab-36d9d4ee64c1.png) >> This is something many users didn't know. Two new buttons should therefore make it clearer to the user that this functionality exists. >> ![actions](https://user-images.githubusercontent.com/71546117/175498464-9e88e7d8-36df-4506-a801-a8d102d6bc4a.png) >> By adding these buttons, we can now also add keyboard shortcuts to expand/reduce the difference selection. >> >> **Fixed shortcuts for:** >> - Add a single node in the graph to the selection (`Ctrl/Cmd + left-click`) >> - Add multiple nodes in the graph to the selection (`Ctrl/Cmd + left-click-drag`) >> - Zoom in and out (`Ctrl/Cmd + mouse-wheel`) >> >> **Added new shortcuts for:** >> - Search (`Ctrl/Cmd - I` and `Ctrl/Cmd - F`) >> - Undo (`Ctrl/Cmd - Z`) >> - Redo (`Ctrl/Cmd - Y` and `Ctrl/Cmd - Shift - Z`) >> - Show Next Graph (`Ctrl/Cmd - RIGHT`) >> - Expand the difference selection (`Ctrl/Cmd - UP` and `Ctrl/Cmd - Shift - RIGHT`) >> - Reduce the difference selection (`Ctrl/Cmd - DOWN` and `Ctrl/Cmd - Shift - LEFT`) >> - Show Previous Graph (`Ctrl/Cmd - LEFT`) >> - Show satellite view (`Hold S`) > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > added missing files That's a nice improvement, thanks for fixing these and adding new shortcuts together with icons! I've tried the shortcuts out and they seem to work fine (tested on Ubuntu 20.04). It makes the workflow a lot easier. I only have some minor code style comments, otherwise it looks good - but I'm not very familiar with the IGV code.
src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/actions/ImportAction.java line 171: > 169: > 170: @Override > 171: protected boolean asynchronous() { return false; } Suggestion: protected boolean asynchronous() { return false; } src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/EditorTopComponent.java line 347: > 345: centerPanel.getActionMap().put("showSatellite", > 346: new AbstractAction("showSatellite") { > 347: @Override public void actionPerformed(ActionEvent e) { Suggestion: @Override public void actionPerformed(ActionEvent e) { src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/EditorTopComponent.java line 356: > 354: centerPanel.getActionMap().put("showScene", > 355: new AbstractAction("showScene") { > 356: @Override public void actionPerformed(ActionEvent e) { Suggestion: @Override public void actionPerformed(ActionEvent e) { src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/EditorTopComponent.java line 372: > 370: centerPanel.add(SATELLITE_STRING, satelliteComponent); > 371: > 372: New line can be removed. src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/CustomSelectAction.java line 50: > 48: } > 49: > 50: protected int getModifierMask () { Suggestion: protected int getModifierMask() { src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/CustomSelectAction.java line 55: > 53: > 54: @Override > 55: public State mousePressed (Widget widget, WidgetMouseEvent event) { Suggestion: public State mousePressed(Widget widget, WidgetMouseEvent event) { src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/ExpandDiffAction.java line 41: > 39: } > 40: > 41: public ExpandDiffAction(Lookup lookup) { `lookup` seems unused. Can be removed? 
src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/OverviewAction.java line 43: > 41: public OverviewAction() { > 42: putValue(AbstractAction.SMALL_ICON, new ImageIcon(ImageUtilities.loadImage(iconResource()))); > 43: putValue(Action.SHORT_DESCRIPTION, "Show satellite view of whole graph (hold S-KEY"); Suggestion: putValue(Action.SHORT_DESCRIPTION, "Show satellite view of whole graph (hold S-KEY)"); src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/ShrinkDiffAction.java line 40: > 38: } > 39: > 40: public ShrinkDiffAction(Lookup lookup) { `lookup` seems unused. Can be removed? src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/ShrinkDiffAction.java line 67: > 65: int nfp = fp; > 66: int nsp = (fp < sp) ? sp - 1 : sp; > 67: model.setPositions(nfp, nsp); `nfp` can be inlined directly. Maybe you want to rename the variables instead of using abbreviations. Same in `ExpandDiffAction`. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9260 From chagedorn at openjdk.org Fri Jun 24 12:33:54 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 24 Jun 2022 12:33:54 GMT Subject: RFR: JDK-8287094: IGV: show node input numbers in edge tooltips [v3] In-Reply-To: <9aid5UrN-ca7cOUt4dO4B3Q_A9n5iRk73FoCHIBDObA=.101909f3-59d6-48cf-a3ac-0c8703e3748c@github.com> References: <9aid5UrN-ca7cOUt4dO4B3Q_A9n5iRk73FoCHIBDObA=.101909f3-59d6-48cf-a3ac-0c8703e3748c@github.com> Message-ID: On Fri, 24 Jun 2022 10:40:31 GMT, Tobias Holenstein wrote: >> For nodes with many inputs, such as safepoints, it is difficult and error-prone to figure out the exact input number of a given incoming edge. 
>> >> Extend the Edge Tooltips to include the input number of the destination node: >> **Before** `91 Addl -> 92 SafePoint` >> **Now** `91 Addl -> 92 SafePoint [NR]` >> ![edge](https://user-images.githubusercontent.com/71546117/175506945-6f5137d2-7647-4acb-a135-8fcb719df3e6.png) > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > Revert "CustomSelectAction added" > > This reverts commit 7bc3758905b50df39c188e1e2d90e222839ddedf. Looks good! src/utils/IdealGraphVisualizer/Graph/src/main/java/com/sun/hotspot/igv/graph/FigureConnection.java line 108: > 106: builder.append(" ??? "); > 107: builder.append(getInputSlot().getFigure().getProperties().resolveString(shortNodeText)); > 108: builder.append(" [" + getInputSlot().getPosition() + "]"); String concatenation could be replaced by `append()` calls to follow the pattern above (the `builder.` can also be removed and the calls be chained together directly): Suggestion: builder.append(" [") .append(getInputSlot().getPosition()) .append("]"); ------------- Marked as reviewed by chagedorn (Reviewer).
PR: https://git.openjdk.org/jdk/pull/9273 From stuefe at openjdk.org Fri Jun 24 13:21:46 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Fri, 24 Jun 2022 13:21:46 GMT Subject: RFR: 8289044: ARM32: missing LIR_Assembler::cmove metadata type support In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 10:46:59 GMT, Boris Ulasevich wrote: > Fixing ARM32 jtreg fails: > - compiler/floatingpoint/TestFloatJNIArgs.java > - runtime/logging/MonitorMismatchTest.java.MonitorMismatchTest > > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (c1_LIRAssembler_arm.cpp:1482), pid=21156, tid=21171 > # Error: ShouldNotReachHere() +1 ------------- Marked as reviewed by stuefe (Reviewer). PR: https://git.openjdk.org/jdk19/pull/67 From bulasevich at openjdk.org Fri Jun 24 13:39:54 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 24 Jun 2022 13:39:54 GMT Subject: Integrated: 8289044: ARM32: missing LIR_Assembler::cmove metadata type support In-Reply-To: References: Message-ID: <1AibZzDT4f-UGNzseSW4i05kfCZp_4xdanG3Uf_j694=.edae2107-96d7-4618-8699-3851223a8e39@github.com> On Fri, 24 Jun 2022 10:46:59 GMT, Boris Ulasevich wrote: > Fixing ARM32 jtreg fails: > - compiler/floatingpoint/TestFloatJNIArgs.java > - runtime/logging/MonitorMismatchTest.java.MonitorMismatchTest > > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (c1_LIRAssembler_arm.cpp:1482), pid=21156, tid=21171 > # Error: ShouldNotReachHere() This pull request has now been integrated. 
Changeset: 20f55abd Author: Boris Ulasevich URL: https://git.openjdk.org/jdk19/commit/20f55abd2744323a756872e080885d107e6c56e5 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod 8289044: ARM32: missing LIR_Assembler::cmove metadata type support Reviewed-by: shade, stuefe ------------- PR: https://git.openjdk.org/jdk19/pull/67 From epeter at openjdk.org Fri Jun 24 13:50:55 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 24 Jun 2022 13:50:55 GMT Subject: RFR: 8288897: Clean up node dump code In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 09:05:47 GMT, Christian Hagedorn wrote: >> I recently did some work in the area of `Node::dump` and `Node::find`, see [JDK-8287647](https://bugs.openjdk.org/browse/JDK-8287647) and [JDK-8283775](https://bugs.openjdk.org/browse/JDK-8283775). >> >> This change set cleans up the surrounding code and tries to reduce code duplication. >> >> Things I did: >> - remove Node::related. It was added 7 years ago, with [JDK-8004073](https://bugs.openjdk.org/browse/JDK-8004073). However, it was not extended to many nodes, and hence it is incomplete, and nobody I know seems to use it. >> - refactor `dump(int)` to use `dump_bfs` (reduce code duplication). >> - redefine categories in `dump_bfs`, focusing on output types. Mixed type is now also control if it has control output, and memory if it has memory output, etc. Plus, a node is also in the control category if it `is_CFG`. This makes `dump_bfs` much more usable, to traverse control and memory flow.
>> - Other small cleanups, like replacing rarely used dump functions with dump, removing dead code, and making some functions private >> - Adding a `call from debugger` comment to VM functions that are useful in the debugger > > src/hotspot/share/opto/node.cpp line 1658: > >> 1656: } >> 1657: >> 1658: void find_node_by_dump(Node* start, const char* pattern) { > > Since we now only dump nodes, how about renaming this method to `dump_nodes_by_dump()` and only keep `find_node(s?)_by_dump()` to call it from the debugger? Same for the other changed `find*()` methods. I did think about renaming it to `dump_...`. But then I also find it important that the name says that we do filter / search / find. ------------- PR: https://git.openjdk.org/jdk/pull/9234 From epeter at openjdk.org Fri Jun 24 13:54:55 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 24 Jun 2022 13:54:55 GMT Subject: RFR: 8288897: Clean up node dump code [v2] In-Reply-To: References: Message-ID: > I recently did some work in the area of `Node::dump` and `Node::find`, see [JDK-8287647](https://bugs.openjdk.org/browse/JDK-8287647) and [JDK-8283775](https://bugs.openjdk.org/browse/JDK-8283775). > > This change set cleans up the surrounding code and tries to reduce code duplication. > > Things I did: > - remove Node::related. It was added 7 years ago, with [JDK-8004073](https://bugs.openjdk.org/browse/JDK-8004073). However, it was not extended to many nodes, and hence it is incomplete, and nobody I know seems to use it. > - refactor `dump(int)` to use `dump_bfs` (reduce code duplication). > - redefine categories in `dump_bfs`, focusing on output types. Mixed type is now also control if it has control output, and memory if it has memory output, etc. Plus, a node is also in the control category if it `is_CFG`. This makes `dump_bfs` much more usable, to traverse control and memory flow.
> - Other small cleanups, like replacing rarely used dump functions with dump, removing dead code, and making some functions private > - Adding a `call from debugger` comment to VM functions that are useful in the debugger Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply suggestions from code review 2 style fixes by Christian Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9234/files - new: https://git.openjdk.org/jdk/pull/9234/files/80fe17db..1a836616 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9234&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9234&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9234.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9234/head:pull/9234 PR: https://git.openjdk.org/jdk/pull/9234 From aph at openjdk.org Fri Jun 24 14:14:03 2022 From: aph at openjdk.org (Andrew Haley) Date: Fri, 24 Jun 2022 14:14:03 GMT Subject: RFR: 8289060: Undefined Behaviour in class VMReg Message-ID: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> Like class `Register`, class `VMReg` exhibits undefined behaviour, in particular null pointer dereferences. The right way to fix this is simple: make instances of `VMReg` point to reified instances of `VMRegImpl`. We do this by creating a static array of `VMRegImpl`, and making all `VMReg` instances point into it, making the code well defined. However, while `VMReg` instances are no longer null, and so do not generate compile warnings or errors, there is still a problem in that higher-numbered `VMReg` instances point outside the static array of `VMRegImpl`. This is hard to avoid, given that (as far as I can tell) there is no upper limit on the number of stack slots that can be allocated as `VMReg` instances. While this is in theory UB, it's not likely to cause problems.
We could fix this by creating a much larger static array of `VMRegImpl`, up to the largest plausible size of stack offsets. We could instead make `VMReg` instances objects with a single numeric field rather than pointers, but some C++ compilers pass all such objects by reference, so I don't think we should. ------------- Commit messages: - 8289060: Undefined Behaviour in class VMReg - First Changes: https://git.openjdk.org/jdk/pull/9276/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9276&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289060 Stats: 28 lines in 2 files changed: 14 ins; 2 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/9276.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9276/head:pull/9276 PR: https://git.openjdk.org/jdk/pull/9276 From aph at openjdk.org Fri Jun 24 14:25:55 2022 From: aph at openjdk.org (Andrew Haley) Date: Fri, 24 Jun 2022 14:25:55 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v3] In-Reply-To: References: Message-ID: > All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers. > > Here's an example of what was happening: > > ` rax->encoding();` > > Where rax is defined as `(Register *)0`. > > This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl.
> > > typedef const RegisterImpl* Register; > extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1] ; > inline constexpr Register RegisterImpl::first() { return all_Registers + 1; }; > inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; } > constexpr Register rax = as_Register(0); Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: Cleanup ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9261/files - new: https://git.openjdk.org/jdk/pull/9261/files/36ba30bc..948cda18 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9261.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9261/head:pull/9261 PR: https://git.openjdk.org/jdk/pull/9261 From duke at openjdk.org Fri Jun 24 17:16:37 2022 From: duke at openjdk.org (Yi-Fan Tsai) Date: Fri, 24 Jun 2022 17:16:37 GMT Subject: RFR: 8289071: Compute stub sizes outside of locks [v2] In-Reply-To: References: Message-ID: <2N6fS9zQViWs6EK5ehhH3AmGjjabINboLJ_By12AKyA=.669c76ba-9a9e-4818-911a-0ae7fb327bde@github.com> > 8289071: Compute stub sizes outside of locks Yi-Fan Tsai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains two additional commits since the last revision: - Merge branch 'master' of https://github.com/yftsai/jdk into JDK-8289071 - 8289071: Compute stub sizes outside of locks ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9266/files - new: https://git.openjdk.org/jdk/pull/9266/files/4102427f..ed8c4acd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9266&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9266&range=00-01 Stats: 1456 lines in 46 files changed: 822 ins; 545 del; 89 mod Patch: https://git.openjdk.org/jdk/pull/9266.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9266/head:pull/9266 PR: https://git.openjdk.org/jdk/pull/9266 From rahul.kandu at intel.com Fri Jun 24 17:18:47 2022 From: rahul.kandu at intel.com (Kandu, Rahul) Date: Fri, 24 Jun 2022 17:18:47 +0000 Subject: RFR: 8289071: Compute stub sizes outside of locks [v2] In-Reply-To: <2N6fS9zQViWs6EK5ehhH3AmGjjabINboLJ_By12AKyA=.669c76ba-9a9e-4818-911a-0ae7fb327bde@github.com> References: <2N6fS9zQViWs6EK5ehhH3AmGjjabINboLJ_By12AKyA=.669c76ba-9a9e-4818-911a-0ae7fb327bde@github.com> Message-ID: How do I unsubscribe from this mailing list? From rpressler at openjdk.org Fri Jun 24 20:42:56 2022 From: rpressler at openjdk.org (Ron Pressler) Date: Fri, 24 Jun 2022 20:42:56 GMT Subject: RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing In-Reply-To: References: Message-ID: <7PH9bVJW-6hL7tAz5TYvk6qu9RfUPGaLLgkbwnkS3U8=.40f89a35-f43f-4826-981c-7a6a36e7b42e@github.com> On Fri, 24 Jun 2022 09:23:26 GMT, Ron Pressler wrote: > Please review the following bug fix: > > `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. > > Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. > > This change does three things: > > 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. > 2.
In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. > 3. In interp_only_mode, the c2i stub will not patch the callsite. > > This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 > > > Passes tiers 1-4 and Loom tiers 1-5. src/hotspot/share/code/compiledIC.cpp line 591: > 589: // Do not reset stub here: It is too expensive to call find_stub. > 590: // Instead, rely on caller (nmethod::clear_inline_caches) to clear > 591: // both the call and its stub. While at it, I noticed this comment, which appears to be out of date. ------------- PR: https://git.openjdk.org/jdk19/pull/66 From dlong at openjdk.org Sat Jun 25 00:45:41 2022 From: dlong at openjdk.org (Dean Long) Date: Sat, 25 Jun 2022 00:45:41 GMT Subject: RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing In-Reply-To: <7PH9bVJW-6hL7tAz5TYvk6qu9RfUPGaLLgkbwnkS3U8=.40f89a35-f43f-4826-981c-7a6a36e7b42e@github.com> References: <7PH9bVJW-6hL7tAz5TYvk6qu9RfUPGaLLgkbwnkS3U8=.40f89a35-f43f-4826-981c-7a6a36e7b42e@github.com> Message-ID: On Fri, 24 Jun 2022 20:39:08 GMT, Ron Pressler wrote: >> Please review the following bug fix: >> >> `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. >> >> Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. >> >> This change does three things: >> >> 1. 
When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. >> 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. >> 3. In interp_only_mode, the c2i stub will not patch the callsite. >> >> This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 >> >> >> Passes tiers 1-4 and Loom tiers 1-5. > > src/hotspot/share/code/compiledIC.cpp line 591: > >> 589: // Do not reset stub here: It is too expensive to call find_stub. >> 590: // Instead, rely on caller (nmethod::clear_inline_caches) to clear >> 591: // both the call and its stub. > > While at it, I noticed this comment, which appears to be out of date. I read that comment more as a warning, in case in the future someone wondered why we don't reset the stub here, and tried to add it. So I would leave it. I think it's still useful. ------------- PR: https://git.openjdk.org/jdk19/pull/66 From dlong at openjdk.org Sat Jun 25 01:17:57 2022 From: dlong at openjdk.org (Dean Long) Date: Sat, 25 Jun 2022 01:17:57 GMT Subject: RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 09:23:26 GMT, Ron Pressler wrote: > Please review the following bug fix: > > `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. > > Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. 
> > This change does three things: > > 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. > 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. > 3. In interp_only_mode, the c2i stub will not patch the callsite. > > This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 > > > Passes tiers 1-4 and Loom tiers 1-5. src/hotspot/share/code/compiledMethod.cpp line 464: > 462: while(iter.next()) { > 463: if (iter.type() == relocInfo::static_call_type) { > 464: iter.reloc()->clear_inline_cache(); This relies on code patching, and for correctness the change must be seen by the thread requesting interpreter-only mode. If this was being done at a safepoint then it would probably be OK. However, this code seems to be done using a handshake, so I'm not sure if the required serializing instruction is guaranteed to happen (see JDK-8220351). src/hotspot/share/runtime/mutexLocker.cpp line 287: > 285: def(JfieldIdCreation_lock , PaddedMutex , safepoint); > 286: > 287: def(CompiledIC_lock , PaddedMutex , nosafepoint-1); // locks VtableStubs_lock, InlineCacheBuffer_lock Please explain. Is there another lock causing problems? 
------------- PR: https://git.openjdk.org/jdk19/pull/66 From rpressler at openjdk.org Sat Jun 25 01:23:47 2022 From: rpressler at openjdk.org (Ron Pressler) Date: Sat, 25 Jun 2022 01:23:47 GMT Subject: RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v2] In-Reply-To: References: Message-ID: > Please review the following bug fix: > > `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. > > Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. > > This change does three things: > > 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. > 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. > 3. In interp_only_mode, the c2i stub will not patch the callsite. > > This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 > > > Passes tiers 1-4 and Loom tiers 1-5. Ron Pressler has updated the pull request incrementally with one additional commit since the last revision: Revert "Remove outdated comment" This reverts commit 8f571d76e34bc64ceb31894184fba4b909e8fbfe. 
------------- Changes: - all: https://git.openjdk.org/jdk19/pull/66/files - new: https://git.openjdk.org/jdk19/pull/66/files/fe8fe94f..4680aed2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=66&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=66&range=00-01 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk19/pull/66.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/66/head:pull/66 PR: https://git.openjdk.org/jdk19/pull/66 From dlong at openjdk.org Sat Jun 25 01:23:49 2022 From: dlong at openjdk.org (Dean Long) Date: Sat, 25 Jun 2022 01:23:49 GMT Subject: RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 09:23:26 GMT, Ron Pressler wrote: > Please review the following bug fix: > > `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. > > Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. > > This change does three things: > > 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. > 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. > 3. In interp_only_mode, the c2i stub will not patch the callsite. > > This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 > > > Passes tiers 1-4 and Loom tiers 1-5. 
The get_c2i_entry change seems safe enough, but the lock rank change and the code patching changes seem a little risky for jdk19. I'm going to suggest some folks more familiar with handshakes, compiled ICs, and lock ranking to also look at this. ------------- PR: https://git.openjdk.org/jdk19/pull/66 From rpressler at openjdk.org Sat Jun 25 01:23:52 2022 From: rpressler at openjdk.org (Ron Pressler) Date: Sat, 25 Jun 2022 01:23:52 GMT Subject: RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v2] In-Reply-To: References: <7PH9bVJW-6hL7tAz5TYvk6qu9RfUPGaLLgkbwnkS3U8=.40f89a35-f43f-4826-981c-7a6a36e7b42e@github.com> Message-ID: On Sat, 25 Jun 2022 00:42:14 GMT, Dean Long wrote: >> src/hotspot/share/code/compiledIC.cpp line 591: >> >>> 589: return true; >>> 590: } >>> 591: >> >> While at it, I noticed this comment, which appears to be out of date. > > I read that comment more as a warning, in case in the future someone wondered why we don't reset the stub here, and tried to add it. So I would leave it. I think it's still useful. ok, reverted. ------------- PR: https://git.openjdk.org/jdk19/pull/66 From rpressler at openjdk.org Sat Jun 25 01:23:53 2022 From: rpressler at openjdk.org (Ron Pressler) Date: Sat, 25 Jun 2022 01:23:53 GMT Subject: RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v2] In-Reply-To: References: Message-ID: On Sat, 25 Jun 2022 01:14:25 GMT, Dean Long wrote: >> Ron Pressler has updated the pull request incrementally with one additional commit since the last revision: >> >> Revert "Remove outdated comment" >> >> This reverts commit 8f571d76e34bc64ceb31894184fba4b909e8fbfe. > > src/hotspot/share/runtime/mutexLocker.cpp line 287: > >> 285: def(JfieldIdCreation_lock , PaddedMutex , safepoint); >> 286: >> 287: def(CompiledIC_lock , PaddedMutex , nosafepoint-1); // locks VtableStubs_lock, InlineCacheBuffer_lock > > Please explain. 
Is there another lock causing problems? The handshake lock, which is also nosafepoint. ------------- PR: https://git.openjdk.org/jdk19/pull/66 From iveresov at openjdk.org Sat Jun 25 04:43:26 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Sat, 25 Jun 2022 04:43:26 GMT Subject: RFR: 8289069: Very slow C1 arraycopy jcstress tests after JDK-8279886Size bitmaps appropriately Message-ID: I used BlockBegin::number_of_blocks() to size the bitmaps, however that is a total number of blocks. Since mark_loops() is called after every inlining (for every inlinee - no need to reanalyze the whole method), the bitmaps get progressively larger, and have to be zeroed. That makes the complexity quadratic. The solution is to appropriately size the bitmaps and keep the whole thing linear. Tests look good. ------------- Commit messages: - Size bitmaps appropriately Changes: https://git.openjdk.org/jdk19/pull/72/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=72&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289069 Stats: 31 lines in 1 file changed: 10 ins; 0 del; 21 mod Patch: https://git.openjdk.org/jdk19/pull/72.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/72/head:pull/72 PR: https://git.openjdk.org/jdk19/pull/72 From haosun at openjdk.org Mon Jun 27 00:48:53 2022 From: haosun at openjdk.org (Hao Sun) Date: Mon, 27 Jun 2022 00:48:53 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v5] In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 07:34:57 GMT, Dean Long wrote: >> The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out. 
> > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/compiler/codegen/ShiftByZero.java > > Co-authored-by: Hao Sun Marked as reviewed by haosun (Author). ------------- PR: https://git.openjdk.org/jdk19/pull/40 From xgong at openjdk.org Mon Jun 27 01:42:40 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 27 Jun 2022 01:42:40 GMT Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations In-Reply-To: References: Message-ID: <7ZzX5PQiQYJPNIt-2bhMEC9XliUiWtIN42UcS3YAd8k=.db4e50a5-2d17-4a4c-8eb6-23d68a4c486e@github.com> On Mon, 20 Jun 2022 07:50:09 GMT, Xiaohong Gong wrote: > This patch adds the following transformations for vector logic operations such as "`AndV, OrV, XorV`", incuding: > > (AndV v (Replicate m1)) => v > (AndV v (Replicate zero)) => Replicate zero > (AndV v v) => v > > (OrV v (Replicate m1)) => Replicate m1 > (OrV v (Replicate zero)) => v > (OrV v v) => v > > (XorV v v) => Replicate zero > > where "`m1`" is the integer constant -1, together with the same optimizations for vector mask operations like "`AndVMask, OrVMask, XorVMask`". Hi, could anyone please help to take a look at this simple patch? Thanks a lot for your time! ------------- PR: https://git.openjdk.org/jdk/pull/9211 From xgong at openjdk.org Mon Jun 27 01:44:37 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 27 Jun 2022 01:44:37 GMT Subject: RFR: 8287984: AArch64: [vector] Make all bits set vector sharable for match rules Message-ID: We have the optimized rules for vector not/and_not in NEON and SVE, like: match(Set dst (XorV src (ReplicateB m1))) ; vector not match(Set dst (AndV src1 (XorV src2 (ReplicateB m1)))) ; vector and_not where "`m1`" is a ConI node with value -1. 
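For readers following the two match rules above: both rely on the identity that XOR with an all-ones value is a bitwise NOT. The Java sketch below writes out the scalar loops that correspond lane by lane to the `vector not` and `vector and_not` patterns. The class and method names are illustrative assumptions and are not taken from the patch, and whether C2 actually vectorizes these loops depends on the platform and flags.

```java
// Illustrative only: class and method names are assumptions, not part of the patch.
class VectorNotSketch {
    // Scalar form of (XorV src (Replicate m1)): XOR with -1 flips every bit.
    static int[] not(int[] src) {
        int[] dst = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i] ^ -1;
        }
        return dst;
    }

    // Scalar form of (AndV src1 (XorV src2 (Replicate m1))): a & ~b.
    static int[] andNot(int[] a, int[] b) {
        int[] dst = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            dst[i] = a[i] & (b[i] ^ -1);
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] a = {0, 1, -1, 0x55AA55AA, Integer.MIN_VALUE};
        int[] b = {-1, 3, 0, 0x0F0F0F0F, Integer.MAX_VALUE};
        int[] n = not(a);
        int[] an = andNot(a, b);
        // Check each lane against the plain ~ operator.
        for (int i = 0; i < a.length; i++) {
            if (n[i] != ~a[i] || an[i] != (a[i] & ~b[i])) {
                throw new AssertionError("identity violated at lane " + i);
            }
        }
        System.out.println("identities hold");
    }
}
```

Because the all-ones operand is the same in both loops, it is plausible that a single `Replicate` node ends up feeding several users, which is the sharing situation this thread is about.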
And we also have the similar rules for vector mask in SVE like: match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1)))) ; mask and_not These rules are not easy to be matched since the "`Replicate`" or "`MaskAll`" node is usually not single used for the `not/and_not` operation. To make these rules be matched as expected, this patch adds the vector (mask) "`not`" pattern to `Matcher::pd_clone_node()` which makes the all bits set vector `(Replicate/MaskAll)` sharable during matching rules. ------------- Commit messages: - 8287984: AArch64: [vector] Make all bits set vector sharable for match rules Changes: https://git.openjdk.org/jdk/pull/9292/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9292&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287984 Stats: 138 lines in 3 files changed: 127 ins; 1 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/9292.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9292/head:pull/9292 PR: https://git.openjdk.org/jdk/pull/9292 From njian at openjdk.org Mon Jun 27 05:11:55 2022 From: njian at openjdk.org (Ningsheng Jian) Date: Mon, 27 Jun 2022 05:11:55 GMT Subject: RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v5] In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 07:34:57 GMT, Dean Long wrote: >> The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/compiler/codegen/ShiftByZero.java > > Co-authored-by: Hao Sun Looks good to me. ------------- Marked as reviewed by njian (Committer). 
PR: https://git.openjdk.org/jdk19/pull/40 From chagedorn at openjdk.org Mon Jun 27 07:04:55 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 27 Jun 2022 07:04:55 GMT Subject: RFR: 8288897: Clean up node dump code [v2] In-Reply-To: References: Message-ID: <6IwOmnPSo59TyFBjrN_xGvQCoOL-0z7KmGLJ0WD64jI=.77dec7b9-7cff-4c57-b748-5267908f4d37@github.com> On Fri, 24 Jun 2022 13:47:24 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/node.cpp line 1658: >> >>> 1656: } >>> 1657: >>> 1658: void find_node_by_dump(Node* start, const char* pattern) { >> >> Since we now only dump nodes, how about renaming this method to `dump_nodes_by_dump()` and only keep `find_node(s?)_by_dump()` to call it from the debugger? Same for the other changed `find*()` methods. > > I did think about renaming it do `dump_...`. But then I also find it important that the name says that we do filter / search / find. The filter/search action is probably implied but I don't have a strong opinion about it - it's fine to leave the name like that. But I suggest to make it plural (`find_nodes_by_dump()`) as we are possibly returning multiple nodes. ------------- PR: https://git.openjdk.org/jdk/pull/9234 From tholenstein at openjdk.org Mon Jun 27 07:32:09 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 27 Jun 2022 07:32:09 GMT Subject: RFR: JDK-8288750: IGV: Improve Shortcuts [v6] In-Reply-To: References: Message-ID: > *Improvement of keyboard shortcuts in IGV under macOS:*. > Certain keyboard/mouse shortcuts do not work under macOS. E.g. `Ctrl + left-click` to select multiple nodes. The reason is that this keyboard shortcut is hardwired as a right-click under macOS and cannot be easily changed in the operating system. In general, the macOS user manual recommends using "Command/Meta" as a modifier key instead of "Control." > > *Fixed focus of the Graph Tab:*. > In IGV, shortcuts are linked to a component. 
Components are for example a Graph Tab, "Outline", "Filters", "Bytecode", "Control Flow" and "Properties". Shortcuts only work if the linked component is in focus. The focus can be changed with `Ctrl + TAB` or by clicking into the TAB component. The Graph Tab did not get the focus back when the user clicked on it. This needed to be fixed. > > *Fixing QuickSearch:* > Netbeans' QuickSearchAction is a global component of which only one common instance exists. IGV used a workaround to repaint the search bar in a new graphics tab. On macOS, the search bar doubled in size with each new Graph Tab. In addition, keyboard shortcuts for the search bar did not work. This issue was fixed by adding the search bar whenever the tab gained focus, and removing it (by default) when a new tab gained focus. This way, no workaround is required, and the size and ability to use a keyboard shortcut are fixed. > > *Adding new actions to expand/shrink the difference selection:*. > The user can expand/reduce the difference selection by moving the beginning/end of the selection with the mouse. > ![diff](https://user-images.githubusercontent.com/71546117/175498713-df3c76e8-9945-4e1c-8cab-36d9d4ee64c1.png) > This is something many users didn't know. Therefore two new buttons should make it more clear for the user that this functionality exists. > ![actions](https://user-images.githubusercontent.com/71546117/175498464-9e88e7d8-36df-4506-a801-a8d102d6bc4a.png) > By adding these buttons we can now also add keyboard shortcuts to expand/reduce the difference selection.
> > **Fixed shortcuts for:** > - Add a single node in the graph to selection (`Ctrl/Cmd + left-click`) > - Add a multiple node in the graph to selection (`Ctrl/Cmd + left-click-drag`) > - Zoom in and out (`Ctrl/Cmd + mouse-wheel`) > > **Added new shortcuts for:** > - Search (`Ctrl/Cmd - I` and `Ctrl/Cmd - F`) > - Undo (`Ctrl/Cmd - Z`) > - Redo (`Ctrl/Cmd - Y` and `Ctrl/Cmd - Shift - Z`) > - Show Next Graph (`Ctrl/Cmd - RIGHT`) > - Expand the difference selection (`Ctrl/Cmd - UP` and `Ctrl/Cmd - Shift - RIGHT`) > - Reduce the difference selection (`Ctrl/Cmd - DOWN` and `Ctrl/Cmd - Shift - LEFT`) > - Show Previous Graph (`Ctrl/Cmd - LEFT`) > - Show satellite view (`Hold S`) Tobias Holenstein has updated the pull request incrementally with five additional commits since the last revision: - Update src/utils/IdealGraphVisualizer/Coordinator/src/main/java/com/sun/hotspot/igv/coordinator/actions/ImportAction.java code style Co-authored-by: Christian Hagedorn - Update src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/CustomSelectAction.java code style Co-authored-by: Christian Hagedorn - Update src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/CustomSelectAction.java code style Co-authored-by: Christian Hagedorn - Update src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/OverviewAction.java added missing ")" Co-authored-by: Christian Hagedorn - Update src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/EditorTopComponent.java code style Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9260/files - new: https://git.openjdk.org/jdk/pull/9260/files/42653123..c9dbe843 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=04-05 Stats: 8 lines in 4 files changed: 3 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/9260.diff 
Fetch: git fetch https://git.openjdk.org/jdk pull/9260/head:pull/9260 PR: https://git.openjdk.org/jdk/pull/9260 From tholenstein at openjdk.org Mon Jun 27 07:32:11 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 27 Jun 2022 07:32:11 GMT Subject: RFR: JDK-8288750: IGV: Improve Shortcuts [v5] In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 12:03:14 GMT, Christian Hagedorn wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> added missing files > > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/EditorTopComponent.java line 347: > >> 345: centerPanel.getActionMap().put("showSatellite", >> 346: new AbstractAction("showSatellite") { >> 347: @Override public void actionPerformed(ActionEvent e) { > > Suggestion: > > @Override > public void actionPerformed(ActionEvent e) { fixed ------------- PR: https://git.openjdk.org/jdk/pull/9260
From tholenstein at openjdk.org Mon Jun 27 08:02:04 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 27 Jun 2022 08:02:04 GMT Subject: RFR: JDK-8288750: IGV: Improve Shortcuts [v7] In-Reply-To: References: Message-ID: Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: code style ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9260/files - new: https://git.openjdk.org/jdk/pull/9260/files/c9dbe843..fed782ad Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=05-06 Stats: 29 lines in 6 files changed: 0 ins; 20 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/9260.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9260/head:pull/9260 PR: https://git.openjdk.org/jdk/pull/9260
From tholenstein at openjdk.org Mon Jun 27 08:11:55 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 27 Jun 2022 08:11:55 GMT Subject: RFR: JDK-8288750: IGV: Improve Shortcuts [v8] In-Reply-To: References: Message-ID: Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: fix whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9260/files - new: https://git.openjdk.org/jdk/pull/9260/files/fed782ad..c80ee6d7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=06-07 Stats: 3 lines in 2 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/9260.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9260/head:pull/9260 PR: https://git.openjdk.org/jdk/pull/9260 From eosterlund at openjdk.org Mon Jun 27 08:29:31 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Mon, 27 Jun 2022 08:29:31 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v2] In-Reply-To: References: Message-ID: On Sat, 25 Jun 2022 01:12:57 GMT, Dean Long wrote: >> Ron Pressler has updated the pull request incrementally with one additional commit since the last revision: >> >> Revert "Remove outdated comment" >> >> This reverts commit 8f571d76e34bc64ceb31894184fba4b909e8fbfe.
> > src/hotspot/share/code/compiledMethod.cpp line 464: > >> 462: while(iter.next()) { >> 463: if (iter.type() == relocInfo::static_call_type) { >> 464: iter.reloc()->clear_inline_cache(); > > This relies on code patching, and for correctness the change must be seen by the thread requesting interpreter-only mode. If this was being done at a safepoint then it would probably be OK. However, this code seems to be done using a handshake, so I'm not sure if the required serializing instruction is guaranteed to happen (see JDK-8220351). Maybe this race is OK, as it seems no worse than the scenario described in the description where another thread resets the call site back to the optimized state. The change is not guaranteed to be seen on a concurrent thread, until the next global handshake operation completes. ------------- PR: https://git.openjdk.org/jdk19/pull/66 From rpressler at openjdk.org Mon Jun 27 08:29:32 2022 From: rpressler at openjdk.org (Ron Pressler) Date: Mon, 27 Jun 2022 08:29:32 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v2] In-Reply-To: References: Message-ID: On Mon, 27 Jun 2022 08:17:01 GMT, Erik Österlund wrote: >> src/hotspot/share/code/compiledMethod.cpp line 464: >> >>> 462: while(iter.next()) { >>> 463: if (iter.type() == relocInfo::static_call_type) { >>> 464: iter.reloc()->clear_inline_cache(); >> >> This relies on code patching, and for correctness the change must be seen by the thread requesting interpreter-only mode. If this was being done at a safepoint then it would probably be OK. However, this code seems to be done using a handshake, so I'm not sure if the required serializing instruction is guaranteed to happen (see JDK-8220351). Maybe this race is OK, as it seems no worse than the scenario described in the description where another thread resets the call site back to the optimized state.
> > The change is not guaranteed to be seen on a concurrent thread, until the next global handshake operation completes. If that concurrent thread is in interp_only_mode, it also would have done the same patching. And if it isn't, then it's okay for it not to see this, but if it does see it, it will re-patch to compiled in c2i, as in the description. ------------- PR: https://git.openjdk.org/jdk19/pull/66 From thartmann at openjdk.org Mon Jun 27 08:51:40 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 27 Jun 2022 08:51:40 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v14] In-Reply-To: <_6iPSDvWGj8uGcVNGdwhRBa23bCVOVaMsUhY0crvxYM=.112ba1de-6a1a-417c-8446-3413a6ab8157@github.com> References: <_6iPSDvWGj8uGcVNGdwhRBa23bCVOVaMsUhY0crvxYM=.112ba1de-6a1a-417c-8446-3413a6ab8157@github.com> Message-ID: On Thu, 23 Jun 2022 23:08:20 GMT, Xin Liu wrote: >> I found that some phi nodes are useful because their uses are uncommon_trap nodes. In the worst case, it would hinder boxing object elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. >> >> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). The two patches are orthogonal. I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of the credit goes to the original author. I found that scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in the original test even if it can't be removed by the later inliner.
I tweaked the profiling data of Integer.valueOf() to hinder scalar replacement. >> >> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost the same as JDK-8261137. Besides runtime, the codecache utilization reduces from 1648 bytes to 1192 bytes, or 27.6%. >> >> Before: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op >> >> Compiled method (c2) 281 636 4 MyBenchmark::testMethod (50 bytes) >> total in heap [0x00007fa1e49ab510,0x00007fa1e49abb80] = 1648 >> relocation [0x00007fa1e49ab670,0x00007fa1e49ab6b0] = 64 >> main code [0x00007fa1e49ab6c0,0x00007fa1e49ab940] = 640 >> stub code [0x00007fa1e49ab940,0x00007fa1e49ab968] = 40 >> oops [0x00007fa1e49ab968,0x00007fa1e49ab978] = 16 >> metadata [0x00007fa1e49ab978,0x00007fa1e49ab990] = 24 >> scopes data [0x00007fa1e49ab990,0x00007fa1e49aba60] = 208 >> scopes pcs [0x00007fa1e49aba60,0x00007fa1e49abb30] = 208 >> dependencies [0x00007fa1e49abb30,0x00007fa1e49abb38] = 8 >> handler table [0x00007fa1e49abb38,0x00007fa1e49abb68] = 48 >> nul chk table [0x00007fa1e49abb68,0x00007fa1e49abb80] = 24 >> >> After: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 3.656 ±
0.006 us/op >> >> Compiled method (c2) 288 633 4 MyBenchmark::testMethod (50 bytes) >> total in heap [0x00007f35189ab010,0x00007f35189ab4b8] = 1192 >> relocation [0x00007f35189ab170,0x00007f35189ab1a0] = 48 >> main code [0x00007f35189ab1a0,0x00007f35189ab360] = 448 >> stub code [0x00007f35189ab360,0x00007f35189ab388] = 40 >> oops [0x00007f35189ab388,0x00007f35189ab390] = 8 >> metadata [0x00007f35189ab390,0x00007f35189ab398] = 8 >> scopes data [0x00007f35189ab398,0x00007f35189ab408] = 112 >> scopes pcs [0x00007f35189ab408,0x00007f35189ab488] = 128 >> dependencies [0x00007f35189ab488,0x00007f35189ab490] = 8 >> handler table [0x00007f35189ab490,0x00007f35189ab4a8] = 24 >> nul chk table [0x00007f35189ab4a8,0x00007f35189ab4b8] = 16 >> ``` >> >> Testing >> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. > > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > remove _path from UnstableIfTrap. remember _next_bci(int) is enough. All tests passed. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/8545 From roland at openjdk.org Mon Jun 27 08:51:40 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 27 Jun 2022 08:51:40 GMT Subject: [jdk19] RFR: 8288683: C2: And node gets wrong type due to not adding it back to the worklist in CCP In-Reply-To: <2AOZ4GZDzfj8-MJD_pKJA0ZjnWqqRJCsN5ZCm4O2384=.0a3cc07e-148a-4eab-a2d3-5da04146ba1f@github.com> References: <2AOZ4GZDzfj8-MJD_pKJA0ZjnWqqRJCsN5ZCm4O2384=.0a3cc07e-148a-4eab-a2d3-5da04146ba1f@github.com> Message-ID: <3V6AOoANV6RlH1l5llKv-TPZr_pd0Hncn8Ro3a8VsYo=.86a68cff-6d98-4c5b-aba4-36d391a3b17f@github.com> On Fri, 24 Jun 2022 08:54:03 GMT, Christian Hagedorn wrote: > [JDK-8277850](https://bugs.openjdk.org/browse/JDK-8277850) added some new optimizations in `AndI/L::Value()` to optimize patterns similar to `(v << 2) & 3` which can directly be replaced by zero if the mask and the shifted value are bitwise disjoint. To do that, we look at the type of the shift value of the `LShift` input (right-hand side input): > https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/mulnode.cpp#L1752-L1765 > > The optimization as such works fine but there is a problem in CCP.
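As a small illustration of the "bitwise disjoint" condition described above, here is a Java sketch. It is not the `AndI/L::Value()` code; the class and method names are made up for this example. It only shows why `(v << 2) & 3` is always zero, and why the same conclusion does not hold for a different shift amount.

```java
// Illustrative only: names are assumptions, not HotSpot code.
class ShiftMaskSketch {
    // True if mask has no bit at or above position (shift & 31), so it can
    // never overlap a value shifted left by that amount. Java uses only the
    // low five bits of an int shift count.
    static boolean disjoint(int mask, int shift) {
        int s = shift & 31;
        return (mask & (-1 << s)) == 0;
    }

    static int shiftAndMask(int v, int shift, int mask) {
        return (v << shift) & mask;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int v : samples) {
            // Mask 3 only has bits 0 and 1; a shift by 2 clears both.
            if (disjoint(3, 2) && shiftAndMask(v, 2, 3) != 0) {
                throw new AssertionError("expected zero for v=" + v);
            }
        }
        // With a shift of 1, bit 1 of the mask overlaps, so the result need not be zero.
        System.out.println(disjoint(3, 1) + " " + shiftAndMask(1, 1, 3));
    }
}
```

The folding to zero is only valid for the shift amount it was checked against. If the known range of the shift amount changes later, the check has to be repeated, which appears to be the situation the CCP discussion in this thread is concerned with.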
After calling `Value()` for a node in CCP, we generally only add the direct users of it back to the worklist if the type changed: > https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/phaseX.cpp#L1812-L1814 > > We special case some nodes where we need to add additional nodes (grandchildren or even further down) back to the worklist to not miss updating them, for example, the `Phis` when the use is a `Region`: > https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/phaseX.cpp#L1789-L1796 > > However, we fail to special case an `LShift` use if the shift value (right-hand side input of the `LShift`) changed. We should add all `AndI/L` nodes back to the worklist to account for the `AndI/L::Value()` optimization. Not doing so can result in a wrong execution as shown with the testcase. We have the following nodes: > ![Screenshot from 2022-06-24 10-28-41](https://user-images.githubusercontent.com/17833009/175496296-4280e26b-6f2f-4ddc-b164-b9e887a5d437.png) > > The `LShiftI` node gets `int` as type (i.e. bottom) and is not put back on the worklist again since the type cannot improve anymore. Afterwards, we process the `AndI` node and call `AndI::Value()`. At this point, the phi node still has the temporary type `int:62`. We apply the optimization that the shifted number and the mask are bitwise disjoint and we set the type of the `AndI` node to `int:0`.
When later reapplying `Phi::Value()` for the phi node, we correct the type to `int:62..69` and try to push the `LShiftI` node use back to the worklist. Since its type is `int`, we do not add it again. At this point, `AndI` is not on the CCP worklist anymore and neither will we push the `AndI` node to it again. We miss to reapply `AndI::Value()` and correct the now wrong `Value()` optimization. We keep `int:0` as type and replace the `AndI` node by constant zero - leading to a wrong execution. > > Special casing `LShift` -> `AndNodes` in CCP fixes the problem to make sure we reapply `AndI/L::Value()` again. I've applied some more refactorings and comment improvements but since this fix is for JDK 19, I've decided to separate them into an RFE ([JDK-8289051](https://bugs.openjdk.org/browse/JDK-8289051)) to reduce the risk. > > At some point, we should add some additional verification to find missed `Value()` calls in CCP to avoid similar problems in the future (see [JDK-8257197](https://bugs.openjdk.org/browse/JDK-8257197)). > > Thanks, > Christian Looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR: https://git.openjdk.org/jdk19/pull/65 From tholenstein at openjdk.org Mon Jun 27 08:59:18 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 27 Jun 2022 08:59:18 GMT Subject: RFR: JDK-8287094: IGV: show node input numbers in edge tooltips [v4] In-Reply-To: References: Message-ID: > For nodes with many inputs, such as safepoints, it is difficult and error-prone to figure out the exact input number of a given incoming edge. 
> > Extend the Edge Tooltips to include the input number of the destination node: > **Before** `91 Addl -> 92 SafePoint` > **Now** `91 Addl -> 92 SafePoint [NR]` > ![edge](https://user-images.githubusercontent.com/71546117/175506945-6f5137d2-7647-4acb-a135-8fcb719df3e6.png) Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: Update src/utils/IdealGraphVisualizer/Graph/src/main/java/com/sun/hotspot/igv/graph/FigureConnection.java String concatenation Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9273/files - new: https://git.openjdk.org/jdk/pull/9273/files/2b9545d8..53585918 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9273&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9273&range=02-03 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9273.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9273/head:pull/9273 PR: https://git.openjdk.org/jdk/pull/9273 From roland at openjdk.org Mon Jun 27 09:17:58 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 27 Jun 2022 09:17:58 GMT Subject: RFR: 8287227: Shenandoah: A couple of virtual thread tests failed with iu mode even without Loom enabled. [v3] In-Reply-To: References: Message-ID: > With JDK-8277654, the load barrier slow path call doesn't produce raw > memory anymore but the IU barrier call still does. I propose removing > raw memory for that call too which also causes the assert that fails > to be removed. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains six additional commits since the last revision: - review - Merge branch 'master' into JDK-8287227 - new fix - Merge branch 'master' into JDK-8287227 - Revert "fix" This reverts commit aa6f80a7883ee7032f81dbffac5d0257491d7118. - fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/8958/files - new: https://git.openjdk.org/jdk/pull/8958/files/5699e042..5675b766 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=8958&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=8958&range=01-02 Stats: 26420 lines in 785 files changed: 21213 ins; 2466 del; 2741 mod Patch: https://git.openjdk.org/jdk/pull/8958.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8958/head:pull/8958 PR: https://git.openjdk.org/jdk/pull/8958 From rkennke at openjdk.org Mon Jun 27 10:13:01 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Mon, 27 Jun 2022 10:13:01 GMT Subject: RFR: 8287227: Shenandoah: A couple of virtual thread tests failed with iu mode even without Loom enabled. [v3] In-Reply-To: References: Message-ID: On Mon, 27 Jun 2022 09:17:58 GMT, Roland Westrelin wrote: >> With JDK-8277654, the load barrier slow path call doesn't produce raw >> memory anymore but the IU barrier call still does. I propose removing >> raw memory for that call too which also causes the assert that fails >> to be removed. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - review > - Merge branch 'master' into JDK-8287227 > - new fix > - Merge branch 'master' into JDK-8287227 > - Revert "fix" > > This reverts commit aa6f80a7883ee7032f81dbffac5d0257491d7118. > - fix Marked as reviewed by rkennke (Reviewer). 
------------- PR: https://git.openjdk.org/jdk/pull/8958 From thartmann at openjdk.org Mon Jun 27 10:59:46 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 27 Jun 2022 10:59:46 GMT Subject: [jdk19] RFR: 8288683: C2: And node gets wrong type due to not adding it back to the worklist in CCP In-Reply-To: <2AOZ4GZDzfj8-MJD_pKJA0ZjnWqqRJCsN5ZCm4O2384=.0a3cc07e-148a-4eab-a2d3-5da04146ba1f@github.com> References: <2AOZ4GZDzfj8-MJD_pKJA0ZjnWqqRJCsN5ZCm4O2384=.0a3cc07e-148a-4eab-a2d3-5da04146ba1f@github.com> Message-ID: On Fri, 24 Jun 2022 08:54:03 GMT, Christian Hagedorn wrote: > [JDK-8277850](https://bugs.openjdk.org/browse/JDK-8277850) added some new optimizations in `AndI/L::Value()` to optimize patterns similar to `(v << 2) & 3` which can directly be replaced by zero if the mask and the shifted value are bitwise disjoint. To do that, we look at the type of the shift value of the `LShift` input (right-hand side input): > https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/mulnode.cpp#L1752-L1765 > > The optimization as such works fine but there is a problem in CCP.
After calling `Value()` for a node in CCP, we generally only add the direct users of it back to the worklist if the type changed: > https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/phaseX.cpp#L1812-L1814 > > We special case some nodes where we need to add additional nodes (grandchildren or even further down) back to the worklist to not miss updating them, for example, the `Phis` when the use is a `Region`: > https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/phaseX.cpp#L1789-L1796 > > However, we fail to special-case an `LShift` use if the shift value (right-hand side input of the `LShift`) changed. We should add all `AndI/L` nodes back to the worklist to account for the `AndI/L::Value()` optimization. Not doing so can result in a wrong execution as shown with the testcase. We have the following nodes: > ![Screenshot from 2022-06-24 10-28-41](https://user-images.githubusercontent.com/17833009/175496296-4280e26b-6f2f-4ddc-b164-b9e887a5d437.png) > > The `LShiftI` node gets `int` as type (i.e. bottom) and is not put back on the worklist again since the type cannot improve anymore. Afterwards, we process the `AndI` node and call `AndI::Value()`. At this point, the phi node still has the temporary type `int:62`. We apply the optimization that the shifted number and the mask are bitwise disjoint and we set the type of the `AndI` node to `int:0`.
When later reapplying `Phi::Value()` for the phi node, we correct the type to `int:62..69` and try to push the `LShiftI` node use back to the worklist. Since its type is `int`, we do not add it again. At this point, `AndI` is not on the CCP worklist anymore and neither will we push the `AndI` node to it again. We miss to reapply `AndI::Value()` and correct the now wrong `Value()` optimization. We keep `int:0` as type and replace the `AndI` node by constant zero - leading to a wrong execution. > > Special casing `LShift` -> `AndNodes` in CCP fixes the problem to make sure we reapply `AndI/L::Value()` again. I've applied some more refactorings and comment improvements but since this fix is for JDK 19, I've decided to separate them into an RFE ([JDK-8289051](https://bugs.openjdk.org/browse/JDK-8289051)) to reduce the risk. > > At some point, we should add some additional verification to find missed `Value()` calls in CCP to avoid similar problems in the future (see [JDK-8257197](https://bugs.openjdk.org/browse/JDK-8257197)). > > Thanks, > Christian Looks good! test/hotspot/jtreg/compiler/c2/TestAndShiftZeroCCP.java line 28: > 26: * @bug 8288683 > 27: * @library /test/lib > 28: * @summary Test that And nodes are added to the CCP worklist if it has an LShift as input. Suggestion: * @summary Test that And nodes are added to the CCP worklist if they have an LShift as input. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk19/pull/65 From thartmann at openjdk.org Mon Jun 27 11:04:40 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 27 Jun 2022 11:04:40 GMT Subject: RFR: JDK-8288750: IGV: Improve Shortcuts [v8] In-Reply-To: References: Message-ID: On Mon, 27 Jun 2022 08:11:55 GMT, Tobias Holenstein wrote: >> *Improvement of keyboard shortcuts in IGV under macOS:*. >> Certain keyboard/mouse shortcuts do not work under macOS. E.g. `Ctrl + left-click` to select multiple nodes. 
The reason is that this keyboard shortcut is hardwired as a right-click under macOS and cannot be easily changed in the operating system. In general, the macOS user manual recommends using "Command/Meta" as a modifier key instead of "Control." >> >> *Fixed focus of the Graph Tab:* >> In IGV, shortcuts are linked to a component. Components are for example a Graph Tab, "Outline", "Filters", "Bytecode", "Control Flow" and "Properties". Shortcuts only work if the linked component is in focus. The focus can be changed with `Ctrl + TAB` or by clicking into the TAB component. The Graph Tab did not get the focus back when the user clicked on it. This needed to be fixed. >> >> *Fixing QuickSearch:* >> Netbeans' QuickSearchAction is a global component of which only one common instance exists. IGV used a workaround to repaint the search bar in a new graphics tab. On macOS, the search bar doubled in size with each new Graph Tab. In addition, keyboard shortcuts for the search bar did not work. This issue was fixed by adding the search bar whenever the tab gained focus, and removing it (by default) when a new tab gained focus. This way, no workaround is required, and the size and ability to use a keyboard shortcut are fixed. >> >> *Adding new actions to expand/shrink the difference selection:* >> The user can expand/reduce the difference selection by moving the beginning/end of the selection with the mouse. >> ![diff](https://user-images.githubusercontent.com/71546117/175498713-df3c76e8-9945-4e1c-8cab-36d9d4ee64c1.png) >> This is something many users didn't know. Therefore two new buttons should make it clearer to the user that this functionality exists.
>> ![actions](https://user-images.githubusercontent.com/71546117/175498464-9e88e7d8-36df-4506-a801-a8d102d6bc4a.png) >> By adding these buttons, we can now also add keyboard shortcuts to expand/reduce the difference selection. >> >> **Fixed shortcuts for:** >> - Add a single node in the graph to the selection (`Ctrl/Cmd + left-click`) >> - Add multiple nodes in the graph to the selection (`Ctrl/Cmd + left-click-drag`) >> - Zoom in and out (`Ctrl/Cmd + mouse-wheel`) >> >> **Added new shortcuts for:** >> - Search (`Ctrl/Cmd - I` and `Ctrl/Cmd - F`) >> - Undo (`Ctrl/Cmd - Z`) >> - Redo (`Ctrl/Cmd - Y` and `Ctrl/Cmd - Shift - Z`) >> - Show Next Graph (`Ctrl/Cmd - RIGHT`) >> - Expand the difference selection (`Ctrl/Cmd - UP` and `Ctrl/Cmd - Shift - RIGHT`) >> - Reduce the difference selection (`Ctrl/Cmd - DOWN` and `Ctrl/Cmd - Shift - LEFT`) >> - Show Previous Graph (`Ctrl/Cmd - LEFT`) >> - Show satellite view (`Hold S`) > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > fix whitespace Looks reasonable to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9260 From thartmann at openjdk.org Mon Jun 27 11:05:51 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 27 Jun 2022 11:05:51 GMT Subject: RFR: JDK-8287094: IGV: show node input numbers in edge tooltips [v4] In-Reply-To: References: Message-ID: On Mon, 27 Jun 2022 08:59:18 GMT, Tobias Holenstein wrote: >> For nodes with many inputs, such as safepoints, it is difficult and error-prone to figure out the exact input number of a given incoming edge.
>> >> Extend the Edge Tooltips to include the input number of the destination node: >> **Before** `91 Addl -> 92 SafePoint` >> **Now** `91 Addl -> 92 SafePoint [NR]` >> ![edge](https://user-images.githubusercontent.com/71546117/175506945-6f5137d2-7647-4acb-a135-8fcb719df3e6.png) > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > Update src/utils/IdealGraphVisualizer/Graph/src/main/java/com/sun/hotspot/igv/graph/FigureConnection.java > > String concatenation > > Co-authored-by: Christian Hagedorn Looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9273 From chagedorn at openjdk.org Mon Jun 27 11:24:40 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 27 Jun 2022 11:24:40 GMT Subject: RFR: JDK-8287094: IGV: show node input numbers in edge tooltips [v4] In-Reply-To: References: Message-ID: On Mon, 27 Jun 2022 08:59:18 GMT, Tobias Holenstein wrote: >> For nodes with many inputs, such as safepoints, it is difficult and error-prone to figure out the exact input number of a given incoming edge.
>> >> Extend the Edge Tooltips to include the input number of the destination node: >> **Before** `91 Addl -> 92 SafePoint` >> **Now** `91 Addl -> 92 SafePoint [NR]` >> ![edge](https://user-images.githubusercontent.com/71546117/175506945-6f5137d2-7647-4acb-a135-8fcb719df3e6.png) > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > Update src/utils/IdealGraphVisualizer/Graph/src/main/java/com/sun/hotspot/igv/graph/FigureConnection.java > > String concatenation > > Co-authored-by: Christian Hagedorn Marked as reviewed by chagedorn (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9273 From chagedorn at openjdk.org Mon Jun 27 11:35:56 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 27 Jun 2022 11:35:56 GMT Subject: [jdk19] RFR: 8288683: C2: And node gets wrong type due to not adding it back to the worklist in CCP [v2] In-Reply-To: <2AOZ4GZDzfj8-MJD_pKJA0ZjnWqqRJCsN5ZCm4O2384=.0a3cc07e-148a-4eab-a2d3-5da04146ba1f@github.com> References: <2AOZ4GZDzfj8-MJD_pKJA0ZjnWqqRJCsN5ZCm4O2384=.0a3cc07e-148a-4eab-a2d3-5da04146ba1f@github.com> Message-ID: > [JDK-8277850](https://bugs.openjdk.org/browse/JDK-8277850) added some new optimizations in `AndI/L::Value()` to optimize patterns similar to `(v << 2) & 3` which can directly be replaced by zero if the mask and the shifted value are bitwise disjoint.
To do that, we look at the type of the shift value of the `LShift` input (right-hand side input): > https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/mulnode.cpp#L1752-L1765 > > The optimization as such works fine but there is a problem in CCP. After calling `Value()` for a node in CCP, we generally only add the direct users of it back to the worklist if the type changed: > https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/phaseX.cpp#L1812-L1814 > > We special case some nodes where we need to add additional nodes (grandchildren or even further down) back to the worklist to not miss updating them, for example, the `Phis` when the use is a `Region`: > https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/phaseX.cpp#L1789-L1796 > > However, we fail to special-case an `LShift` use if the shift value (right-hand side input of the `LShift`) changed. We should add all `AndI/L` nodes back to the worklist to account for the `AndI/L::Value()` optimization. Not doing so can result in a wrong execution as shown with the testcase.
We have the following nodes: > ![Screenshot from 2022-06-24 10-28-41](https://user-images.githubusercontent.com/17833009/175496296-4280e26b-6f2f-4ddc-b164-b9e887a5d437.png) > > The `LShiftI` node gets `int` as type (i.e. bottom) and is not put back on the worklist again since the type cannot improve anymore. Afterwards, we process the `AndI` node and call `AndI::Value()`. At this point, the phi node still has the temporary type `int:62`. We apply the optimization that the shifted number and the mask are bitwise disjoint and we set the type of the `AndI` node to `int:0`. When later reapplying `Phi::Value()` for the phi node, we correct the type to `int:62..69` and try to push the `LShiftI` node use back to the worklist. Since its type is `int`, we do not add it again. At this point, `AndI` is not on the CCP worklist anymore and neither will we push the `AndI` node to it again. We fail to reapply `AndI::Value()` and to correct the now wrong `Value()` optimization. We keep `int:0` as type and replace the `AndI` node by constant zero - leading to a wrong execution. > > Special casing `LShift` -> `AndNodes` in CCP fixes the problem to make sure we reapply `AndI/L::Value()` again. I've applied some more refactorings and comment improvements but since this fix is for JDK 19, I've decided to separate them into an RFE ([JDK-8289051](https://bugs.openjdk.org/browse/JDK-8289051)) to reduce the risk. > > At some point, we should add some additional verification to find missed `Value()` calls in CCP to avoid similar problems in the future (see [JDK-8257197](https://bugs.openjdk.org/browse/JDK-8257197)).
> > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/c2/TestAndShiftZeroCCP.java Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk19/pull/65/files - new: https://git.openjdk.org/jdk19/pull/65/files/a76cc9b0..57e94de7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=65&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=65&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk19/pull/65.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/65/head:pull/65 PR: https://git.openjdk.org/jdk19/pull/65 From chagedorn at openjdk.org Mon Jun 27 11:35:57 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 27 Jun 2022 11:35:57 GMT Subject: [jdk19] RFR: 8288683: C2: And node gets wrong type due to not adding it back to the worklist in CCP In-Reply-To: <2AOZ4GZDzfj8-MJD_pKJA0ZjnWqqRJCsN5ZCm4O2384=.0a3cc07e-148a-4eab-a2d3-5da04146ba1f@github.com> References: <2AOZ4GZDzfj8-MJD_pKJA0ZjnWqqRJCsN5ZCm4O2384=.0a3cc07e-148a-4eab-a2d3-5da04146ba1f@github.com> Message-ID: On Fri, 24 Jun 2022 08:54:03 GMT, Christian Hagedorn wrote: > [JDK-8277850](https://bugs.openjdk.org/browse/JDK-8277850) added some new optimizations in `AndI/L::Value()` to optimize patterns similar to `(v << 2) & 3` which can directly be replaced by zero if the mask and the shifted value are bitwise disjoint. To do that, we look at the type of the shift value of the `LShift` input (right-hand side input): > https://urldefense.com/v3/__https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/mulnode.cpp*L1752-L1765__;Iw!!ACWV5N9M2RV99hQ!Kxnv5U1iwnMJL7SNGVxwIn0SXWoLDrEzBmviggt6nwEKCgbhbMJcMoKeE4Ji8x-_QpOiTvOg_DKHG2qY4-ucoukZwj251cAOLA$ > > The optimization as such works fine but there is a problem in CCP. 
After calling `Value()` for a node in CCP, we generally only add the direct users of it back to the worklist if the type changed: > https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/phaseX.cpp#L1812-L1814 > > We special case some nodes where we need to add additional nodes (grandchildren or even further down) back to the worklist to not miss updating them, for example, the `Phis` when the use is a `Region`: > https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/phaseX.cpp#L1789-L1796 > > However, we fail to special-case an `LShift` use if the shift value (right-hand side input of the `LShift`) changed. We should add all `AndI/L` nodes back to the worklist to account for the `AndI/L::Value()` optimization. Not doing so can result in a wrong execution as shown with the testcase. We have the following nodes: > ![Screenshot from 2022-06-24 10-28-41](https://user-images.githubusercontent.com/17833009/175496296-4280e26b-6f2f-4ddc-b164-b9e887a5d437.png) > > The `LShiftI` node gets `int` as type (i.e. bottom) and is not put back on the worklist again since the type cannot improve anymore. Afterwards, we process the `AndI` node and call `AndI::Value()`. At this point, the phi node still has the temporary type `int:62`. We apply the optimization that the shifted number and the mask are bitwise disjoint and we set the type of the `AndI` node to `int:0`.
When later reapplying `Phi::Value()` for the phi node, we correct the type to `int:62..69` and try to push the `LShiftI` node use back to the worklist. Since its type is `int`, we do not add it again. At this point, `AndI` is not on the CCP worklist anymore and neither will we push the `AndI` node to it again. We miss to reapply `AndI::Value()` and correct the now wrong `Value()` optimization. We keep `int:0` as type and replace the `AndI` node by constant zero - leading to a wrong execution. > > Special casing `LShift` -> `AndNodes` in CCP fixes the problem to make sure we reapply `AndI/L::Value()` again. I've applied some more refactorings and comment improvements but since this fix is for JDK 19, I've decided to separate them into an RFE ([JDK-8289051](https://bugs.openjdk.org/browse/JDK-8289051)) to reduce the risk. > > At some point, we should add some additional verification to find missed `Value()` calls in CCP to avoid similar problems in the future (see [JDK-8257197](https://bugs.openjdk.org/browse/JDK-8257197)). > > Thanks, > Christian Thanks Roland and Tobias for your reviews! ------------- PR: https://git.openjdk.org/jdk19/pull/65 From chagedorn at openjdk.org Mon Jun 27 11:35:59 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 27 Jun 2022 11:35:59 GMT Subject: [jdk19] Integrated: 8288683: C2: And node gets wrong type due to not adding it back to the worklist in CCP In-Reply-To: <2AOZ4GZDzfj8-MJD_pKJA0ZjnWqqRJCsN5ZCm4O2384=.0a3cc07e-148a-4eab-a2d3-5da04146ba1f@github.com> References: <2AOZ4GZDzfj8-MJD_pKJA0ZjnWqqRJCsN5ZCm4O2384=.0a3cc07e-148a-4eab-a2d3-5da04146ba1f@github.com> Message-ID: On Fri, 24 Jun 2022 08:54:03 GMT, Christian Hagedorn wrote: > [JDK-8277850](https://bugs.openjdk.org/browse/JDK-8277850) added some new optimizations in `AndI/L::Value()` to optimize patterns similar to `(v << 2) & 3` which can directly be replaced by zero if the mask and the shifted value are bitwise disjoint. 
To do that, we look at the type of the shift value of the `LShift` input (right-hand side input): > https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/mulnode.cpp#L1752-L1765 > > The optimization as such works fine but there is a problem in CCP. After calling `Value()` for a node in CCP, we generally only add the direct users of it back to the worklist if the type changed: > https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/phaseX.cpp#L1812-L1814 > > We special case some nodes where we need to add additional nodes (grandchildren or even further down) back to the worklist to not miss updating them, for example, the `Phis` when the use is a `Region`: > https://github.com/openjdk/jdk19/blob/bdf9902f753b71f30be8e1634fc361a5c7d8d8ec/src/hotspot/share/opto/phaseX.cpp#L1789-L1796 > > However, we fail to special-case an `LShift` use if the shift value (right-hand side input of the `LShift`) changed. We should add all `AndI/L` nodes back to the worklist to account for the `AndI/L::Value()` optimization. Not doing so can result in a wrong execution as shown with the testcase.
We have the following nodes: > ![Screenshot from 2022-06-24 10-28-41](https://user-images.githubusercontent.com/17833009/175496296-4280e26b-6f2f-4ddc-b164-b9e887a5d437.png) > > The `LShiftI` node gets `int` as type (i.e. bottom) and is not put back on the worklist again since the type cannot improve anymore. Afterwards, we process the `AndI` node and call `AndI::Value()`. At this point, the phi node still has the temporary type `int:62`. We apply the optimization that the shifted number and the mask are bitwise disjoint and we set the type of the `AndI` node to `int:0`. When later reapplying `Phi::Value()` for the phi node, we correct the type to `int:62..69` and try to push the `LShiftI` node use back to the worklist. Since its type is `int`, we do not add it again. At this point, `AndI` is not on the CCP worklist anymore and neither will we push the `AndI` node to it again. We fail to reapply `AndI::Value()` and to correct the now wrong `Value()` optimization. We keep `int:0` as type and replace the `AndI` node by constant zero - leading to a wrong execution. > > Special casing `LShift` -> `AndNodes` in CCP fixes the problem to make sure we reapply `AndI/L::Value()` again. I've applied some more refactorings and comment improvements but since this fix is for JDK 19, I've decided to separate them into an RFE ([JDK-8289051](https://bugs.openjdk.org/browse/JDK-8289051)) to reduce the risk. > > At some point, we should add some additional verification to find missed `Value()` calls in CCP to avoid similar problems in the future (see [JDK-8257197](https://bugs.openjdk.org/browse/JDK-8257197)). > > Thanks, > Christian This pull request has now been integrated.
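The `(v << 2) & 3` pattern from the quoted description can be sketched in plain Java. This is only an illustration of the bitwise-disjointness argument, not HotSpot code: the class and method names are hypothetical, and the mask `3` is chosen for illustration rather than taken from the test case. It assumes the phi feeds the shift amount, as the description suggests, and relies on Java using only the low five bits of an `int` shift count.

```java
final class AndShiftDisjoint {
    // True if (v << shift) & mask is zero for every int v.
    static boolean alwaysZero(int shift, int mask) {
        int k = shift & 31;        // Java masks int shift counts to 5 bits
        return (mask >>> k) == 0;  // mask may only use bits below position k
    }

    // True only if the result is zero for every shift amount in [lo, hi].
    static boolean alwaysZeroForRange(int lo, int hi, int mask) {
        for (int s = lo; s <= hi; s++) {
            if (!alwaysZero(s, mask)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(alwaysZero(2, 3));                 // true
        System.out.println(alwaysZero(62, 3));                // true (62 acts as 30)
        System.out.println(alwaysZeroForRange(62, 69, 3));    // false (64 acts as 0)
    }
}
```

With the single shift amount 62 the result is provably zero, but once the range widens to 62..69 it is not, which is loosely analogous to why the `AndI` type has to be recomputed after the phi type changes.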
Changeset: 784a0f04 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk19/commit/784a0f049665afde4723942e641a10a1d7675f7a Stats: 110 lines in 3 files changed: 110 ins; 0 del; 0 mod 8288683: C2: And node gets wrong type due to not adding it back to the worklist in CCP Reviewed-by: roland, thartmann ------------- PR: https://git.openjdk.org/jdk19/pull/65 From tholenstein at openjdk.org Mon Jun 27 11:43:08 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 27 Jun 2022 11:43:08 GMT Subject: RFR: JDK-8288750: IGV: Improve Shortcuts [v9] In-Reply-To: References: Message-ID: > *Improvement of keyboard shortcuts in IGV under macOS:*. > Certain keyboard/mouse shortcuts do not work under macOS. E.g. `Ctrl + left-click` to select multiple nodes. The reason is that this keyboard shortcut is hardwired as a right-click under macOS and cannot be easily changed in the operating system. In general, the macOS user manual recommends using "Command/Meta" as a modifier key instead of "Control." > > *Fixed focus of the Graph Tab:*. > In IGV, shortcuts are linked to a component. Components are for example a Graph Tab, "Outline", "Filters", "Bytecode", "Control Flow" and "Properties". Shortcuts only work if the linked component is in focus. The focus can be changed with `Ctrl + TAB` or by clicking into the TAB component. The Graph Tab did not get the focus back when the user clicked on it. This needed to be fixed. > > *Fixing QuickSearch:* > Netbeans' QuickSearchAction is a global component of which only one common instance exists. IGV used a workaround to repaint the search bar in a new graphics tab. On macOS, the search bar doubled in size with each new Graph Tab. In addition, keyboard shortcuts for the search bar did not work. This issue was fixed by adding the search bar whenever the tab gained focus, and removing it (by default) when a new tab gained focus. This way, no workaround is required, and the size and ability to use a keyboard shortcut are fixed. 
> > *Adding new actions to expand/shrink the difference selection:* > The user can expand/reduce the difference selection by moving the beginning/end of the selection with the mouse. > ![diff](https://user-images.githubusercontent.com/71546117/175498713-df3c76e8-9945-4e1c-8cab-36d9d4ee64c1.png) > This is something many users didn't know. Therefore two new buttons should make it clearer to the user that this functionality exists. > ![actions](https://user-images.githubusercontent.com/71546117/175498464-9e88e7d8-36df-4506-a801-a8d102d6bc4a.png) > By adding these buttons, we can now also add keyboard shortcuts to expand/reduce the difference selection. > > **Fixed shortcuts for:** > - Add a single node in the graph to the selection (`Ctrl/Cmd + left-click`) > - Add multiple nodes in the graph to the selection (`Ctrl/Cmd + left-click-drag`) > - Zoom in and out (`Ctrl/Cmd + mouse-wheel`) > > **Added new shortcuts for:** > - Search (`Ctrl/Cmd - I` and `Ctrl/Cmd - F`) > - Undo (`Ctrl/Cmd - Z`) > - Redo (`Ctrl/Cmd - Y` and `Ctrl/Cmd - Shift - Z`) > - Show Next Graph (`Ctrl/Cmd - RIGHT`) > - Expand the difference selection (`Ctrl/Cmd - UP` and `Ctrl/Cmd - Shift - RIGHT`) > - Reduce the difference selection (`Ctrl/Cmd - DOWN` and `Ctrl/Cmd - Shift - LEFT`) > - Show Previous Graph (`Ctrl/Cmd - LEFT`) > - Show satellite view (`Hold S`) Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: code style ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9260/files - new: https://git.openjdk.org/jdk/pull/9260/files/c80ee6d7..4ea6eb78 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=08 - incr:
https://webrevs.openjdk.org/?repo=jdk&pr=9260&range=07-08 Stats: 18 lines in 3 files changed: 0 ins; 12 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/9260.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9260/head:pull/9260 PR: https://git.openjdk.org/jdk/pull/9260 From chagedorn at openjdk.org Mon Jun 27 11:43:09 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 27 Jun 2022 11:43:09 GMT Subject: RFR: JDK-8288750: IGV: Improve Shortcuts [v9] In-Reply-To: References: Message-ID: <2RpaY_FAsLSNGMkm0993IdVBHFUp7KyQyVS1tq2ycSY=.faefec25-21aa-408d-afa1-c3dcd2ec754f@github.com> On Mon, 27 Jun 2022 11:39:00 GMT, Tobias Holenstein wrote: >> *Improvement of keyboard shortcuts in IGV under macOS:*. >> Certain keyboard/mouse shortcuts do not work under macOS. E.g. `Ctrl + left-click` to select multiple nodes. The reason is that this keyboard shortcut is hardwired as a right-click under macOS and cannot be easily changed in the operating system. In general, the macOS user manual recommends using "Command/Meta" as a modifier key instead of "Control." >> >> *Fixed focus of the Graph Tab:*. >> In IGV, shortcuts are linked to a component. Components are for example a Graph Tab, "Outline", "Filters", "Bytecode", "Control Flow" and "Properties". Shortcuts only work if the linked component is in focus. The focus can be changed with `Ctrl + TAB` or by clicking into the TAB component. The Graph Tab did not get the focus back when the user clicked on it. This needed to be fixed. >> >> *Fixing QuickSearch:* >> Netbeans' QuickSearchAction is a global component of which only one common instance exists. IGV used a workaround to repaint the search bar in a new graphics tab. On macOS, the search bar doubled in size with each new Graph Tab. In addition, keyboard shortcuts for the search bar did not work. This issue was fixed by adding the search bar whenever the tab gained focus, and removing it (by default) when a new tab gained focus. 
This way, no workaround is required, and the size and ability to use a keyboard shortcut are fixed. >> >> *Adding new actions to expand/shrink the difference selection:*. >> The user can expand/reduce the difference selection by moving the beginning/end of the selection with the mouse. >> ![diff](https://user-images.githubusercontent.com/71546117/175498713-df3c76e8-9945-4e1c-8cab-36d9d4ee64c1.png) >> This is something many users didn't know. Therefore two new buttons should make it clearer to the user that this functionality exists. >> ![actions](https://user-images.githubusercontent.com/71546117/175498464-9e88e7d8-36df-4506-a801-a8d102d6bc4a.png) >> By adding these buttons we can now also add keyboard shortcuts to expand/reduce the difference selection. >> >> **Fixed shortcuts for:** >> - Add a single node in the graph to selection (`Ctrl/Cmd + left-click`) >> - Add multiple nodes in the graph to selection (`Ctrl/Cmd + left-click-drag`) >> - Zoom in and out (`Ctrl/Cmd + mouse-wheel`) >> >> **Added new shortcuts for:** >> - Search (`Ctrl/Cmd - I` and `Ctrl/Cmd - F`) >> - Undo (`Ctrl/Cmd - Z`) >> - Redo (`Ctrl/Cmd - Y` and `Ctrl/Cmd - Shift - Z`) >> - Show Next Graph (`Ctrl/Cmd - RIGHT`) >> - Expand the difference selection (`Ctrl/Cmd - UP` and `Ctrl/Cmd - Shift - RIGHT`) >> - Reduce the difference selection (`Ctrl/Cmd - DOWN` and `Ctrl/Cmd - Shift - LEFT`) >> - Show Previous Graph (`Ctrl/Cmd - LEFT`) >> - Show satellite view (`Hold S`) > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > code style Looks good, thanks for doing the updates! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9260 From tholenstein at openjdk.org Mon Jun 27 11:43:10 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 27 Jun 2022 11:43:10 GMT Subject: RFR: JDK-8288750: IGV: Improve Shortcuts [v5] In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 12:17:11 GMT, Christian Hagedorn wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> added missing files > > That's a nice improvement, thanks for fixing these and adding new shortcuts together with icons! I've tried the shortcuts out and they seem to work fine (tested on Ubuntu 20.04). It makes the workflow a lot easier. > > I only have some minor code style comments, otherwise it looks good - but I'm not very familiar with the IGV code. @chhagedorn and @TobiHartmann thanks for the reviews! :) > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/EditorTopComponent.java line 372: > >> 370: centerPanel.add(SATELLITE_STRING, satelliteComponent); >> 371: >> 372: > > New line can be removed. done > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/ExpandDiffAction.java line 41: > >> 39: } >> 40: >> 41: public ExpandDiffAction(Lookup lookup) { > > `lookup` seems unused. Can be removed? done > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/ShrinkDiffAction.java line 40: > >> 38: } >> 39: >> 40: public ShrinkDiffAction(Lookup lookup) { > > `lookup` seems unused. Can be removed? done > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/ShrinkDiffAction.java line 67: > >> 65: int nfp = fp; >> 66: int nsp = (fp < sp) ? sp - 1 : sp; >> 67: model.setPositions(nfp, nsp); > > `nfp` can be inlined directly. Maybe you want to rename the variables instead of using abbreviations. Same in `ExpandDiffAction`. 
done ------------- PR: https://git.openjdk.org/jdk/pull/9260 From aph at openjdk.org Mon Jun 27 12:35:44 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 27 Jun 2022 12:35:44 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v4] In-Reply-To: References: Message-ID: > All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers. > > Here's an example of what was happening: > > ` rax->encoding();` > > Where rax is defined as `(Register *)0`. > > This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl. > > > typedef const RegisterImpl* Register; > extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1] ; > inline constexpr Register RegisterImpl::first() { return all_Registers + 1; }; > inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; } > constexpr Register rax = as_Register(0); Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: More ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9261/files - new: https://git.openjdk.org/jdk/pull/9261/files/948cda18..8f965c9f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=02-03 Stats: 28 lines in 2 files changed: 14 ins; 2 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/9261.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9261/head:pull/9261 PR: https://git.openjdk.org/jdk/pull/9261 From roland at openjdk.org Mon Jun 27 12:45:55 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 27 Jun 2022 12:45:55 GMT Subject: RFR: 8287227: Shenandoah: A couple of virtual thread tests failed with iu mode even without Loom enabled. 
In-Reply-To: References: Message-ID: On Thu, 2 Jun 2022 14:39:10 GMT, Roman Kennke wrote: >> With JDK-8277654, the load barrier slow path call doesn't produce raw >> memory anymore but the IU barrier call still does. I propose removing >> raw memory for that call too which also causes the assert that fails >> to be removed. > > Is it correct, though? I seem to remember that without the memory edges, we may get reordering of the 'SATB' buffer and index accesses between IU-barriers, which would cause troubles? @rkennke @shipilev thanks for the review and tests ------------- PR: https://git.openjdk.org/jdk/pull/8958 From roland at openjdk.org Mon Jun 27 12:45:56 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 27 Jun 2022 12:45:56 GMT Subject: Integrated: 8287227: Shenandoah: A couple of virtual thread tests failed with iu mode even without Loom enabled. In-Reply-To: References: Message-ID: On Tue, 31 May 2022 14:46:58 GMT, Roland Westrelin wrote: > With JDK-8277654, the load barrier slow path call doesn't produce raw > memory anymore but the IU barrier call still does. I propose removing > raw memory for that call too which also causes the assert that fails > to be removed. This pull request has now been integrated. Changeset: 210a06a2 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/210a06a287521a554316a9052bd9fdf616c7b884 Stats: 11 lines in 2 files changed: 11 ins; 0 del; 0 mod 8287227: Shenandoah: A couple of virtual thread tests failed with iu mode even without Loom enabled. 
Reviewed-by: shade, rkennke ------------- PR: https://git.openjdk.org/jdk/pull/8958 From jvernee at openjdk.org Mon Jun 27 13:09:42 2022 From: jvernee at openjdk.org (Jorn Vernee) Date: Mon, 27 Jun 2022 13:09:42 GMT Subject: RFR: 8289060: Undefined Behaviour in class VMReg In-Reply-To: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> References: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> Message-ID: On Fri, 24 Jun 2022 13:58:29 GMT, Andrew Haley wrote: > We could instead make `VMReg` instances objects with a single numeric field rather than pointers, but some C++ compilers pass all such objects by reference, so I don't think we should. I've been thinking about this some more after you fixed the same issue for `Register` on AArch64 [1]. I think the issue is out-of-line calls to member functions. Since `this` is a pointer, the compiler is forced to spill the value on the stack to comply with the ABI. i.e. what we'd really want is some way to say "pass `this` by value". (On x64 Windows, as long as a struct fits in a register, it is not passed by reference). To avoid that, I think we could theoretically convert every member function to a static function that takes a `VMReg` as its first argument. That's _an_ option, but not a very nice one... (just mentioning it for the record). [1] : https://github.com/openjdk/jdk/pull/6280#issuecomment-964069801 --- I think the patch looks good overall, but it looks like there are some failures in some of the SA tests. 
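The trade-off described above, between a member function that receives `this` as a pointer and a static function that takes the handle by value, can be sketched roughly as follows. This is only an illustration under stated assumptions: `RegHandle`, `is_valid_member` and `is_valid_static` are hypothetical names, not HotSpot's `VMReg` API, and whether the by-value form really stays in a register depends on the platform ABI and the compiler.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a register handle with a single numeric field.
// The names here are illustrative only; this is not HotSpot's VMReg.
class RegHandle {
  int32_t _value;

 public:
  constexpr explicit RegHandle(int32_t value) : _value(value) {}
  constexpr int32_t value() const { return _value; }

  // Member function: when the call is not inlined, the callee receives
  // `this` as a pointer, so the object has to be addressable in memory.
  bool is_valid_member() const;

  // Static alternative: the handle is an ordinary by-value parameter, so the
  // calling convention decides whether it can travel in a register.
  static bool is_valid_static(RegHandle reg);
};

// Defined outside the class body to mimic out-of-line member functions.
bool RegHandle::is_valid_member() const { return _value >= 0; }
bool RegHandle::is_valid_static(RegHandle reg) { return reg._value >= 0; }
```

Both forms compute the same result; for example `RegHandle(3).is_valid_member()` and `RegHandle::is_valid_static(RegHandle(3))` are both true. The difference is only in how the argument reaches the callee, which is why converting every member function this way would be a mechanical but intrusive change.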
------------- PR: https://git.openjdk.org/jdk/pull/9276 From tholenstein at openjdk.org Mon Jun 27 13:25:40 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 27 Jun 2022 13:25:40 GMT Subject: Integrated: JDK-8288750: IGV: Improve Shortcuts In-Reply-To: References: Message-ID: On Thu, 23 Jun 2022 10:47:12 GMT, Tobias Holenstein wrote: > *Improvement of keyboard shortcuts in IGV under macOS:*. > Certain keyboard/mouse shortcuts do not work under macOS. E.g. `Ctrl + left-click` to select multiple nodes. The reason is that this keyboard shortcut is hardwired as a right-click under macOS and cannot be easily changed in the operating system. In general, the macOS user manual recommends using "Command/Meta" as a modifier key instead of "Control." > > *Fixed focus of the Graph Tab:*. > In IGV, shortcuts are linked to a component. Components are for example a Graph Tab, "Outline", "Filters", "Bytecode", "Control Flow" and "Properties". Shortcuts only work if the linked component is in focus. The focus can be changed with `Ctrl + TAB` or by clicking into the TAB component. The Graph Tab did not get the focus back when the user clicked on it. This needed to be fixed. > > *Fixing QuickSearch:* > Netbeans' QuickSearchAction is a global component of which only one common instance exists. IGV used a workaround to repaint the search bar in a new graphics tab. On macOS, the search bar doubled in size with each new Graph Tab. In addition, keyboard shortcuts for the search bar did not work. This issue was fixed by adding the search bar whenever the tab gained focus, and removing it (by default) when a new tab gained focus. This way, no workaround is required, and the size and ability to use a keyboard shortcut are fixed. > > *Adding new actions to expand/shrink the difference selection:*. > The user can expand/reduce the difference selection by moving the beginning/end of the selection with the mouse. 
> ![diff](https://user-images.githubusercontent.com/71546117/175498713-df3c76e8-9945-4e1c-8cab-36d9d4ee64c1.png) > This is something many users didn't know. Therefore two new buttons should make it clearer to the user that this functionality exists. > ![actions](https://user-images.githubusercontent.com/71546117/175498464-9e88e7d8-36df-4506-a801-a8d102d6bc4a.png) > By adding these buttons we can now also add keyboard shortcuts to expand/reduce the difference selection. > > **Fixed shortcuts for:** > - Add a single node in the graph to selection (`Ctrl/Cmd + left-click`) > - Add multiple nodes in the graph to selection (`Ctrl/Cmd + left-click-drag`) > - Zoom in and out (`Ctrl/Cmd + mouse-wheel`) > > **Added new shortcuts for:** > - Search (`Ctrl/Cmd - I` and `Ctrl/Cmd - F`) > - Undo (`Ctrl/Cmd - Z`) > - Redo (`Ctrl/Cmd - Y` and `Ctrl/Cmd - Shift - Z`) > - Show Next Graph (`Ctrl/Cmd - RIGHT`) > - Expand the difference selection (`Ctrl/Cmd - UP` and `Ctrl/Cmd - Shift - RIGHT`) > - Reduce the difference selection (`Ctrl/Cmd - DOWN` and `Ctrl/Cmd - Shift - LEFT`) > - Show Previous Graph (`Ctrl/Cmd - LEFT`) > - Show satellite view (`Hold S`) This pull request has now been integrated. 
Changeset: be6be15e Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/be6be15efa1fe85c4e2dacc181b3238f9190127e Stats: 742 lines in 33 files changed: 454 ins; 238 del; 50 mod 8288750: IGV: Improve Shortcuts Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/9260 From coleenp at openjdk.org Mon Jun 27 14:58:47 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 27 Jun 2022 14:58:47 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v2] In-Reply-To: References: Message-ID: On Sat, 25 Jun 2022 01:19:10 GMT, Ron Pressler wrote: >> src/hotspot/share/runtime/mutexLocker.cpp line 287: >> >>> 285: def(JfieldIdCreation_lock , PaddedMutex , safepoint); >>> 286: >>> 287: def(CompiledIC_lock , PaddedMutex , nosafepoint-1); // locks VtableStubs_lock, InlineCacheBuffer_lock >> >> Please explain. Is there another lock causing problems? > > The handshake lock, which is also nosafepoint. This should be ok, provided all the tests have been run. It reduces the rank of other locks, but there's still room in the lock rank range (by 1), and there's an assert for that. // These locks have relative rankings, and inherit safepoint checking attributes from that rank. 
defl(InlineCacheBuffer_lock , PaddedMutex , CompiledIC_lock); defl(VtableStubs_lock , PaddedMutex , CompiledIC_lock); // Also holds DumpTimeTable_lock defl(CodeCache_lock , PaddedMonitor, VtableStubs_lock); defl(CompiledMethod_lock , PaddedMutex , CodeCache_lock); defl(CodeSweeper_lock , PaddedMonitor, CompiledMethod_lock); ------------- PR: https://git.openjdk.org/jdk19/pull/66 From aph at openjdk.org Mon Jun 27 15:05:42 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 27 Jun 2022 15:05:42 GMT Subject: RFR: 8289060: Undefined Behaviour in class VMReg In-Reply-To: References: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> Message-ID: <_lP7-1R69GHQ1ETdUxb_motCZoWus5aiaCYFtvySDJg=.8158500b-ea2c-441c-b98b-48d671fdef76@github.com> On Mon, 27 Jun 2022 13:05:43 GMT, Jorn Vernee wrote: > I've been thinking about this some more after you fixed the same issue for `Register` on AArch64 [1]. I think the issue is out-of-line calls to member functions. Since `this` is a pointer, the compiler is forced to spill the value on the stack to comply with the ABI. i.e. what we'd really want is some way to say "pass `this` by value". (On x64 Windows, as long as a struct fits in a register, it is not passed by reference). Ah, I see. That makes sense. > I think the patch looks good overall, but it looks like there are some failures in some of the SA tests. Right. I'll start digging. 
------------- PR: https://git.openjdk.org/jdk/pull/9276 From rehn at openjdk.org Mon Jun 27 15:24:43 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Mon, 27 Jun 2022 15:24:43 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v2] In-Reply-To: References: Message-ID: On Sat, 25 Jun 2022 01:23:47 GMT, Ron Pressler wrote: >> Please review the following bug fix: >> >> `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. >> >> Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. >> >> This change does three things: >> >> 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. >> 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. >> 3. In interp_only_mode, the c2i stub will not patch the callsite. >> >> This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 >> >> >> Passes tiers 1-4 and Loom tiers 1-5. > > Ron Pressler has updated the pull request incrementally with one additional commit since the last revision: > > Revert "Remove outdated comment" > > This reverts commit 8f571d76e34bc64ceb31894184fba4b909e8fbfe. Handshakes are per-thread serialization points, so as Erik says, the thread going to interp mode will pick up the correct code, but other threads may or may not see it. Lock rank change may be okay, too much code to trace, just say yes for me. 
------------- PR: https://git.openjdk.org/jdk19/pull/66 From xliu at openjdk.org Mon Jun 27 17:21:47 2022 From: xliu at openjdk.org (Xin Liu) Date: Mon, 27 Jun 2022 17:21:47 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v10] In-Reply-To: <9o8fXgQUo5J0LvKlWkLq-xmR16XInT_xWCV8ruauD30=.4a6ad1af-ad96-4d34-aca1-4bb68cc96782@github.com> References: <9o8fXgQUo5J0LvKlWkLq-xmR16XInT_xWCV8ruauD30=.4a6ad1af-ad96-4d34-aca1-4bb68cc96782@github.com> Message-ID: On Sat, 4 Jun 2022 16:17:19 GMT, Vladimir Kozlov wrote: >> 2 tests failed so far. I put information into RFE. > >> 2 tests failed so far. I put information into RFE. > > No other new failures in my tier1-7 testing. I think after you address found issue it will be ready to integrate (after second review by other Reviewer). But I would suggest to push it into JDK 20 after 19 is forked in one week to get more testing before release. hi, @vnkozlov Could you take a look at the new revision? thanks, --lx ------------- PR: https://git.openjdk.org/jdk/pull/8545 From dlong at openjdk.org Mon Jun 27 20:01:44 2022 From: dlong at openjdk.org (Dean Long) Date: Mon, 27 Jun 2022 20:01:44 GMT Subject: [jdk19] RFR: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding [v5] In-Reply-To: References: Message-ID: On Mon, 27 Jun 2022 05:08:36 GMT, Ningsheng Jian wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/hotspot/jtreg/compiler/codegen/ShiftByZero.java >> >> Co-authored-by: Hao Sun > > Looks good to me. Thanks @nsjian, @shqking, and @theRealAph. 
------------- PR: https://git.openjdk.org/jdk19/pull/40 From dlong at openjdk.org Mon Jun 27 20:47:59 2022 From: dlong at openjdk.org (Dean Long) Date: Mon, 27 Jun 2022 20:47:59 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v2] In-Reply-To: References: Message-ID: On Sat, 25 Jun 2022 01:23:47 GMT, Ron Pressler wrote: >> Please review the following bug fix: >> >> `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. >> >> Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. >> >> This change does three things: >> >> 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. >> 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. >> 3. In interp_only_mode, the c2i stub will not patch the callsite. >> >> This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 >> >> >> Passes tiers 1-4 and Loom tiers 1-5. > > Ron Pressler has updated the pull request incrementally with one additional commit since the last revision: > > Revert "Remove outdated comment" > > This reverts commit 8f571d76e34bc64ceb31894184fba4b909e8fbfe. Marked as reviewed by dlong (Reviewer). 
------------- PR: https://git.openjdk.org/jdk19/pull/66 From dlong at openjdk.org Tue Jun 28 03:15:00 2022 From: dlong at openjdk.org (Dean Long) Date: Tue, 28 Jun 2022 03:15:00 GMT Subject: [jdk19] Integrated: 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding In-Reply-To: References: Message-ID: On Fri, 17 Jun 2022 22:37:28 GMT, Dean Long wrote: > The range for aarch64 vector right-shift is 1 to the element width. This issue fixes the problem in the back-end. There is a separate problem in the front-end that shift by 0 is not always optimized out. This pull request has now been integrated. Changeset: b4490386 Author: Dean Long URL: https://git.openjdk.org/jdk19/commit/b4490386fe348250e88347526172c1c27ef01eba Stats: 115 lines in 4 files changed: 83 ins; 0 del; 32 mod 8288445: AArch64: C2 compilation fails with guarantee(!true || (true && (shift != 0))) failed: impossible encoding Reviewed-by: thartmann, haosun, njian ------------- PR: https://git.openjdk.org/jdk19/pull/40 From njian at openjdk.org Tue Jun 28 04:07:48 2022 From: njian at openjdk.org (Ningsheng Jian) Date: Tue, 28 Jun 2022 04:07:48 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v5] In-Reply-To: References: Message-ID: On Tue, 21 Jun 2022 08:48:38 GMT, Xiaohong Gong wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. 
And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. >> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. >> >> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. 
>> >> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: >> >> Benchmark size Gain >> Byte256Vector.ADDLanes 1024 0.999 >> Byte256Vector.ANDLanes 1024 1.065 >> Byte256Vector.MAXLanes 1024 1.064 >> Byte256Vector.MINLanes 1024 1.062 >> Byte256Vector.ORLanes 1024 1.072 >> Byte256Vector.XORLanes 1024 1.041 >> Short256Vector.ADDLanes 1024 1.017 >> Short256Vector.ANDLanes 1024 1.044 >> Short256Vector.MAXLanes 1024 1.049 >> Short256Vector.MINLanes 1024 1.049 >> Short256Vector.ORLanes 1024 1.089 >> Short256Vector.XORLanes 1024 1.047 >> Int256Vector.ADDLanes 1024 1.045 >> Int256Vector.ANDLanes 1024 1.078 >> Int256Vector.MAXLanes 1024 1.123 >> Int256Vector.MINLanes 1024 1.129 >> Int256Vector.ORLanes 1024 1.078 >> Int256Vector.XORLanes 1024 1.072 >> Long256Vector.ADDLanes 1024 1.059 >> Long256Vector.ANDLanes 1024 1.101 >> Long256Vector.MAXLanes 1024 1.079 >> Long256Vector.MINLanes 1024 1.099 >> Long256Vector.ORLanes 1024 1.098 >> Long256Vector.XORLanes 1024 1.110 >> Float256Vector.ADDLanes 1024 1.033 >> Float256Vector.MAXLanes 1024 1.156 >> Float256Vector.MINLanes 1024 1.151 >> Double256Vector.ADDLanes 1024 1.062 >> Double256Vector.MAXLanes 1024 1.145 >> Double256Vector.MINLanes 1024 1.140 >> >> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: >> >> sxtw x14, w14 >> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains six commits: > > - Merge branch 'jdk:master' into JDK-8286941 > - Fix the ci build issue > - Address review comments, revert changes for gatherL/scatterL rules > - Merge branch 'jdk:master' into JDK-8286941 > - Revert transformation from MaskAll to VectorMaskGen, address review comments > - 8286941: Add mask IR for partial vector operations for ARM SVE src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3636: > 3634: #undef INSN > 3635: > 3636: // SVE predicate generation (32-bit and 64-bit variants) Suggestion: // SVE integer compare scalar count and limit ------------- PR: https://git.openjdk.org/jdk/pull/9037 From jbhateja at openjdk.org Tue Jun 28 06:11:49 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 28 Jun 2022 06:11:49 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long [v3] In-Reply-To: References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Wed, 22 Jun 2022 03:01:36 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch implements intrinsics for `Integer/Long::compareUnsigned` using the same approach as the JVM does for long and floating-point comparisons. This allows efficient and reliable usage of unsigned comparison in Java, which is a basic operation and is important for range checks such as discussed in #8620 . >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > add comparison for direct value of compare src/hotspot/cpu/x86/x86_64.ad line 13027: > 13025: // Manifest a CmpU result in an integer register. Very painful. > 13026: // This is the test to avoid. > 13027: instruct cmpU3_reg_reg(rRegI dst, rRegI src1, rRegI src2, rFlagsReg flags) Do you plan to add 32-bit support? The integer pattern can be moved to the common file x86.ad and the 64-bit pattern can be handled in the 32/64-bit AD files. 
src/hotspot/cpu/x86/x86_64.ad line 13043: > 13041: __ cmpl($src1$$Register, $src2$$Register); > 13042: __ movl($dst$$Register, -1); > 13043: __ jccb(Assembler::below, done); By placing the compare adjacent to the conditional jump, the in-order frontend can trigger macro-fusion. Kindly refer to section 3.4.2.2 of Intel's optimization manual. src/hotspot/cpu/x86/x86_64.ad line 13095: > 13093: __ cmpq($src1$$Register, $src2$$Register); > 13094: __ movl($dst$$Register, -1); > 13095: __ jccb(Assembler::below, done); Same as above. src/hotspot/share/opto/subnode.hpp line 185: > 183: } > 184: virtual int Opcode() const; > 185: virtual uint ideal_reg() const { return Op_RegI; } Value routine to handle constant folding. src/hotspot/share/opto/subnode.hpp line 247: > 245: init_class_id(Class_Sub); > 246: } > 247: virtual int Opcode() const; In-lining may connect the inputs to constant, hence a Value routine may be useful here. ------------- PR: https://git.openjdk.org/jdk/pull/9068 From duke at openjdk.org Tue Jun 28 06:32:43 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Tue, 28 Jun 2022 06:32:43 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long [v3] In-Reply-To: References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Tue, 28 Jun 2022 05:51:42 GMT, Jatin Bhateja wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> add comparison for direct value of compare > > src/hotspot/cpu/x86/x86_64.ad line 13027: > >> 13025: // Manifest a CmpU result in an integer register. Very painful. >> 13026: // This is the test to avoid. >> 13027: instruct cmpU3_reg_reg(rRegI dst, rRegI src1, rRegI src2, rFlagsReg flags) > > Do you plan to add 32-bit support? > The integer pattern can be moved to the common file x86.ad and the 64-bit pattern can be handled in the 32/64-bit AD files. 
Yes I will add support for 32-bit after this patch, basic rules are often put in the bit-specific ad file so I think it would be more preferable to follow that convention here. > src/hotspot/share/opto/subnode.hpp line 247: > >> 245: init_class_id(Class_Sub); >> 246: } >> 247: virtual int Opcode() const; > > In-lining may connect the inputs to constant, hence a Value routine may be useful here. `CmpU3` inherits the `Value` method from its superclass `CmpU` ------------- PR: https://git.openjdk.org/jdk/pull/9068 From thartmann at openjdk.org Tue Jun 28 06:56:52 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 28 Jun 2022 06:56:52 GMT Subject: [jdk19] RFR: 8289069: Very slow C1 arraycopy jcstress tests after JDK-8279886 In-Reply-To: References: Message-ID: On Sat, 25 Jun 2022 04:35:59 GMT, Igor Veresov wrote: > I used `BlockBegin::number_of_blocks()` to size the bitmaps, however that is a total number of blocks. Since `mark_loops()` is called after every inlining (for every inlinee - no need to reanalyze the whole method), the bitmaps get progressively larger, and have to be zeroed. That makes the complexity quadratic. The solution is to appropriately size the bitmaps and keep the whole thing linear. Tests look good. Looks reasonable. src/hotspot/share/c1/c1_GraphBuilder.cpp line 413: > 411: // never go back through the original loop header. Therefore if there are any irreducible loops the bits in the states > 412: // for these loops are going to propagate back to the root. > 413: BlockBegin* start = _bci2block->at(0); There is a typo in line 408, that you might want to fix as well `irriducible` -> `irreducible`. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk19/pull/72 From thartmann at openjdk.org Tue Jun 28 06:57:43 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 28 Jun 2022 06:57:43 GMT Subject: RFR: 8289071: Compute allocation sizes of stubs and nmethods outside of lock protection [v2] In-Reply-To: <2N6fS9zQViWs6EK5ehhH3AmGjjabINboLJ_By12AKyA=.669c76ba-9a9e-4818-911a-0ae7fb327bde@github.com> References: <2N6fS9zQViWs6EK5ehhH3AmGjjabINboLJ_By12AKyA=.669c76ba-9a9e-4818-911a-0ae7fb327bde@github.com> Message-ID: On Fri, 24 Jun 2022 17:16:37 GMT, Yi-Fan Tsai wrote: >> The pattern of computing the allocation size is inconsistent among the derived classes of CodeBlob. These sizes are not based on the shared resources but computed inside lock-protected regions in several classes. This change moves these cases outside the protected region. > > Yi-Fan Tsai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' of https://github.com/yftsai/jdk into JDK-8289071 > - 8289071: Compute stub sizes outside of locks Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9266 From jbhateja at openjdk.org Tue Jun 28 07:39:41 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 28 Jun 2022 07:39:41 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long [v3] In-Reply-To: References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Tue, 28 Jun 2022 06:29:03 GMT, Quan Anh Mai wrote: >> src/hotspot/share/opto/subnode.hpp line 247: >> >>> 245: init_class_id(Class_Sub); >>> 246: } >>> 247: virtual int Opcode() const; >> >> In-lining may connect the inputs to constant, hence a Value routine may be useful here. > > `CmpU3` inherits the `Value` method from its superclass `CmpU` Its fine then. ------------- PR: https://git.openjdk.org/jdk/pull/9068 From duke at openjdk.org Tue Jun 28 11:40:52 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Tue, 28 Jun 2022 11:40:52 GMT Subject: RFR: 8289071: Compute allocation sizes of stubs and nmethods outside of lock protection [v2] In-Reply-To: <2N6fS9zQViWs6EK5ehhH3AmGjjabINboLJ_By12AKyA=.669c76ba-9a9e-4818-911a-0ae7fb327bde@github.com> References: <2N6fS9zQViWs6EK5ehhH3AmGjjabINboLJ_By12AKyA=.669c76ba-9a9e-4818-911a-0ae7fb327bde@github.com> Message-ID: On Fri, 24 Jun 2022 17:16:37 GMT, Yi-Fan Tsai wrote: >> The pattern of computing the allocation size is inconsistent among the derived classes of CodeBlob. These sizes are not based on the shared resources but computed inside lock-protected regions in several classes. This change moves these cases outside the protected region. > > Yi-Fan Tsai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: > > - Merge branch 'master' of https://github.com/yftsai/jdk into JDK-8289071 > - 8289071: Compute stub sizes outside of locks lgtm ------------- Marked as reviewed by eastig at github.com (no known OpenJDK username). PR: https://git.openjdk.org/jdk/pull/9266 From duke at openjdk.org Tue Jun 28 12:46:43 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Tue, 28 Jun 2022 12:46:43 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long [v3] In-Reply-To: References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Wed, 22 Jun 2022 03:01:36 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch implements intrinsics for `Integer/Long::compareUnsigned` using the same approach as the JVM does for long and floating-point comparisons. This allows efficient and reliable usage of unsigned comparison in Java, which is a basic operation and is important for range checks such as discussed in #8620 . >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > add comparison for direct value of compare @jatin-bhateja Thanks a lot for your reviews and suggestions, I have answered your comments. 
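
For readers unfamiliar with the methods being intrinsified, the following is a minimal sketch of how `Integer.compareUnsigned` and `Long.compareUnsigned` are typically used for the kind of range check mentioned in the description. The class and method names (`UnsignedCompareSketch`, `inRange`) are hypothetical and are not taken from the patch; the sketch also assumes `length` is non-negative.

```java
// Illustrative only: class and method names are hypothetical, not from the patch.
final class UnsignedCompareSketch {
    // True when 0 <= index < length, assuming length >= 0.
    // A negative index, reinterpreted as unsigned, is larger than any
    // non-negative length, so one unsigned comparison covers both bounds.
    static boolean inRange(int index, int length) {
        return Integer.compareUnsigned(index, length) < 0;
    }

    public static void main(String[] args) {
        // -1 is 0xFFFFFFFF when viewed as unsigned, so it compares greater than 1.
        System.out.println(Integer.compareUnsigned(-1, 1) > 0);
        // The same holds for the 64-bit variant.
        System.out.println(Long.compareUnsigned(1L, -1L) < 0);
        System.out.println(inRange(3, 10));
        System.out.println(inRange(-1, 10));
    }
}
```

The single unsigned comparison replaces the two signed comparisons `index >= 0 && index < length`, which is presumably why a fast intrinsic matters for range checks.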
------------- PR: https://git.openjdk.org/jdk/pull/9068 From duke at openjdk.org Tue Jun 28 12:46:45 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Tue, 28 Jun 2022 12:46:45 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long [v3] In-Reply-To: References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Tue, 28 Jun 2022 05:20:03 GMT, Jatin Bhateja wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> add comparison for direct value of compare > > src/hotspot/cpu/x86/x86_64.ad line 13043: > >> 13041: __ cmpl($src1$$Register, $src2$$Register); >> 13042: __ movl($dst$$Register, -1); >> 13043: __ jccb(Assembler::below, done); > > By placing compare adjacent to conditional jump in-order frontend can trigger macro-fusion. > Kindly refer section 3.4.2.2 of Intel's optimization manual. I realised that by swapping the `mov` and the `cmp` instruction, the rule needs to have `dst` different from `src1` and `src2`, which increases register pressure. ------------- PR: https://git.openjdk.org/jdk/pull/9068 From phh at openjdk.org Tue Jun 28 15:14:42 2022 From: phh at openjdk.org (Paul Hohensee) Date: Tue, 28 Jun 2022 15:14:42 GMT Subject: RFR: 8289071: Compute allocation sizes of stubs and nmethods outside of lock protection [v2] In-Reply-To: <2N6fS9zQViWs6EK5ehhH3AmGjjabINboLJ_By12AKyA=.669c76ba-9a9e-4818-911a-0ae7fb327bde@github.com> References: <2N6fS9zQViWs6EK5ehhH3AmGjjabINboLJ_By12AKyA=.669c76ba-9a9e-4818-911a-0ae7fb327bde@github.com> Message-ID: On Fri, 24 Jun 2022 17:16:37 GMT, Yi-Fan Tsai wrote: >> The pattern of computing the allocation size is inconsistent among the derived classes of CodeBlob. These sizes are not based on the shared resources but computed inside lock-protected regions in several classes. This change moves these cases outside the protected region. 
> > Yi-Fan Tsai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' of https://github.com/yftsai/jdk into JDK-8289071 > - 8289071: Compute stub sizes outside of locks Lgtm. ------------- Marked as reviewed by phh (Reviewer). PR: https://git.openjdk.org/jdk/pull/9266 From duke at openjdk.org Tue Jun 28 15:19:10 2022 From: duke at openjdk.org (Yi-Fan Tsai) Date: Tue, 28 Jun 2022 15:19:10 GMT Subject: Integrated: 8289071: Compute allocation sizes of stubs and nmethods outside of lock protection In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 00:05:25 GMT, Yi-Fan Tsai wrote: > The pattern of computing the allocation size is inconsistent among the derived classes of CodeBlob. These sizes are not based on the shared resources but computed inside lock-protected regions in several classes. This change moves these cases outside the protected region. This pull request has now been integrated. Changeset: 88fe19c5 Author: Yi-Fan Tsai Committer: Paul Hohensee URL: https://git.openjdk.org/jdk/commit/88fe19c5b2d809d5b9136e1a86887a50d0eeeb55 Stats: 25 lines in 2 files changed: 8 ins; 7 del; 10 mod 8289071: Compute allocation sizes of stubs and nmethods outside of lock protection Reviewed-by: thartmann, phh ------------- PR: https://git.openjdk.org/jdk/pull/9266 From iveresov at openjdk.org Tue Jun 28 16:27:25 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Tue, 28 Jun 2022 16:27:25 GMT Subject: [jdk19] RFR: 8289069: Very slow C1 arraycopy jcstress tests after JDK-8279886 [v2] In-Reply-To: References: Message-ID: > I used `BlockBegin::number_of_blocks()` to size the bitmaps, however that is a total number of blocks. 
Since `mark_loops()` is called after every inlining (for every inlinee - no need to reanalyze the whole method), the bitmaps get progressively larger, and have to be zeroed. That makes the complexity quadratic. The solution is to appropriately size the bitmaps and keep the whole thing linear. Tests look good. Igor Veresov has updated the pull request incrementally with one additional commit since the last revision: Fix a typo ------------- Changes: - all: https://git.openjdk.org/jdk19/pull/72/files - new: https://git.openjdk.org/jdk19/pull/72/files/6f20dec2..8abfae6e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=72&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=72&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk19/pull/72.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/72/head:pull/72 PR: https://git.openjdk.org/jdk19/pull/72 From kvn at openjdk.org Tue Jun 28 16:27:25 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 28 Jun 2022 16:27:25 GMT Subject: [jdk19] RFR: 8289069: Very slow C1 arraycopy jcstress tests after JDK-8279886 [v2] In-Reply-To: References: Message-ID: On Tue, 28 Jun 2022 16:23:55 GMT, Igor Veresov wrote: >> I used `BlockBegin::number_of_blocks()` to size the bitmaps, however that is a total number of blocks. Since `mark_loops()` is called after every inlining (for every inlinee - no need to reanalyze the whole method), the bitmaps get progressively larger, and have to be zeroed. That makes the complexity quadratic. The solution is to appropriately size the bitmaps and keep the whole thing linear. Tests look good. > > Igor Veresov has updated the pull request incrementally with one additional commit since the last revision: > > Fix a typo Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk19/pull/72 From iveresov at openjdk.org Tue Jun 28 16:27:26 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Tue, 28 Jun 2022 16:27:26 GMT Subject: [jdk19] RFR: 8289069: Very slow C1 arraycopy jcstress tests after JDK-8279886 In-Reply-To: References: Message-ID: On Sat, 25 Jun 2022 04:35:59 GMT, Igor Veresov wrote: > I used `BlockBegin::number_of_blocks()` to size the bitmaps, however that is a total number of blocks. Since `mark_loops()` is called after every inlining (for every inlinee - no need to reanalyze the whole method), the bitmaps get progressively larger, and have to be zeroed. That makes the complexity quadratic. The solution is to appropriately size the bitmaps and keep the whole thing linear. Tests look good. Thanks for the reviews! ------------- PR: https://git.openjdk.org/jdk19/pull/72 From iveresov at openjdk.org Tue Jun 28 16:27:27 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Tue, 28 Jun 2022 16:27:27 GMT Subject: [jdk19] RFR: 8289069: Very slow C1 arraycopy jcstress tests after JDK-8279886 [v2] In-Reply-To: References: Message-ID: On Tue, 28 Jun 2022 06:34:08 GMT, Tobias Hartmann wrote: >> Igor Veresov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix a typo > > src/hotspot/share/c1/c1_GraphBuilder.cpp line 413: > >> 411: // never go back through the original loop header. Therefore if there are any irreducible loops the bits in the states >> 412: // for these loops are going to propagate back to the root. >> 413: BlockBegin* start = _bci2block->at(0); > > There is a typo in line 408, that you might want to fix as well `irriducible` -> `irreducible`. Fixed. Thanks! 
------------- PR: https://git.openjdk.org/jdk19/pull/72 From iveresov at openjdk.org Tue Jun 28 16:29:55 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Tue, 28 Jun 2022 16:29:55 GMT Subject: [jdk19] Integrated: 8289069: Very slow C1 arraycopy jcstress tests after JDK-8279886 In-Reply-To: References: Message-ID: On Sat, 25 Jun 2022 04:35:59 GMT, Igor Veresov wrote: > I used `BlockBegin::number_of_blocks()` to size the bitmaps, however that is a total number of blocks. Since `mark_loops()` is called after every inlining (for every inlinee - no need to reanalyze the whole method), the bitmaps get progressively larger, and have to be zeroed. That makes the complexity quadratic. The solution is to appropriately size the bitmaps and keep the whole thing linear. Tests look good. This pull request has now been integrated. Changeset: 9b7805e3 Author: Igor Veresov URL: https://git.openjdk.org/jdk19/commit/9b7805e3b4b3b5248a5cf8a5a5f3cf2260784a3b Stats: 32 lines in 1 file changed: 10 ins; 0 del; 22 mod 8289069: Very slow C1 arraycopy jcstress tests after JDK-8279886 Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk19/pull/72 From duke at openjdk.org Tue Jun 28 16:41:36 2022 From: duke at openjdk.org (Yi-Fan Tsai) Date: Tue, 28 Jun 2022 16:41:36 GMT Subject: RFR: 8263377: Store method handle linkers in the 'non-nmethods' heap [v3] In-Reply-To: References: Message-ID: On Fri, 3 Jun 2022 18:06:16 GMT, Jorn Vernee wrote: >> Yi-Fan Tsai has updated the pull request incrementally with one additional commit since the last revision: >> >> Post dynamic_code_generate event when MH intrinsic generated > > src/hotspot/share/code/codeBlob.cpp line 347: > >> 345: { >> 346: MutexLocker mu(CodeCache_lock, Mutex::_no_safepoint_check_flag); >> 347: int mhi_size = CodeBlob::allocation_size(code_buffer, sizeof(MethodHandleIntrinsicBlob)); > > The allocation size could also be computed before taking the code cache lock. 
BufferBlob also does this for example, but others don't. I think it makes sense to have it outside of the mutex block though, to minimize the time we need to hold the lock. I don't see anything in there that seems to require the lock. (Maybe we should clean up other cases in a followup as well). The issue was [filed](https://bugs.openjdk.org/browse/JDK-8289071) and [fixed](https://github.com/openjdk/jdk/pull/9266). ------------- PR: https://git.openjdk.org/jdk/pull/8760 From jvernee at openjdk.org Tue Jun 28 19:44:28 2022 From: jvernee at openjdk.org (Jorn Vernee) Date: Tue, 28 Jun 2022 19:44:28 GMT Subject: RFR: 8263377: Store method handle linkers in the 'non-nmethods' heap [v3] In-Reply-To: References: Message-ID: <9w1jl4SXtQ_STu3LFt1yv2YqcfMZHvlPRYLUGg4UN3k=.eb2e7a84-d6ce-4577-9488-5efe7e3a242b@github.com> On Tue, 28 Jun 2022 16:38:20 GMT, Yi-Fan Tsai wrote: >> src/hotspot/share/code/codeBlob.cpp line 347: >> >>> 345: { >>> 346: MutexLocker mu(CodeCache_lock, Mutex::_no_safepoint_check_flag); >>> 347: int mhi_size = CodeBlob::allocation_size(code_buffer, sizeof(MethodHandleIntrinsicBlob)); >> >> The allocation size could also be computed before taking the code cache lock. BufferBlob also does this for example, but others don't. I think it makes sense to have it outside of the mutex block though, to minimize the time we need to hold the lock. I don't see anything in there that seems to require the lock. (Maybe we should clean up other cases in a followup as well). > > The issue was [filed](https://bugs.openjdk.org/browse/JDK-8289071) and [fixed](https://github.com/openjdk/jdk/pull/9266). Thanks! 
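
The point made in the review comment above, computing the allocation size before taking the lock, can be sketched in plain Java. This is only an analogy under stated assumptions: `SizeOutsideLockSketch`, `allocationSize`, and `allocate` are hypothetical names, the `ReentrantLock` merely stands in for `CodeCache_lock`, and the 8-byte rounding is invented for illustration. It is not the HotSpot code.

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative analogy only: all names here are hypothetical, not HotSpot APIs.
final class SizeOutsideLockSketch {
    private final ReentrantLock cacheLock = new ReentrantLock(); // stands in for CodeCache_lock
    private int used = 0; // shared state, guarded by cacheLock

    // Depends only on its arguments, so it needs no lock.
    static int allocationSize(int headerSize, int codeSize) {
        int alignedHeader = (headerSize + 7) & ~7; // round the header up to 8 bytes (assumed alignment)
        return alignedHeader + codeSize;
    }

    // Reserves space and returns the offset of the new allocation.
    int allocate(int headerSize, int codeSize) {
        int size = allocationSize(headerSize, codeSize); // computed before taking the lock
        cacheLock.lock();
        try {
            int offset = used; // only the shared-state update is inside the critical section
            used += size;
            return offset;
        } finally {
            cacheLock.unlock();
        }
    }

    public static void main(String[] args) {
        SizeOutsideLockSketch cache = new SizeOutsideLockSketch();
        System.out.println(cache.allocate(12, 100));
        System.out.println(cache.allocate(4, 8));
    }
}
```

Keeping the pure size computation outside the critical section shortens the time the lock is held, which is the motivation given in the comment.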
------------- PR: https://git.openjdk.org/jdk/pull/8760 From kvn at openjdk.org Tue Jun 28 21:08:43 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 28 Jun 2022 21:08:43 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v14] In-Reply-To: <_6iPSDvWGj8uGcVNGdwhRBa23bCVOVaMsUhY0crvxYM=.112ba1de-6a1a-417c-8446-3413a6ab8157@github.com> References: <_6iPSDvWGj8uGcVNGdwhRBa23bCVOVaMsUhY0crvxYM=.112ba1de-6a1a-417c-8446-3413a6ab8157@github.com> Message-ID: On Thu, 23 Jun 2022 23:08:20 GMT, Xin Liu wrote: >> I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. >> >> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. >> >> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. Besides runtime, the codecache utilization reduces from 1648 bytes to 1192 bytes, or 27.6% >> >> Before: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 32.776 ±
0.075 us/op >> >> Compiled method (c2) 281 636 4 MyBenchmark::testMethod (50 bytes) >> total in heap [0x00007fa1e49ab510,0x00007fa1e49abb80] = 1648 >> relocation [0x00007fa1e49ab670,0x00007fa1e49ab6b0] = 64 >> main code [0x00007fa1e49ab6c0,0x00007fa1e49ab940] = 640 >> stub code [0x00007fa1e49ab940,0x00007fa1e49ab968] = 40 >> oops [0x00007fa1e49ab968,0x00007fa1e49ab978] = 16 >> metadata [0x00007fa1e49ab978,0x00007fa1e49ab990] = 24 >> scopes data [0x00007fa1e49ab990,0x00007fa1e49aba60] = 208 >> scopes pcs [0x00007fa1e49aba60,0x00007fa1e49abb30] = 208 >> dependencies [0x00007fa1e49abb30,0x00007fa1e49abb38] = 8 >> handler table [0x00007fa1e49abb38,0x00007fa1e49abb68] = 48 >> nul chk table [0x00007fa1e49abb68,0x00007fa1e49abb80] = 24 >> >> After: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op >> >> Compiled method (c2) 288 633 4 MyBenchmark::testMethod (50 bytes) >> total in heap [0x00007f35189ab010,0x00007f35189ab4b8] = 1192 >> relocation [0x00007f35189ab170,0x00007f35189ab1a0] = 48 >> main code [0x00007f35189ab1a0,0x00007f35189ab360] = 448 >> stub code [0x00007f35189ab360,0x00007f35189ab388] = 40 >> oops [0x00007f35189ab388,0x00007f35189ab390] = 8 >> metadata [0x00007f35189ab390,0x00007f35189ab398] = 8 >> scopes data [0x00007f35189ab398,0x00007f35189ab408] = 112 >> scopes pcs [0x00007f35189ab408,0x00007f35189ab488] = 128 >> dependencies [0x00007f35189ab488,0x00007f35189ab490] = 8 >> handler table [0x00007f35189ab490,0x00007f35189ab4a8] = 24 >> nul chk table [0x00007f35189ab4a8,0x00007f35189ab4b8] = 16 >> ``` >> >> Testing >> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. > > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > remove _path from UnstableIfTrap. remember _next_bci(int) is enough. New changes are good except the change to flag. 
src/hotspot/share/opto/c2_globals.hpp line 420: > 418: \ > 419: develop(bool, OptimizeUnstableIf, true, \ > 420: "Optimize UnstableIf traps") \ New name is good. Why you changed it to `develop` flag which is available only in `debug` VM? I want to keep it `diagnostic` so we can switch it off in `product` VM too (with `-XX:+UnlockDiagnosticVMOptions` flag`) src/hotspot/share/opto/ifnode.cpp line 842: > 840: if (!igvn->C->too_many_traps(dom_method, dom_bci, Deoptimization::Reason_unstable_fused_if) && > 841: !igvn->C->too_many_traps(dom_method, dom_bci, Deoptimization::Reason_range_check) && > 842: igvn->C->remove_unstable_if_trap(dom_unc, true)) { Add comment about what remove_unstable_if_trap() does here. ------------- PR: https://git.openjdk.org/jdk/pull/8545 From kvn at openjdk.org Tue Jun 28 23:02:28 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 28 Jun 2022 23:02:28 GMT Subject: RFR: 8288022: c2: Transform (CastLL (AddL into (AddL (CastLL when possible In-Reply-To: References: Message-ID: <2013On5bksWCZIbW8hDS9qg-fRUnwD1FPb_XxVVz3es=.65fdb0f2-cb56-44dd-8af7-7d2d35b2dad3@github.com> On Mon, 13 Jun 2022 08:26:47 GMT, Roland Westrelin wrote: > This implements a transformation that already exists for CastII and > ConvI2L and helps code generation. The tricky part is that: > > (CastII (AddI into (AddI (CastII > > is performed by first computing the bounds of the type of the AddI. To > protect against overflow, jlong variables are used. With CastLL/AddL > nodes there's no larger integer type to promote the bounds to. As a > consequence the logic in the patch explicitly tests for overflow. That > logic is shared by the int and long cases. The previous logic for the > int cases that promotes values to long is used as verification. > > This patch also widens the type of CastLL nodes after loop opts the > way it's done for CastII/ConvI2L to allow commoning of nodes. > > This was observed to help with Memory Segment micro benchmarks. Looks good. 
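
The overflow concern in the quoted description can be illustrated with a small Java sketch. It is not the C2 code: `LongRangeAddSketch`, `addRanges`, and `intAddFits` are hypothetical names, and `Math.addExact` is used here only as one convenient way to test for overflow explicitly. For `int` bounds the sum can be computed in `long` and checked afterwards; for `long` bounds there is no wider primitive type, so each addition has to be checked itself.

```java
// Illustrative only: names are hypothetical and this is not the C2 implementation.
final class LongRangeAddSketch {
    // Returns {lo, hi} for [lo1, hi1] + [lo2, hi2], or null if either bound
    // overflows a long. There is no wider primitive to promote to, so the
    // overflow has to be tested explicitly.
    static long[] addRanges(long lo1, long hi1, long lo2, long hi2) {
        try {
            return new long[] { Math.addExact(lo1, lo2), Math.addExact(hi1, hi2) };
        } catch (ArithmeticException overflow) {
            return null;
        }
    }

    // The int case can promote the bounds to long and check the result afterwards.
    static boolean intAddFits(int lo1, int hi1, int lo2, int hi2) {
        long lo = (long) lo1 + lo2;
        long hi = (long) hi1 + hi2;
        return lo >= Integer.MIN_VALUE && hi <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long[] r = addRanges(0, 10, 5, 20);
        System.out.println(r[0] + ".." + r[1]);
        System.out.println(addRanges(0, Long.MAX_VALUE, 0, 1) == null);
        System.out.println(intAddFits(0, Integer.MAX_VALUE, 0, 1));
    }
}
```

A transformation that needs the bounds of the sum would simply be skipped when `addRanges` reports an overflow.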
------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9139 From kvn at openjdk.org Tue Jun 28 23:14:30 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 28 Jun 2022 23:14:30 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries [v5] In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 02:27:39 GMT, Yuta Sato wrote: >> When failing to load hsdis(Hot Spot Disassembler) library (because there is no library or hsdis.so is old and so on), >> there is no warning message (only can see info level messages if put -Xlog:os=info). >> This should show a warning message to tell the user that you failed to load libraries for hsdis. >> So I put a warning message to notify this. >> >> e.g. >> ` > > Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: > > Update full name Nice. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/8782 From kvn at openjdk.org Tue Jun 28 23:16:37 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 28 Jun 2022 23:16:37 GMT Subject: RFR: JDK-8288121: [JVMCI] Re-export the TerminatingThreadLocal functionality to the graal compiler. In-Reply-To: References: Message-ID: On Thu, 9 Jun 2022 15:01:26 GMT, Tomáš Zezula wrote: > JVMCI compilers need to release resources tied to a thread-local variable when the associated thread is exiting. The JDK internally uses the jdk.internal.misc.TerminatingThreadLocal class for this purpose. This pull request re-exports the TerminatingThreadLocal functionality to JVMCI compilers. @dougxc, can you review this? It is out of my expertise. 
------------- PR: https://git.openjdk.org/jdk/pull/9107 From kvn at openjdk.org Tue Jun 28 23:40:38 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 28 Jun 2022 23:40:38 GMT Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls In-Reply-To: References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> Message-ID: On Tue, 21 Jun 2022 18:35:33 GMT, Evgeny Astigeevich wrote: > > GHA testing is not clean. > > I looked through changes and they seem logically correct. Need more testing. I will wait when GHA is clean. > > Vladimir(@vnkozlov), Have you got testing results? What I meant is that I will not submit my own testing until GitHub action testing is clean. Which is not which means something is wrong with changes: https://github.com/openjdk/jdk/pull/8816/checks?check_run_id=6998367114 Please, fix issues and update to latest JDK sources. ------------- PR: https://git.openjdk.org/jdk/pull/8816 From duke at openjdk.org Wed Jun 29 01:13:43 2022 From: duke at openjdk.org (Yuta Sato) Date: Wed, 29 Jun 2022 01:13:43 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries [v4] In-Reply-To: References: Message-ID: <9PoWivpjrp-GN6ZfggpRQJ5aidHzAsq_r4BjeDP7OGg=.a2969c9f-aba9-4f3d-99e5-92cda03f699d@github.com> On Thu, 23 Jun 2022 08:08:06 GMT, Yasumasa Suenaga wrote: >> Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: >> >> change warning message > > Looks good. I will sponsor you if you need. Thank you for reviewing @YaSuenag @vnkozlov !! 
------------- PR: https://git.openjdk.org/jdk/pull/8782 From duke at openjdk.org Wed Jun 29 01:21:43 2022 From: duke at openjdk.org (Yuta Sato) Date: Wed, 29 Jun 2022 01:21:43 GMT Subject: Integrated: 8287001: Add warning message when fail to load hsdis libraries In-Reply-To: References: Message-ID: On Thu, 19 May 2022 06:37:28 GMT, Yuta Sato wrote: > When failing to load hsdis(Hot Spot Disassembler) library (because there is no library or hsdis.so is old and so on), > there is no warning message (only can see info level messages if put -Xlog:os=info). > This should show a warning message to tell the user that you failed to load libraries for hsdis. > So I put a warning message to notify this. > > e.g. > ` This pull request has now been integrated. Changeset: 779b4e1d Author: Yuta Sato Committer: Yasumasa Suenaga URL: https://git.openjdk.org/jdk/commit/779b4e1d1959bc15a27492b7e2b951678e39cca8 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod 8287001: Add warning message when fail to load hsdis libraries Reviewed-by: kvn, ysuenaga ------------- PR: https://git.openjdk.org/jdk/pull/8782 From jbhateja at openjdk.org Wed Jun 29 02:15:38 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 29 Jun 2022 02:15:38 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long [v3] In-Reply-To: References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Tue, 28 Jun 2022 12:42:57 GMT, Quan Anh Mai wrote: >> src/hotspot/cpu/x86/x86_64.ad line 13043: >> >>> 13041: __ cmpl($src1$$Register, $src2$$Register); >>> 13042: __ movl($dst$$Register, -1); >>> 13043: __ jccb(Assembler::below, done); >> >> By placing compare adjacent to conditional jump in-order frontend can trigger macro-fusion. >> Kindly refer section 3.4.2.2 of Intel's optimization manual. 
> > I realised that by swapping the `mov` and the `cmp` instruction, the rule needs to have `dst` different from `src1` and `src2`, which increases register pressure. I do not follow your comment, allocation decisions purely based on LRGs interferences and data flow attributes attached to operands and is agnostic to encoding block contents. ------------- PR: https://git.openjdk.org/jdk/pull/9068 From duke at openjdk.org Wed Jun 29 02:25:39 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Wed, 29 Jun 2022 02:25:39 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long [v3] In-Reply-To: References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Wed, 29 Jun 2022 02:12:24 GMT, Jatin Bhateja wrote: >> I realised that by swapping the `mov` and the `cmp` instruction, the rule needs to have `dst` different from `src1` and `src2`, which increases register pressure. > > I do not follow your comment, allocation decisions purely based on LRGs interferences and data flow attributes attached to operands and is agnostic to encoding block contents. Your suggestion requires us having additional `TEMP dst` for the match rule. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/9068 From jbhateja at openjdk.org Wed Jun 29 04:23:45 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 29 Jun 2022 04:23:45 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long [v3] In-Reply-To: References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Wed, 29 Jun 2022 02:22:02 GMT, Quan Anh Mai wrote: >> I do not follow your comment, allocation decisions purely based on LRGs interferences and data flow attributes attached to operands and is agnostic to encoding block contents. > > Your suggestion requires us having additional `TEMP dst` for the match rule. Thanks. 
Yes, macro fusion is a fine microarchitectural optimization which can reduce load on entire execution pipeline and is **deterministic** for specific pair of cmp + jump instructions, you have aggregated destination's defs and its usages towards the tail which can save TEMP attribution on destination operand and may save a redundant spill only for high register pressure blocks. I am ok with existing handling. Thanks for your explanations. ------------- PR: https://git.openjdk.org/jdk/pull/9068 From xliu at openjdk.org Wed Jun 29 05:23:40 2022 From: xliu at openjdk.org (Xin Liu) Date: Wed, 29 Jun 2022 05:23:40 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v15] In-Reply-To: References: Message-ID: > I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. 
Besides runtime, the codecache utilization reduces from 1648 bytes to 1192 bytes, or 27.6% > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op > > Compiled method (c2) 281 636 4 MyBenchmark::testMethod (50 bytes) > total in heap [0x00007fa1e49ab510,0x00007fa1e49abb80] = 1648 > relocation [0x00007fa1e49ab670,0x00007fa1e49ab6b0] = 64 > main code [0x00007fa1e49ab6c0,0x00007fa1e49ab940] = 640 > stub code [0x00007fa1e49ab940,0x00007fa1e49ab968] = 40 > oops [0x00007fa1e49ab968,0x00007fa1e49ab978] = 16 > metadata [0x00007fa1e49ab978,0x00007fa1e49ab990] = 24 > scopes data [0x00007fa1e49ab990,0x00007fa1e49aba60] = 208 > scopes pcs [0x00007fa1e49aba60,0x00007fa1e49abb30] = 208 > dependencies [0x00007fa1e49abb30,0x00007fa1e49abb38] = 8 > handler table [0x00007fa1e49abb38,0x00007fa1e49abb68] = 48 > nul chk table [0x00007fa1e49abb68,0x00007fa1e49abb80] = 24 > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op > > Compiled method (c2) 288 633 4 MyBenchmark::testMethod (50 bytes) > total in heap [0x00007f35189ab010,0x00007f35189ab4b8] = 1192 > relocation [0x00007f35189ab170,0x00007f35189ab1a0] = 48 > main code [0x00007f35189ab1a0,0x00007f35189ab360] = 448 > stub code [0x00007f35189ab360,0x00007f35189ab388] = 40 > oops [0x00007f35189ab388,0x00007f35189ab390] = 8 > metadata [0x00007f35189ab390,0x00007f35189ab398] = 8 > scopes data [0x00007f35189ab398,0x00007f35189ab408] = 112 > scopes pcs [0x00007f35189ab408,0x00007f35189ab488] = 128 > dependencies [0x00007f35189ab488,0x00007f35189ab490] = 8 > handler table [0x00007f35189ab490,0x00007f35189ab4a8] = 24 > nul chk table [0x00007f35189ab4a8,0x00007f35189ab4b8] = 16 > ``` > > Testing > I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. 
Xin Liu has updated the pull request incrementally with one additional commit since the last revision: restore the option OptimizeUnstableIf to diagnostic. 
also update commments. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/8545/files - new: https://git.openjdk.org/jdk/pull/8545/files/49bcc410..ed20a826 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=8545&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=8545&range=13-14 Stats: 15 lines in 4 files changed: 4 ins; 6 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/8545.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8545/head:pull/8545 PR: https://git.openjdk.org/jdk/pull/8545 From xliu at openjdk.org Wed Jun 29 05:23:43 2022 From: xliu at openjdk.org (Xin Liu) Date: Wed, 29 Jun 2022 05:23:43 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v14] In-Reply-To: References: <_6iPSDvWGj8uGcVNGdwhRBa23bCVOVaMsUhY0crvxYM=.112ba1de-6a1a-417c-8446-3413a6ab8157@github.com> Message-ID: On Tue, 28 Jun 2022 20:47:19 GMT, Vladimir Kozlov wrote: >> Xin Liu has updated the pull request incrementally with one additional commit since the last revision: >> >> remove _path from UnstableIfTrap. remember _next_bci(int) is enough. > > src/hotspot/share/opto/c2_globals.hpp line 420: > >> 418: \ >> 419: develop(bool, OptimizeUnstableIf, true, \ >> 420: "Optimize UnstableIf traps") \ > > New name is good. > > Why you changed it to `develop` flag which is available only in `debug` VM? I want to keep it `diagnostic` so we can switch it off in `product` VM too (with `-XX:+UnlockDiagnosticVMOptions` flag`) I see. revert it. 
------------- PR: https://git.openjdk.org/jdk/pull/8545 From jbhateja at openjdk.org Wed Jun 29 06:09:33 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 29 Jun 2022 06:09:33 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long [v3] In-Reply-To: References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: <452dySWhqSw4rDXLA1MQR3x3Nz3Xt4wYdJ0j7UYCVyA=.a3acfa1c-658b-4313-927a-0c47146a79e7@github.com> On Wed, 22 Jun 2022 03:01:36 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch implements intrinsics for `Integer/Long::compareUnsigned` using the same approach as the JVM does for long and floating-point comparisons. This allows efficient and reliable usage of unsigned comparison in Java, which is a basic operation and is important for range checks such as discussed in #8620 . >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > add comparison for direct value of compare Marked as reviewed by jbhateja (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/9068 From xgong at openjdk.org Wed Jun 29 06:13:36 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 29 Jun 2022 06:13:36 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v6] In-Reply-To: References: Message-ID: > VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. 
> > For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. > > Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. > > Here is an example for vector load and add reduction inside a loop: > > ptrue p0.s, vl8 ; mask generation > ld1w {z16.s}, p0/z, [x14] ; load vector > > ptrue p0.s, vl8 ; mask generation > uaddv d17, p0, z16.s ; add reduction > smov x14, v17.s[0] > > As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. > > Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. 
> 
> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system:
> 
> Benchmark                  size   Gain
> Byte256Vector.ADDLanes     1024   0.999
> Byte256Vector.ANDLanes     1024   1.065
> Byte256Vector.MAXLanes     1024   1.064
> Byte256Vector.MINLanes     1024   1.062
> Byte256Vector.ORLanes      1024   1.072
> Byte256Vector.XORLanes     1024   1.041
> Short256Vector.ADDLanes    1024   1.017
> Short256Vector.ANDLanes    1024   1.044
> Short256Vector.MAXLanes    1024   1.049
> Short256Vector.MINLanes    1024   1.049
> Short256Vector.ORLanes     1024   1.089
> Short256Vector.XORLanes    1024   1.047
> Int256Vector.ADDLanes      1024   1.045
> Int256Vector.ANDLanes      1024   1.078
> Int256Vector.MAXLanes      1024   1.123
> Int256Vector.MINLanes      1024   1.129
> Int256Vector.ORLanes       1024   1.078
> Int256Vector.XORLanes      1024   1.072
> Long256Vector.ADDLanes     1024   1.059
> Long256Vector.ANDLanes     1024   1.101
> Long256Vector.MAXLanes     1024   1.079
> Long256Vector.MINLanes     1024   1.099
> Long256Vector.ORLanes      1024   1.098
> Long256Vector.XORLanes     1024   1.110
> Float256Vector.ADDLanes    1024   1.033
> Float256Vector.MAXLanes    1024   1.156
> Float256Vector.MINLanes    1024   1.151
> Double256Vector.ADDLanes   1024   1.062
> Double256Vector.MAXLanes   1024   1.145
> Double256Vector.MINLanes   1024   1.140
> 
> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`.
So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below:
> 
>     sxtw x14, w14
>     whilelo p0.s, xzr, x14  =>  whilelo p0.s, wzr, w14

Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:

  Address review comments

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/9037/files
  - new: https://git.openjdk.org/jdk/pull/9037/files/caf5eb03..04a2d827

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=9037&range=05
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9037&range=04-05

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/9037.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9037/head:pull/9037

PR: https://git.openjdk.org/jdk/pull/9037

From xgong at openjdk.org  Wed Jun 29 06:13:40 2022
From: xgong at openjdk.org (Xiaohong Gong)
Date: Wed, 29 Jun 2022 06:13:40 GMT
Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v5]
In-Reply-To:
References:
Message-ID:

On Tue, 28 Jun 2022 03:53:26 GMT, Ningsheng Jian wrote:

>> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits:
>> 
>>  - Merge branch 'jdk:master' into JDK-8286941
>>  - Fix the ci build issue
>>  - Address review comments, revert changes for gatherL/scatterL rules
>>  - Merge branch 'jdk:master' into JDK-8286941
>>  - Revert transformation from MaskAll to VectorMaskGen, address review comments
>>  - 8286941: Add mask IR for partial vector operations for ARM SVE
> 
> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3636:
> 
>> 3634: #undef INSN
>> 3635: 
>> 3636: // SVE predicate generation (32-bit and 64-bit variants)
> 
> Suggestion:
> 
> // SVE integer compare scalar count and limit

Thanks for the comments! I've updated the patch.
------------- PR: https://git.openjdk.org/jdk/pull/9037 From njian at openjdk.org Wed Jun 29 06:22:17 2022 From: njian at openjdk.org (Ningsheng Jian) Date: Wed, 29 Jun 2022 06:22:17 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v6] In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 06:13:36 GMT, Xiaohong Gong wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. >> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. 
>> 
>> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out.
>> 
>> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system:
>> 
>> Benchmark                  size   Gain
>> Byte256Vector.ADDLanes     1024   0.999
>> Byte256Vector.ANDLanes     1024   1.065
>> Byte256Vector.MAXLanes     1024   1.064
>> Byte256Vector.MINLanes     1024   1.062
>> Byte256Vector.ORLanes      1024   1.072
>> Byte256Vector.XORLanes     1024   1.041
>> Short256Vector.ADDLanes    1024   1.017
>> Short256Vector.ANDLanes    1024   1.044
>> Short256Vector.MAXLanes    1024   1.049
>> Short256Vector.MINLanes    1024   1.049
>> Short256Vector.ORLanes     1024   1.089
>> Short256Vector.XORLanes    1024   1.047
>> Int256Vector.ADDLanes      1024   1.045
>> Int256Vector.ANDLanes      1024   1.078
>> Int256Vector.MAXLanes      1024   1.123
>> Int256Vector.MINLanes      1024   1.129
>> Int256Vector.ORLanes       1024   1.078
>> Int256Vector.XORLanes      1024   1.072
>> Long256Vector.ADDLanes     1024   1.059
>> Long256Vector.ANDLanes     1024   1.101
>> Long256Vector.MAXLanes     1024   1.079
>> Long256Vector.MINLanes     1024   1.099
>> Long256Vector.ORLanes      1024   1.098
>> Long256Vector.XORLanes     1024   1.110
>> Float256Vector.ADDLanes    1024   1.033
>> Float256Vector.MAXLanes    1024   1.156
>> Float256Vector.MINLanes    1024   1.151
>> Double256Vector.ADDLanes   1024   1.062
>> Double256Vector.MAXLanes   1024   1.145
>> Double256Vector.MINLanes   1024   1.140
>> 
>> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below:
>> 
>>     sxtw x14, w14
>>     whilelo p0.s, xzr, x14  =>  whilelo p0.s, wzr, w14
> 
> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Address review comments

AArch64 changes look good to me.

-------------

Marked as reviewed by njian (Committer).
PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.org Wed Jun 29 06:22:17 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 29 Jun 2022 06:22:17 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v6] In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 06:18:26 GMT, Ningsheng Jian wrote: > AArch64 changes look good to me. Thanks for the review @nsjian ! ------------- PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.org Wed Jun 29 06:30:41 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 29 Jun 2022 06:30:41 GMT Subject: RFR: 8287984: AArch64: [vector] Make all bits set vector sharable for match rules In-Reply-To: References: Message-ID: On Mon, 27 Jun 2022 01:37:03 GMT, Xiaohong Gong wrote: > We have the optimized rules for vector not/and_not in NEON and SVE, like: > > > match(Set dst (XorV src (ReplicateB m1))) ; vector not > match(Set dst (AndV src1 (XorV src2 (ReplicateB m1)))) ; vector and_not > > > where "`m1`" is a ConI node with value -1. And we also have the similar rules for vector mask in SVE like: > > > match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1)))) ; mask and_not > > > These rules are not easy to be matched since the "`Replicate`" or "`MaskAll`" node is usually not single used for the `not/and_not` operation. To make these rules be matched as expected, this patch adds the vector (mask) "`not`" pattern to `Matcher::pd_clone_node()` which makes the all bits set vector `(Replicate/MaskAll)` sharable during matching rules. Hi there, could anyone please take a look at this simple patch? Thanks so much! 
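
For readers less familiar with these rules, the scalar identity behind them can be sketched in plain Java: XOR with an all-bits-set value is a bitwise NOT, and AND with that result is an "and_not". This is only an illustration of the arithmetic, not the C2 matcher or the Vector API; the class and method names below are hypothetical.

```java
// Scalar illustration only: the names here are hypothetical and are not
// part of HotSpot. The constant -1 plays the role of the "m1" ConI node.
class AllBitsSetIdentity {
    // XOR with an all-bits-set value flips every bit, i.e. a bitwise NOT.
    static int notViaXor(int x) {
        return x ^ -1;
    }

    // AND with (b ^ -1) clears in a every bit that is set in b ("and_not").
    static int andNotViaXor(int a, int b) {
        return a & (b ^ -1);
    }

    public static void main(String[] args) {
        int a = 0b1100;
        int b = 0b1010;
        System.out.println(notViaXor(a) == ~a);
        System.out.println(andNotViaXor(a, b) == (a & ~b));
    }
}
```

Because the same all-bits-set value typically feeds several such operations, it has more than one use, which is the situation the patch description says prevents the combined rules from matching.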
------------- PR: https://git.openjdk.org/jdk/pull/9292 From duke at openjdk.org Wed Jun 29 06:54:07 2022 From: duke at openjdk.org (KIRIYAMA Takuya) Date: Wed, 29 Jun 2022 06:54:07 GMT Subject: RFR: 8289427: compiler/compilercontrol/jcmd/ClearDirectivesFileStackTest.java failed with null setting Message-ID: The problem of JDK-8289427 is caused by using incorrect compiler settings when the auto generated INTRINSIC parameter is null. I fixed it to use the appropriate value if the argument of cmd was null. Please review this change. ------------- Commit messages: - 8289427: compiler/compilercontrol/jcmd/ClearDirectivesFileStackTest.java failed with null setting Changes: https://git.openjdk.org/jdk/pull/9318/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9318&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289427 Stats: 3 lines in 2 files changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9318.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9318/head:pull/9318 PR: https://git.openjdk.org/jdk/pull/9318 From duke at openjdk.org Wed Jun 29 07:20:42 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Wed, 29 Jun 2022 07:20:42 GMT Subject: RFR: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long [v3] In-Reply-To: References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Wed, 22 Jun 2022 03:01:36 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch implements intrinsics for `Integer/Long::compareUnsigned` using the same approach as the JVM does for long and floating-point comparisons. This allows efficient and reliable usage of unsigned comparison in Java, which is a basic operation and is important for range checks such as discussed in #8620 . >> >> Thank you very much. 
> 
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   add comparison for direct value of compare

Thank you very much for your reviews

-------------

PR: https://git.openjdk.org/jdk/pull/9068

From dnsimon at openjdk.org  Wed Jun 29 07:49:38 2022
From: dnsimon at openjdk.org (Doug Simon)
Date: Wed, 29 Jun 2022 07:49:38 GMT
Subject: RFR: JDK-8288121: [JVMCI] Re-export the TerminatingThreadLocal functionality to the graal compiler.
In-Reply-To:
References:
Message-ID:

On Thu, 9 Jun 2022 15:01:26 GMT, Tomáš Zezula wrote:

> JVMCI compilers need to release resources tied to a thread-local variable when the associated thread is exiting. The JDK internally uses the jdk.internal.misc.TerminatingThreadLocal class for this purpose. This pull request re-exports the TerminatingThreadLocal functionality to JVMCI compilers.

Marked as reviewed by dnsimon (Committer).

-------------

PR: https://git.openjdk.org/jdk/pull/9107

From dnsimon at openjdk.org  Wed Jun 29 07:52:30 2022
From: dnsimon at openjdk.org (Doug Simon)
Date: Wed, 29 Jun 2022 07:52:30 GMT
Subject: RFR: JDK-8288121: [JVMCI] Re-export the TerminatingThreadLocal functionality to the graal compiler.
In-Reply-To:
References:
Message-ID:

On Wed, 29 Jun 2022 07:45:45 GMT, Doug Simon wrote:

>> JVMCI compilers need to release resources tied to a thread-local variable when the associated thread is exiting. The JDK internally uses the jdk.internal.misc.TerminatingThreadLocal class for this purpose. This pull request re-exports the TerminatingThreadLocal functionality to JVMCI compilers.
> 
> Marked as reviewed by dnsimon (Committer).

> @dougxc, can you review this? It is out of my expertise.

Since I helped craft this change, it should really get another reviewer. @dholmes-ora maybe you could look at it? TL;DR this PR is exposing `TerminatingThreadLocal` for use by Graal.
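
To make the idea concrete, here is a minimal sketch of a thread-local whose value is released when its thread finishes. It is a hypothetical stand-in written against public APIs only: `CleanupThreadLocal`, `wrap` and `runDemo` are invented names, and this is neither `jdk.internal.misc.TerminatingThreadLocal` nor the JVMCI API added by the PR. The real class is hooked into thread exit by the JDK itself, whereas this sketch has to wrap the task explicitly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical illustration; all names are assumptions, not JDK or JVMCI API.
class CleanupThreadLocalSketch {
    static final class CleanupThreadLocal<T> {
        private final ThreadLocal<T> slot = new ThreadLocal<>();
        private final Supplier<T> init;
        private final Consumer<T> onExit;

        CleanupThreadLocal(Supplier<T> init, Consumer<T> onExit) {
            this.init = init;
            this.onExit = onExit;
        }

        // Lazily creates the per-thread value on first access.
        T get() {
            T value = slot.get();
            if (value == null) {
                value = init.get();
                slot.set(value);
            }
            return value;
        }

        // Wraps a task so the cleanup hook runs once the task is done,
        // but only if this thread actually created a value.
        Runnable wrap(Runnable task) {
            return () -> {
                try {
                    task.run();
                } finally {
                    T value = slot.get();
                    if (value != null) {
                        onExit.accept(value);
                        slot.remove();
                    }
                }
            };
        }
    }

    // Runs one worker thread and returns the values released on its exit.
    static List<String> runDemo() {
        List<String> released = Collections.synchronizedList(new ArrayList<>());
        CleanupThreadLocal<String> buffer = new CleanupThreadLocal<>(
                () -> "buffer-of-" + Thread.currentThread().getName(),
                released::add);
        Thread worker = new Thread(buffer.wrap(() -> buffer.get()), "worker");
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return released;
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```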
-------------

PR: https://git.openjdk.org/jdk/pull/9107

From thartmann at openjdk.org  Wed Jun 29 08:12:43 2022
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Wed, 29 Jun 2022 08:12:43 GMT
Subject: RFR: 8289427: compiler/compilercontrol/jcmd/ClearDirectivesFileStackTest.java failed with null setting
In-Reply-To:
References:
Message-ID:

On Wed, 29 Jun 2022 06:47:07 GMT, KIRIYAMA Takuya wrote:

> The problem of JDK-8289427 is caused by using incorrect compiler settings when the auto generated INTRINSIC parameter is null.
> I fixed it to use the appropriate value if the argument of cmd was null.
> Please review this change.

Just wondering, what about the error reported in [JDK-8225370](https://bugs.openjdk.org/browse/JDK-8225370) then?

-------------

PR: https://git.openjdk.org/jdk/pull/9318

From rrich at openjdk.org  Wed Jun 29 08:31:06 2022
From: rrich at openjdk.org (Richard Reingruber)
Date: Wed, 29 Jun 2022 08:31:06 GMT
Subject: RFR: 8289434: x86_64: Improve comment on gen_continuation_enter()
Message-ID: <6cp6qV9w9mWrUZKkucomdNg9eHYUVj06unri-TFLlbY=.a793a9e6-9fd9-4451-8fb2-0c8df241e893@github.com>

Change code comments for `gen_continuation_enter()` explaining that the generated code will call `Continuation.enter(Continuation c, boolean isContinue)` if the continuation given as the first parameter is run for the first time. Also mention the special case for resolving this call.
------------- Commit messages: - 8289434: Improve comment on gen_continuation_enter() Changes: https://git.openjdk.org/jdk/pull/9320/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9320&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289434 Stats: 6 lines in 1 file changed: 3 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/9320.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9320/head:pull/9320 PR: https://git.openjdk.org/jdk/pull/9320 From ysuenaga at openjdk.org Wed Jun 29 09:05:04 2022 From: ysuenaga at openjdk.org (Yasumasa Suenaga) Date: Wed, 29 Jun 2022 09:05:04 GMT Subject: RFR: 8289421: No-PCH build for Minimal VM was broken by JDK-8287001 Message-ID: We see following error if we pass `--with-jvm-variants=minimal --disable-precompiled-headers` to configure script. src/hotspot/share/compiler/disassembler.cpp:841:5: error: 'log_warning' was not declared in this scope; did you mean 'warning'? [2022-06-29T01:25:45,429Z] 841 | log_warning(os)("Loading hsdis library failed"); [2022-06-29T01:25:45,429Z] | ^~~~~~~~~~~ [2022-06-29T01:25:45,429Z] | warning Missing include of log.hpp ------------- Commit messages: - 8289421: No-PCH build for Minimal VM was broken by JDK-8287001 Changes: https://git.openjdk.org/jdk/pull/9323/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9323&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289421 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9323.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9323/head:pull/9323 PR: https://git.openjdk.org/jdk/pull/9323 From mbaesken at openjdk.org Wed Jun 29 09:10:39 2022 From: mbaesken at openjdk.org (Matthias Baesken) Date: Wed, 29 Jun 2022 09:10:39 GMT Subject: RFR: 8289421: No-PCH build for Minimal VM was broken by JDK-8287001 In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 08:57:54 GMT, Yasumasa Suenaga wrote: > We see following error if we pass `--with-jvm-variants=minimal 
--disable-precompiled-headers` to configure script. > > > src/hotspot/share/compiler/disassembler.cpp:841:5: error: 'log_warning' was not declared in this scope; did you mean 'warning'? > [2022-06-29T01:25:45,429Z] 841 | log_warning(os)("Loading hsdis library failed"); > [2022-06-29T01:25:45,429Z] | ^~~~~~~~~~~ > [2022-06-29T01:25:45,429Z] | warning > > > Missing include of log.hpp Marked as reviewed by mbaesken (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9323 From jbhateja at openjdk.org Wed Jun 29 09:15:05 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 29 Jun 2022 09:15:05 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. Message-ID: Hi All, [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added the support to handle masked loads on non-predicated targets by blending the loaded contents with zero vector iff unmasked portion of load does not span beyond array bounds. X86 AVX2 offers direct predicated vector loads/store instruction for non-sub word type. This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors. Please find below the JMH micro stats with and without patch. 
System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server]

Baseline:
Benchmark                                          (inSize)  (outSize)   Mode  Cnt     Score   Error   Units
LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE        1026       1152  thrpt    2   712.218          ops/ms
LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE      1026       1152  thrpt    2   156.912          ops/ms
LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE       1026       1152  thrpt    2   255.814          ops/ms
LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE         1026       1152  thrpt    2   267.688          ops/ms
LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE        1026       1152  thrpt    2   140.957          ops/ms
LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE       1026       1152  thrpt    2   474.009          ops/ms

With Opt:
Benchmark                                          (inSize)  (outSize)   Mode  Cnt     Score   Error   Units
LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE        1026       1152  thrpt    2   742.781          ops/ms
LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE      1026       1152  thrpt    2  1241.021          ops/ms
LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE       1026       1152  thrpt    2  2333.311          ops/ms
LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE         1026       1152  thrpt    2  3258.754          ops/ms
LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE        1026       1152  thrpt    2  1757.192          ops/ms
LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE       1026       1152  thrpt    2   472.590          ops/ms

Predicated memory operation over sub-word type will be handled in a subsequent patch.

Kindly review and share your feedback.

Best Regards,
Jatin

-------------

Commit messages:
 - 8289186: Support predicated vector load/store operations over X86 AVX2 targets.
Changes: https://git.openjdk.org/jdk/pull/9324/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9324&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289186 Stats: 226 lines in 15 files changed: 180 ins; 38 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/9324.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9324/head:pull/9324 PR: https://git.openjdk.org/jdk/pull/9324 From jiefu at openjdk.org Wed Jun 29 09:15:29 2022 From: jiefu at openjdk.org (Jie Fu) Date: Wed, 29 Jun 2022 09:15:29 GMT Subject: RFR: 8289421: No-PCH build for Minimal VM was broken by JDK-8287001 In-Reply-To: References: Message-ID: <94YTjwDpKTnmdD3wymqUsZywBUxpR6v7GvhCq8go7U0=.1359c947-2e72-4196-98c1-c8dc7b220bf2@github.com> On Wed, 29 Jun 2022 08:57:54 GMT, Yasumasa Suenaga wrote: > We see following error if we pass `--with-jvm-variants=minimal --disable-precompiled-headers` to configure script. > > > src/hotspot/share/compiler/disassembler.cpp:841:5: error: 'log_warning' was not declared in this scope; did you mean 'warning'? > [2022-06-29T01:25:45,429Z] 841 | log_warning(os)("Loading hsdis library failed"); > [2022-06-29T01:25:45,429Z] | ^~~~~~~~~~~ > [2022-06-29T01:25:45,429Z] | warning > > > Missing include of log.hpp Looks good and trivial. ------------- Marked as reviewed by jiefu (Reviewer). PR: https://git.openjdk.org/jdk/pull/9323 From stuefe at openjdk.org Wed Jun 29 09:15:29 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Wed, 29 Jun 2022 09:15:29 GMT Subject: RFR: 8289421: No-PCH build for Minimal VM was broken by JDK-8287001 In-Reply-To: References: Message-ID: <0VjubHQp40rShI3wZ3Me0dM_m8b6Aw8UTKdLJVYp1uo=.e3d1eff0-4417-4cf3-816d-45e1e0782706@github.com> On Wed, 29 Jun 2022 08:57:54 GMT, Yasumasa Suenaga wrote: > We see following error if we pass `--with-jvm-variants=minimal --disable-precompiled-headers` to configure script. > > > src/hotspot/share/compiler/disassembler.cpp:841:5: error: 'log_warning' was not declared in this scope; did you mean 'warning'? 
> [2022-06-29T01:25:45,429Z] 841 | log_warning(os)("Loading hsdis library failed"); > [2022-06-29T01:25:45,429Z] | ^~~~~~~~~~~ > [2022-06-29T01:25:45,429Z] | warning > > > Missing include of log.hpp Ok. Thanks for the quick fix. ------------- Marked as reviewed by stuefe (Reviewer). PR: https://git.openjdk.org/jdk/pull/9323 From ysuenaga at openjdk.org Wed Jun 29 09:19:30 2022 From: ysuenaga at openjdk.org (Yasumasa Suenaga) Date: Wed, 29 Jun 2022 09:19:30 GMT Subject: RFR: 8289421: No-PCH build for Minimal VM was broken by JDK-8287001 In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 08:57:54 GMT, Yasumasa Suenaga wrote: > We see following error if we pass `--with-jvm-variants=minimal --disable-precompiled-headers` to configure script. > > > src/hotspot/share/compiler/disassembler.cpp:841:5: error: 'log_warning' was not declared in this scope; did you mean 'warning'? > [2022-06-29T01:25:45,429Z] 841 | log_warning(os)("Loading hsdis library failed"); > [2022-06-29T01:25:45,429Z] | ^~~~~~~~~~~ > [2022-06-29T01:25:45,429Z] | warning > > > Missing include of log.hpp Thank you for quick review! I will integrate it after GBA just in case. ------------- PR: https://git.openjdk.org/jdk/pull/9323 From duke at openjdk.org Wed Jun 29 10:37:06 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Wed, 29 Jun 2022 10:37:06 GMT Subject: Integrated: 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long In-Reply-To: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> References: <5VdXfCDIgQMXnjDWmtsd2dZ9lnGu9X-mOuSyWQqzDfI=.8aa5c0c6-ac1d-401c-9aa1-b82e49e4a98a@github.com> Message-ID: On Tue, 7 Jun 2022 17:14:18 GMT, Quan Anh Mai wrote: > Hi, > > This patch implements intrinsics for `Integer/Long::compareUnsigned` using the same approach as the JVM does for long and floating-point comparisons. 
This allows efficient and reliable usage of unsigned comparison in Java, which is a basic operation and is important for range checks such as discussed in #8620 . > > Thank you very much. This pull request has now been integrated. Changeset: 108cd695 Author: Quan Anh Mai Committer: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/108cd695167f0eed7b778c29b55914998f15b90d Stats: 271 lines in 15 files changed: 260 ins; 0 del; 11 mod 8283726: x86_64 intrinsics for compareUnsigned method in Integer and Long Reviewed-by: kvn, jbhateja ------------- PR: https://git.openjdk.org/jdk/pull/9068 From ysuenaga at openjdk.org Wed Jun 29 11:47:54 2022 From: ysuenaga at openjdk.org (Yasumasa Suenaga) Date: Wed, 29 Jun 2022 11:47:54 GMT Subject: Integrated: 8289421: No-PCH build for Minimal VM was broken by JDK-8287001 In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 08:57:54 GMT, Yasumasa Suenaga wrote: > We see following error if we pass `--with-jvm-variants=minimal --disable-precompiled-headers` to configure script. > > > src/hotspot/share/compiler/disassembler.cpp:841:5: error: 'log_warning' was not declared in this scope; did you mean 'warning'? > [2022-06-29T01:25:45,429Z] 841 | log_warning(os)("Loading hsdis library failed"); > [2022-06-29T01:25:45,429Z] | ^~~~~~~~~~~ > [2022-06-29T01:25:45,429Z] | warning > > > Missing include of log.hpp This pull request has now been integrated. 
Changeset: 167ce4da Author: Yasumasa Suenaga URL: https://git.openjdk.org/jdk/commit/167ce4dae248024ffda0439c3ccc6b12404eadaf Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8289421: No-PCH build for Minimal VM was broken by JDK-8287001 Reviewed-by: mbaesken, jiefu, stuefe ------------- PR: https://git.openjdk.org/jdk/pull/9323 From duke at openjdk.org Wed Jun 29 14:50:59 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Wed, 29 Jun 2022 14:50:59 GMT Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls [v2] In-Reply-To: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> Message-ID: > ## Problem > Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter. > > Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides the address of the stub and the address of its owner. `relocInfo` has an offset which is an offset from the previously known relocatable address. The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset. > > Each Java call has: > - A relocation for a call site. > - A relocation for a stub to the interpreter. > - A stub to the interpreter. > - If far jumps are used (arm64 case): > - A trampoline relocation. > - A trampoline. > > We cannot avoid creating relocations. They are needed to support patching call sites. > With shared stubs there will be multiple relocations having the same stub address but different owners' addresses. 
> If we try to generate relocations as we go, there will be a case which requires negative offsets:
> 
> reloc1 ---> 0x0: stub1
> reloc2 ---> 0x4: stub2 (reloc2.addr = reloc1.addr + reloc2.offset = 0x0 + 4)
> reloc3 ---> 0x0: stub1 (reloc3.addr = reloc2.addr + reloc3.offset = 0x4 - 4)
> 
> 
> `CodeSection` does not support negative offsets. It [assumes](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L195) that the addresses relocations point at grow upward.
> Negative offsets reduce the offset range by half. This can increase the number of filler records, the empty `relocInfo` records used to reduce offset values. Also, negative offsets are only needed for `static_stub_type`; the other 13 types don't need them.
> 
> ## Solution
> In this PR creation of stubs is done in two stages. First we collect requests for creating shared stubs: a callee `ciMethod*` and an offset of a call in `CodeBuffer` (see [src/hotspot/share/asm/codeBuffer.hpp](https://github.com/openjdk/jdk/pull/8816/files#diff-deb8ab083311ba60c0016dc34d6518579bbee4683c81e8d348982bac897fe8ae)). Then we have the finalisation phase (see [src/hotspot/share/ci/ciEnv.cpp](https://github.com/openjdk/jdk/pull/8816/files#diff-7c032de54e85754d39e080fd24d49b7469543b163f54229eb0631c6b1bf26450)), where `CodeBuffer::finalize_stubs()` creates shared stubs in `CodeBuffer`: a stub and multiple relocations sharing it. The first relocation will have a positive offset. The rest will have zero offsets. This approach does not need negative offsets. As creation of relocations and stubs is platform dependent, `CodeBuffer::finalize_stubs()` calls `CodeBuffer::pd_finalize_stubs()` where platforms should put their code.
> 
> This PR provides implementations for x86, x86_64 and aarch64.
> [src/hotspot/share/asm/codeBuffer.inline.hpp](https://github.com/openjdk/jdk/pull/8816/files#diff-c268e3719578f2980edaa27c0eacbe9f620124310108eb65d0f765212c7042eb) provides the `emit_shared_stubs_to_interp` template which x86, x86_64 and aarch64 platforms use. Other platforms can use it too. Platforms supporting shared stubs to the interpreter must have `CodeBuffer::supports_shared_stubs()` returning `true`.
>
> ## Results
> **Results from [Renaissance 0.14.0](https://github.com/renaissance-benchmarks/renaissance/releases/tag/v0.14.0)**
> Note: 'Nmethods with shared stubs' is the total number of nmethods counted during benchmark's run. 'Final # of nmethods' is a number of nmethods in CodeCache when JVM exited.
> - AArch64
>
> +------------------+-------------+----------------------------+---------------------+
> | Benchmark        | Saved bytes | Nmethods with shared stubs | Final # of nmethods |
> +------------------+-------------+----------------------------+---------------------+
> | dotty            |      820544 |                       4592 |               18872 |
> | dec-tree         |      405280 |                       2580 |               22335 |
> | naive-bayes      |      392384 |                       2586 |               21184 |
> | log-regression   |      362208 |                       2450 |               20325 |
> | als              |      306048 |                       2226 |               18161 |
> | finagle-chirper  |      262304 |                       2087 |               12675 |
> | movie-lens       |      250112 |                       1937 |               13617 |
> | gauss-mix        |      173792 |                       1262 |               10304 |
> | finagle-http     |      164320 |                       1392 |               11269 |
> | page-rank        |      155424 |                       1175 |               10330 |
> | chi-square       |      140384 |                       1028 |                9480 |
> | akka-uct         |      115136 |                        541 |                3941 |
> | reactors         |       43264 |                        335 |                2503 |
> | scala-stm-bench7 |       42656 |                        326 |                3310 |
> | philosophers     |       36576 |                        256 |                2902 |
> | scala-doku       |       35008 |                        231 |                2695 |
> | rx-scrabble      |       32416 |                        273 |                2789 |
> | future-genetic   |       29408 |                        260 |                2339 |
> | scrabble         |       27968 |                        225 |                2477 |
> | par-mnemonics    |       19584 |                        168 |                1689 |
> | fj-kmeans        |       19296 |                        156 |                1647 |
> | scala-kmeans     |       18080 |                        140 |                1629 |
> | mnemonics        |       17408 |                        143 |                1512 |
> +------------------+-------------+----------------------------+---------------------+
>
> - X86_64
>
> +------------------+-------------+----------------------------+---------------------+
> | Benchmark        | Saved bytes | Nmethods with shared stubs | Final # of nmethods |
> +------------------+-------------+----------------------------+---------------------+
> | dotty            |      337065 |                       4403 |               19135 |
> | dec-tree         |      183045 |                       2559 |               22071 |
> | naive-bayes      |      176460 |                       2450 |               19782 |
> | log-regression   |      162555 |                       2410 |               20648 |
> | als              |      121275 |                       1980 |               17179 |
> | movie-lens       |      111915 |                       1842 |               13020 |
> | finagle-chirper  |      106350 |                       1947 |               12726 |
> | gauss-mix        |       81975 |                       1251 |               10474 |
> | finagle-http     |       80895 |                       1523 |               12294 |
> | page-rank        |       68940 |                       1146 |               10124 |
> | chi-square       |       62130 |                        974 |                9315 |
> | akka-uct         |       50220 |                        555 |                4263 |
> | reactors         |       23385 |                        371 |                2544 |
> | philosophers     |       17625 |                        259 |                2865 |
> | scala-stm-bench7 |       17235 |                        295 |                3230 |
> | scala-doku       |       15600 |                        214 |                2698 |
> | rx-scrabble      |       14190 |                        262 |                2770 |
> | future-genetic   |       13155 |                        253 |                2318 |
> | scrabble         |       12300 |                        217 |                2352 |
> | fj-kmeans        |        8985 |                        157 |                1616 |
> | par-mnemonics    |        8535 |                        155 |                1684 |
> | scala-kmeans     |        8250 |                        138 |                1624 |
> | mnemonics        |        7485 |                        134 |                1522 |
> +------------------+-------------+----------------------------+---------------------+
>
> **Testing: fastdebug and release builds for x86, x86_64 and aarch64**
> - `tier1`...`tier4`: Passed
> - `hotspot/jtreg/compiler/sharedstubs`: Passed

Evgeny Astigeevich has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 20 additional commits since the last revision: - Merge branch 'master' into JDK-8280481C - Use call offset instead of caller pc - Simplify test - Fix x86 build failures - Remove UseSharedStubs and clarify shared stub use cases - Make SharedStubToInterpRequest ResourceObj and set initial size of SharedStubToInterpRequests to 8 - Update copyright year and add Unimplemented guards - Set UseSharedStubs to true for X86 - Set UseSharedStubs to true for AArch64 - Fix x86 build failure - ... and 10 more: https://git.openjdk.org/jdk/compare/c88b2f38...da3bfb5b ------------- Changes: - all: https://git.openjdk.org/jdk/pull/8816/files - new: https://git.openjdk.org/jdk/pull/8816/files/a249f7da..da3bfb5b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=8816&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=8816&range=00-01 Stats: 183157 lines in 2793 files changed: 96216 ins; 65112 del; 21829 mod Patch: https://git.openjdk.org/jdk/pull/8816.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8816/head:pull/8816 PR: https://git.openjdk.org/jdk/pull/8816 From duke at openjdk.org Wed Jun 29 15:06:43 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Wed, 29 Jun 2022 15:06:43 GMT Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls [v2] In-Reply-To: References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> Message-ID: On Tue, 28 Jun 2022 23:37:10 GMT, Vladimir Kozlov wrote: > What I meant is that I will not submit my own testing until GitHub action testing is clean. Which is not which means something is wrong with changes: https://github.com/openjdk/jdk/pull/8816/checks?check_run_id=6998367114 > > Please, fix issues and update to latest JDK sources. Sorry, my fault. I did not realise GHA is GitHub actions. Linux x86 failures were caused by Loom project changes. Some x86 parts of the project have not been implemented yet. 
I updated to the latest JDK sources and they disappeared. I am investigating macOS x64 failure. ------------- PR: https://git.openjdk.org/jdk/pull/8816 From thartmann at openjdk.org Wed Jun 29 15:27:21 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 29 Jun 2022 15:27:21 GMT Subject: [jdk19] RFR: 8284358: Unreachable loop is not removed from C2 IR, leading to a broken graph Message-ID: Similar to https://github.com/openjdk/jdk/pull/425 and https://github.com/openjdk/jdk/pull/649, entry control to a loop `RegionNode` dies right after parsing (during first IGVN) but the dead loop is not detected/removed. This dead loop then keeps a subgraph alive, which leads to two different failures in later optimization phases that are described below. I assumed that such dead loops should always be detected, but to avoid a full reachability analysis (graph walk to root), C2 only detects and removes "unsafe" dead loops, i.e., dead loops that might cause issues for later optimization phases and should therefore be aggressively removed. See `RegionNode::Ideal` -> `RegionNode::is_unreachable_region` -> `RegionNode::is_possible_unsafe_loop`: https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L541-L549 https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L327-L331 Here is a detailed description of the two failures and the corresponding fixes: 1) `No reachable node should have no use` assert at the end of optimizations (introduced by [JDK-8263577](https://bugs.openjdk.org/browse/JDK-8263577)): At the beginning of CCP, the types of all nodes are initialized to `top`. 
Since the following subgraph is not reachable from root due to a dead loop above in the CFG, the types of all unreachable nodes remain top: ![1_BeforeCCP](https://user-images.githubusercontent.com/5312595/176446327-e6fdee4d-49ea-4406-9b15-b29366cd9f55.png) The `Rethrow`, `Phis` and `Region` are removed during IGVN because they are `top` but the `292 CatchProj` remains: ![3_BarrierExpand](https://user-images.githubusercontent.com/5312595/176446385-0374b6ba-7c0b-447d-90f9-c73e3aee4918.png) We then hit the assert because the `CatchProj` has no user. Similar to how https://github.com/openjdk/jdk/pull/3012 was fixed, we need to make sure that when `RegionNode` inputs are cut off because their types are `top`, they are added to the IGVN worklist (see change in `cfgnode.cpp:504`). With that, the entire dead subgraph is removed. 2) `Unknown node on this path` assert while walking the memory graph during scalar replacement: After parsing, the `167 Region` that belongs to a loop loses entry control (marked in red): ![2_Diff_Parsing_IGVN](https://user-images.githubusercontent.com/5312595/176453465-95f48c16-6cb7-4373-baa8-edf5e4fbcde2.png) The dead loop is not detected/removed because it's not considered "unsafe" since the Phis of the dying Region only have a Call user which is considered safe: https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L352-L355 ![DyingRegion](https://user-images.githubusercontent.com/5312595/176469880-f81a7d7e-b769-444a-bf5b-14f8cca1f9af.png) The same can happen with other CFG users (for example, MemBars or Allocates). These scenarios are also covered by the regression test. 
Later during IGVN, `309 Region` which is part of the now dead subgraph is processed and found to be potentially "unsafe" and unreachable from root: ![1_AfterParsing](https://user-images.githubusercontent.com/5312595/176453110-8a4a587f-f1ef-45bf-8a68-e476f142aa7e.png) It's then removed together with its Phi users, leaving `505 MergeMem` with a top memory input: ![3_MacroExpansion](https://user-images.githubusercontent.com/5312595/176461343-ab446fe0-04a8-48a5-95c2-c8ead6c872cf.png) We then hit the assert when encountering a top memory input while walking the memory graph during scalar replacement. The root cause of the failure is an only partially removed dead subgraph. A similar issue has been fixed long ago by [JDK-8075922](https://bugs.openjdk.org/browse/JDK-8075922), but the fix is incomplete. I propose to aggressively remove such dead subgraphs by walking up the CFG when detecting an unreachable Region belonging to an "unsafe" loop and replacing all nodes by `top`. Special thanks to Christian Hagedorn for helping me with finding a regression test. 
Thanks, Tobias ------------- Commit messages: - Added missing -XX:+UnlockDiagnosticVMOptions - 8284358: Unreachable loop is not removed from C2 IR, leading to a broken graph Changes: https://git.openjdk.org/jdk19/pull/92/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=92&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8284358 Stats: 289 lines in 2 files changed: 250 ins; 3 del; 36 mod Patch: https://git.openjdk.org/jdk19/pull/92.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/92/head:pull/92 PR: https://git.openjdk.org/jdk19/pull/92 From aph at openjdk.org Wed Jun 29 16:31:44 2022 From: aph at openjdk.org (Andrew Haley) Date: Wed, 29 Jun 2022 16:31:44 GMT Subject: RFR: 8289060: Undefined Behaviour in class VMReg [v2] In-Reply-To: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> References: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> Message-ID: <8sSuBL9t7D3HdX0sUzHE3822qv5Yuf9cGuNuvX_2ECA=.3706d933-4972-480b-8628-a17b48eefa88@github.com> > Like class `Register`, class `VMReg` exhibits undefined behaviour, in particular null pointer dereferences. > > The right way to fix this is simple: make instances of `VMReg` point to reified instances of `VMRegImpl`. We do this by creating a static array of `VMRegImpl`, and making all `VMReg` instances point into it, making the code well defined. > > However, while `VMReg` instances are no longer null, and so do not generate compile warnings or errors, there is still a problem in that higher-numbered `VMReg` instances point outside the static array of `VMRegImpl`. This is hard to avoid, given that (as far as I can tell) there is no upper limit on the number of stack slots that can be allocated as `VMReg` instances. While this is in theory UB, it's not likely to cause problems. We could fix this by creating a much larger static array of `VMRegImpl`, up to the largest plausible size of stack offsets. 
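As an illustration of the approach described above, here is a minimal sketch. `ToyVMRegImpl`, `toy_all_regs`, `toy_as_reg` and the array size of 64 are assumptions made for this example, not the names or values used in the patch. Every register handle points at a real element of a static array, so converting between a handle and its number is plain pointer arithmetic within one array and never involves a null pointer.

```cpp
#include <cassert>

// Hypothetical stand-in for VMRegImpl; it carries no data of its own.
class ToyVMRegImpl {
 public:
  int value() const;  // defined below, once the backing array is declared
};

// Hypothetical stand-in for VMReg: a pointer to a reified ToyVMRegImpl.
typedef const ToyVMRegImpl* ToyVMReg;

const int kToyRegisterCount = 64;              // assumed size of the backing array
ToyVMRegImpl toy_all_regs[kToyRegisterCount];  // one real object per register number

// Map a register number to a handle pointing into the static array.
inline ToyVMReg toy_as_reg(int number) {
  assert(number >= 0 && number < kToyRegisterCount);
  return &toy_all_regs[number];
}

// Recover the number by subtracting the array base; both pointers refer to
// the same array, so the subtraction is well defined.
inline int ToyVMRegImpl::value() const {
  return static_cast<int>(this - toy_all_regs);
}
```

For example, `toy_as_reg(5)->value()` yields 5. Unlike the patch, which as noted still lets higher-numbered instances point beyond the array, this toy helper simply rejects out-of-range numbers.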
> > We could instead make `VMReg` instances objects with a single numeric field rather than pointers, but some C++ compilers pass all such objects by reference, so I don't think we should. Andrew Haley has updated the pull request incrementally with two additional commits since the last revision: - 8289060: Undefined Behaviour in class VMReg - 8289060: Undefined Behaviour in class VMReg ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9276/files - new: https://git.openjdk.org/jdk/pull/9276/files/bb201e77..ab85170c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9276&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9276&range=00-01 Stats: 20 lines in 4 files changed: 7 ins; 2 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/9276.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9276/head:pull/9276 PR: https://git.openjdk.org/jdk/pull/9276 From aph at openjdk.org Wed Jun 29 16:35:06 2022 From: aph at openjdk.org (Andrew Haley) Date: Wed, 29 Jun 2022 16:35:06 GMT Subject: RFR: 8289060: Undefined Behaviour in class VMReg [v3] In-Reply-To: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> References: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> Message-ID: > Like class `Register`, class `VMReg` exhibits undefined behaviour, in particular null pointer dereferences. > > The right way to fix this is simple: make instances of `VMReg` point to reified instances of `VMRegImpl`. We do this by creating a static array of `VMRegImpl`, and making all `VMReg` instances point into it, making the code well defined. > > However, while `VMReg` instances are no longer null, and so do not generate compile warnings or errors, there is still a problem in that higher-numbered `VMReg` instances point outside the static array of `VMRegImpl`. 
This is hard to avoid, given that (as far as I can tell) there is no upper limit on the number of stack slots that can be allocated as `VMReg` instances. While this is in theory UB, it's not likely to cause problems. We could fix this by creating a much larger static array of `VMRegImpl`, up to the largest plausible size of stack offsets. > > We could instead make `VMReg` instances objects with a single numeric field rather than pointers, but some C++ compilers pass all such objects by reference, so I don't think we should. Andrew Haley has updated the pull request incrementally with two additional commits since the last revision: - 8289060: Undefined Behaviour in class VMReg - 8289060: Undefined Behaviour in class VMReg ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9276/files - new: https://git.openjdk.org/jdk/pull/9276/files/ab85170c..62c71eeb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9276&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9276&range=01-02 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9276.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9276/head:pull/9276 PR: https://git.openjdk.org/jdk/pull/9276 From kvn at openjdk.org Wed Jun 29 16:57:28 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 29 Jun 2022 16:57:28 GMT Subject: RFR: 8289427: compiler/compilercontrol/jcmd/ClearDirectivesFileStackTest.java failed with null setting In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 06:47:07 GMT, KIRIYAMA Takuya wrote: > The problem of JDK-8289427 is caused by using incorrect compiler settings when the auto generated INTRINSIC parameter is null. > I fixed it to use the appropriate value if the argument of cmd was null. > Please review this change. The fix looks fine but I don't think you can remove the test from Problem list which references other **not** fixed bug. 
-------------

PR: https://git.openjdk.org/jdk/pull/9318

From kvn at openjdk.org Wed Jun 29 18:32:45 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Wed, 29 Jun 2022 18:32:45 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v15]
In-Reply-To: References: Message-ID:

On Wed, 29 Jun 2022 05:23:40 GMT, Xin Liu wrote:

>> I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps.
>>
>> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement.
>>
>> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. Besides runtime, the codecache utilization reduces from 1648 bytes to 1192 bytes, or 27.6%
>>
>> Before:
>>
>>     Benchmark               Mode  Cnt   Score   Error  Units
>>     MyBenchmark.testMethod  avgt   10  32.776 ± 0.075  us/op
>>
>>     Compiled method (c2) 281 636 4 MyBenchmark::testMethod (50 bytes)
>>      total in heap  [0x00007fa1e49ab510,0x00007fa1e49abb80] = 1648
>>      relocation     [0x00007fa1e49ab670,0x00007fa1e49ab6b0] = 64
>>      main code      [0x00007fa1e49ab6c0,0x00007fa1e49ab940] = 640
>>      stub code      [0x00007fa1e49ab940,0x00007fa1e49ab968] = 40
>>      oops           [0x00007fa1e49ab968,0x00007fa1e49ab978] = 16
>>      metadata       [0x00007fa1e49ab978,0x00007fa1e49ab990] = 24
>>      scopes data    [0x00007fa1e49ab990,0x00007fa1e49aba60] = 208
>>      scopes pcs     [0x00007fa1e49aba60,0x00007fa1e49abb30] = 208
>>      dependencies   [0x00007fa1e49abb30,0x00007fa1e49abb38] = 8
>>      handler table  [0x00007fa1e49abb38,0x00007fa1e49abb68] = 48
>>      nul chk table  [0x00007fa1e49abb68,0x00007fa1e49abb80] = 24
>>
>> After:
>>
>>     Benchmark               Mode  Cnt  Score   Error  Units
>>     MyBenchmark.testMethod  avgt   10  3.656 ± 0.006  us/op
>>
>>     Compiled method (c2) 288 633 4 MyBenchmark::testMethod (50 bytes)
>>      total in heap  [0x00007f35189ab010,0x00007f35189ab4b8] = 1192
>>      relocation     [0x00007f35189ab170,0x00007f35189ab1a0] = 48
>>      main code      [0x00007f35189ab1a0,0x00007f35189ab360] = 448
>>      stub code      [0x00007f35189ab360,0x00007f35189ab388] = 40
>>      oops           [0x00007f35189ab388,0x00007f35189ab390] = 8
>>      metadata       [0x00007f35189ab390,0x00007f35189ab398] = 8
>>      scopes data    [0x00007f35189ab398,0x00007f35189ab408] = 112
>>      scopes pcs     [0x00007f35189ab408,0x00007f35189ab488] = 128
>>      dependencies   [0x00007f35189ab488,0x00007f35189ab490] = 8
>>      handler table  [0x00007f35189ab490,0x00007f35189ab4a8] = 24
>>      nul chk table  [0x00007f35189ab4a8,0x00007f35189ab4b8] = 16
>>
>> Testing
>> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet.
>
> Xin Liu has updated the pull request incrementally with one additional commit since the last revision:
>
> restore the option OptimizeUnstableIf to diagnostic.
>
> also update commments.

Looks good. I ran it through tier1 to make sure this last change did not break it.
------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/8545 From xliu at openjdk.org Wed Jun 29 18:39:51 2022 From: xliu at openjdk.org (Xin Liu) Date: Wed, 29 Jun 2022 18:39:51 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v15] In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 05:23:40 GMT, Xin Liu wrote: > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > restore the option OptimizeUnstableIf to diagnostic. > > also update commments. I take a look at the build failure on ubuntu/i386.
The problem is that apt-get can't install gcc toolchains from a custom repo [http://ppa.launchpad.net/ubuntu-toolchain-r/]. it's not caused by this PR. ------------- PR: https://git.openjdk.org/jdk/pull/8545 From kvn at openjdk.org Wed Jun 29 18:57:56 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 29 Jun 2022 18:57:56 GMT Subject: [jdk19] RFR: 8284358: Unreachable loop is not removed from C2 IR, leading to a broken graph In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 15:16:03 GMT, Tobias Hartmann wrote: > Similar to https://github.com/openjdk/jdk/pull/425 and https://github.com/openjdk/jdk/pull/649, entry control to a loop `RegionNode` dies right after parsing (during first IGVN) but the dead loop is not detected/removed. This dead loop then keeps a subgraph alive, which leads to two different failures in later optimization phases that are described below. > > I assumed that such dead loops should always be detected, but to avoid a full reachability analysis (graph walk to root), C2 only detects and removes "unsafe" dead loops, i.e., dead loops that might cause issues for later optimization phases and should therefore be aggressively removed. See `RegionNode::Ideal` -> `RegionNode::is_unreachable_region` -> `RegionNode::is_possible_unsafe_loop`: > > https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L541-L549 > > https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L327-L331 > > Here is a detailed description of the two failures and the corresponding fixes: > > 1) `No reachable node should have no use` assert at the end of optimizations (introduced by [JDK-8263577](https://bugs.openjdk.org/browse/JDK-8263577)): > > At the beginning of CCP, the types of all nodes are initialized to `top`. 
Since the following subgraph is not reachable from root due to a dead loop above in the CFG, the types of all unreachable nodes remain top: > ![1_BeforeCCP](https://user-images.githubusercontent.com/5312595/176446327-e6fdee4d-49ea-4406-9b15-b29366cd9f55.png) > > The `Rethrow`, `Phis` and `Region` are removed during IGVN because they are `top` but the `292 CatchProj` remains: > > ![3_BarrierExpand](https://user-images.githubusercontent.com/5312595/176446385-0374b6ba-7c0b-447d-90f9-c73e3aee4918.png) > > We then hit the assert because the `CatchProj` has no user. Similar to how https://github.com/openjdk/jdk/pull/3012 was fixed, we need to make sure that when `RegionNode` inputs are cut off because their types are `top`, they are added to the IGVN worklist (see change in `cfgnode.cpp:504`). With that, the entire dead subgraph is removed. > > 2) `Unknown node on this path` assert while walking the memory graph during scalar replacement: > > After parsing, the `167 Region` that belongs to a loop loses entry control (marked in red): > ![2_Diff_Parsing_IGVN](https://user-images.githubusercontent.com/5312595/176453465-95f48c16-6cb7-4373-baa8-edf5e4fbcde2.png) > > The dead loop is not detected/removed because it's not considered "unsafe" since the Phis of the dying Region only have a Call user which is considered safe: > > https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L352-L355 > > ![DyingRegion](https://user-images.githubusercontent.com/5312595/176469880-f81a7d7e-b769-444a-bf5b-14f8cca1f9af.png) > > The same can happen with other CFG users (for example, MemBars or Allocates). These scenarios are also covered by the regression test. 
Later during IGVN, `309 Region` which is part of the now dead subgraph is processed and found to be potentially "unsafe" and unreachable from root: > > ![1_AfterParsing](https://user-images.githubusercontent.com/5312595/176453110-8a4a587f-f1ef-45bf-8a68-e476f142aa7e.png) > > It's then removed together with its Phi users, leaving `505 MergeMem` with a top memory input: > > ![3_MacroExpansion](https://user-images.githubusercontent.com/5312595/176461343-ab446fe0-04a8-48a5-95c2-c8ead6c872cf.png) > > We then hit the assert when encountering a top memory input while walking the memory graph during scalar replacement. > > The root cause of the failure is an only partially removed dead subgraph. A similar issue has been fixed long ago by [JDK-8075922](https://bugs.openjdk.org/browse/JDK-8075922), but the fix is incomplete. I propose to aggressively remove such dead subgraphs by walking up the CFG when detecting an unreachable Region belonging to an "unsafe" loop and replacing all nodes by `top`. > > Special thanks to Christian Hagedorn for helping me with finding a regression test. > > Thanks, > Tobias src/hotspot/share/opto/cfgnode.cpp line 504: > 502: } > 503: if( phase->type(n) == Type::TOP ) { > 504: set_req_X(i, NULL, phase); // Ignore TOP inputs This is not guarded by `can_reshape` (call from IGVN). It is not correct to use set_req_X() during parsing. 
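To illustrate the point about `can_reshape`, the sketch below is a toy model, not HotSpot code. `ToyNode`, `ToyPhase` and `toy_drop_top_inputs` are hypothetical, and the worklist push merely stands in for the side effects that are only valid in an iterative (IGVN-like) pass. The assumption here is that a plain edge update is fine in any phase, while anything touching the worklist must be guarded.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: hypothetical stand-ins, not HotSpot's Node or PhaseGVN.
struct ToyNode {
  std::vector<ToyNode*> inputs;
  bool is_top = false;  // true if the node's type is "top" (dead)
};

struct ToyPhase {
  bool iterative;                  // true for an IGVN-like pass, false while parsing
  std::vector<ToyNode*> worklist;  // only meaningful when iterative is true
};

// Cut off inputs whose type is top and return how many were removed.
// can_reshape mirrors the flag mentioned in the review: the worklist is only
// touched when the caller allows reshaping and the phase is iterative.
inline int toy_drop_top_inputs(ToyNode& region, ToyPhase& phase, bool can_reshape) {
  int removed = 0;
  for (std::size_t i = 0; i < region.inputs.size(); ++i) {
    ToyNode* in = region.inputs[i];
    if (in == nullptr || !in->is_top) continue;
    region.inputs[i] = nullptr;  // plain edge update, done in every phase
    ++removed;
    if (can_reshape && phase.iterative) {
      phase.worklist.push_back(in);  // revisit the cut-off input later
    }
  }
  return removed;
}
```

Called with `can_reshape == false`, as during parsing in this toy setup, the function still cuts the edge but leaves the worklist untouched.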
------------- PR: https://git.openjdk.org/jdk19/pull/92 From kvn at openjdk.org Wed Jun 29 19:00:56 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 29 Jun 2022 19:00:56 GMT Subject: RFR: 8289434: x86_64: Improve comment on gen_continuation_enter() In-Reply-To: <6cp6qV9w9mWrUZKkucomdNg9eHYUVj06unri-TFLlbY=.a793a9e6-9fd9-4451-8fb2-0c8df241e893@github.com> References: <6cp6qV9w9mWrUZKkucomdNg9eHYUVj06unri-TFLlbY=.a793a9e6-9fd9-4451-8fb2-0c8df241e893@github.com> Message-ID: <1nuvsHk8C2AdMO8LSF8XQyGuyX3qEf_NMNoY2DMBQ0k=.96f8bb71-d5d4-4c02-820b-f50aa06dce77@github.com> On Wed, 29 Jun 2022 08:23:36 GMT, Richard Reingruber wrote: > Change code comments for `gen_continuation_enter()` explaining that the generated code will call `Continuation.enter(Continuation c, boolean isContinue)` if the continuation give as first parameter is run for the first time. > > Also mention the special case for resolving this call. Trivial. Thank you for fixing comments in this new code. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9320 From kvn at openjdk.org Wed Jun 29 21:09:31 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 29 Jun 2022 21:09:31 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v15] In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 05:23:40 GMT, Xin Liu wrote: > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > restore the option OptimizeUnstableIf to diagnostic. > > also update commments. Good. 
------------- PR: https://git.openjdk.org/jdk/pull/8545 From duke at openjdk.org Wed Jun 29 21:19:32 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Wed, 29 Jun 2022 21:19:32 GMT Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls [v2] In-Reply-To: References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> Message-ID: <1aiCitX9Awl030q7myghYyOwZNfqJMIdCMmGm9jfoOQ=.2a5a7cd7-9e57-4538-8e22-7a9cf7523343@github.com> On Tue, 28 Jun 2022 23:37:10 GMT, Vladimir Kozlov wrote: >>> GHA testing is not clean. >>> >>> I looked through changes and they seem logically correct. Need more testing. I will wait when GHA is clean.
>> >> Vladimir(@vnkozlov), Have you got testing results? > > What I meant is that I will not submit my own testing until GitHub action testing is clean. Which is not which means something is wrong with changes: > https://github.com/openjdk/jdk/pull/8816/checks?check_run_id=6998367114 > > Please, fix issues and update to latest JDK sources. @vnkozlov, with updating to the latest sources everything passed: https://github.com/eastig/jdk/actions/runs/2583924985 ------------- PR: https://git.openjdk.org/jdk/pull/8816 From kvn at openjdk.org Wed Jun 29 22:17:52 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 29 Jun 2022 22:17:52 GMT Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls [v2] In-Reply-To: References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> Message-ID: On Wed, 29 Jun 2022 14:50:59 GMT, Evgeny Astigeevich wrote: >> ## Problem >> Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter. >> >> Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides the address of the stub and the address of its owner. `relocInfo` has an offset which is an offset from the previously known relocatable address. The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset. >> >> Each Java call has: >> - A relocation for a call site. >> - A relocation for a stub to the interpreter. >> - A stub to the interpreter. >> - If far jumps are used (arm64 case): >> - A trampoline relocation. >> - A trampoline. >> >> We cannot avoid creating relocations. They are needed to support patching call sites. 
>> With shared stubs there will be multiple relocations having the same stub address but different owners' addresses.
>> If we try to generate relocations as we go there will be a case which requires negative offsets:
>>
>> reloc1 ---> 0x0: stub1
>> reloc2 ---> 0x4: stub2 (reloc2.addr = reloc1.addr + reloc2.offset = 0x0 + 4)
>> reloc3 ---> 0x0: stub1 (reloc3.addr = reloc2.addr + reloc3.offset = 0x4 - 4)
>>
>>
>> `CodeSection` does not support negative offsets. It [assumes](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L195) that the addresses relocations point at grow upward.
>> Negative offsets reduce the offset range by half. This can increase the number of filler records, the empty `relocInfo` records used to reduce offset values. Also, negative offsets are only needed for `static_stub_type`; the other 13 types don't need them.
>>
>> ## Solution
>> In this PR creation of stubs is done in two stages. First we collect requests for creating shared stubs: a callee `ciMethod*` and an offset of a call in `CodeBuffer` (see [src/hotspot/share/asm/codeBuffer.hpp](https://github.com/openjdk/jdk/pull/8816/files#diff-deb8ab083311ba60c0016dc34d6518579bbee4683c81e8d348982bac897fe8ae)). Then we have the finalisation phase (see [src/hotspot/share/ci/ciEnv.cpp](https://github.com/openjdk/jdk/pull/8816/files#diff-7c032de54e85754d39e080fd24d49b7469543b163f54229eb0631c6b1bf26450)), where `CodeBuffer::finalize_stubs()` creates shared stubs in `CodeBuffer`: a stub and multiple relocations sharing it. The first relocation will have a positive offset. The rest will have zero offsets. This approach does not need negative offsets. As creation of relocations and stubs is platform dependent, `CodeBuffer::finalize_stubs()` calls `CodeBuffer::pd_finalize_stubs()` where platforms should put their code.
>>
>> This PR provides implementations for x86, x86_64 and aarch64.
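[Editorial sketch, not part of the original message.] The offset problem and the two-stage fix described above can be modelled in a few lines of C++. Everything here is hypothetical and only illustrates the arithmetic: `StubRequest`, `Reloc`, `emit_as_we_go`, `emit_grouped` and the 4-byte stub size are assumptions for the example, not HotSpot's `relocInfo` or `CodeBuffer` code.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-ins; these are NOT HotSpot types.
struct StubRequest {
  int callee;       // stands in for a ciMethod*
  int call_offset;  // offset of the call site in the code buffer
};

struct Reloc {
  int call_offset;  // owner of the stub
  int stub_addr;    // address the relocation points at
  int delta;        // stub_addr minus the previous relocation's stub_addr
};

constexpr int kStubSize = 4;  // assumed stub size, chosen to match the 0x0/0x4 example

// Emit a relocation for each call in call order, reusing a stub when the
// callee was seen before. Reuse can make the delta negative.
std::vector<Reloc> emit_as_we_go(const std::vector<StubRequest>& reqs) {
  std::map<int, int> stub_of;  // callee -> stub address
  std::vector<Reloc> out;
  int next = 0;
  int prev = 0;
  for (const StubRequest& r : reqs) {
    auto it = stub_of.find(r.callee);
    if (it == stub_of.end()) {
      it = stub_of.emplace(r.callee, next).first;
      next += kStubSize;
    }
    out.push_back({r.call_offset, it->second, it->second - prev});
    prev = it->second;
  }
  return out;
}

// Two stages: first group the requests by callee, then emit one stub per
// callee followed by all relocations that share it.
std::vector<Reloc> emit_grouped(const std::vector<StubRequest>& reqs) {
  std::vector<int> order;                 // callees in first-seen order
  std::map<int, std::vector<int>> calls;  // callee -> call offsets
  for (const StubRequest& r : reqs) {
    std::vector<int>& v = calls[r.callee];
    if (v.empty()) order.push_back(r.callee);
    v.push_back(r.call_offset);
  }
  std::vector<Reloc> out;
  int next = 0;
  int prev = 0;
  for (int callee : order) {
    const int stub = next;
    next += kStubSize;
    assert(stub >= prev);  // stub addresses only grow in this scheme
    for (int call : calls[callee]) {
      out.push_back({call, stub, stub - prev});
      prev = stub;
    }
  }
  return out;
}

bool has_negative_delta(const std::vector<Reloc>& relocs) {
  for (const Reloc& r : relocs) {
    if (r.delta < 0) return true;
  }
  return false;
}
```

With the requests `{A, B, A}`, `emit_as_we_go` reproduces the `0x4 - 4` case from the example, while `emit_grouped` only produces zero or positive deltas.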
[src/hotspot/share/asm/codeBuffer.inline.hpp](https://github.com/openjdk/jdk/pull/8816/files#diff-c268e3719578f2980edaa27c0eacbe9f620124310108eb65d0f765212c7042eb) provides the `emit_shared_stubs_to_interp` template which x86, x86_64 and aarch64 platforms use. Other platforms can use it too. Platforms supporting shared stubs to the interpreter must have `CodeBuffer::supports_shared_stubs()` returning `true`. >> >> ## Results >> **Results from [Renaissance 0.14.0](https://github.com/renaissance-benchmarks/renaissance/releases/tag/v0.14.0)** >> Note: 'Nmethods with shared stubs' is the total number of nmethods counted during benchmark's run. 'Final # of nmethods' is a number of nmethods in CodeCache when JVM exited. >> - AArch64 >> >> +------------------+-------------+----------------------------+---------------------+ >> | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | >> +------------------+-------------+----------------------------+---------------------+ >> | dotty | 820544 | 4592 | 18872 | >> | dec-tree | 405280 | 2580 | 22335 | >> | naive-bayes | 392384 | 2586 | 21184 | >> | log-regression | 362208 | 2450 | 20325 | >> | als | 306048 | 2226 | 18161 | >> | finagle-chirper | 262304 | 2087 | 12675 | >> | movie-lens | 250112 | 1937 | 13617 | >> | gauss-mix | 173792 | 1262 | 10304 | >> | finagle-http | 164320 | 1392 | 11269 | >> | page-rank | 155424 | 1175 | 10330 | >> | chi-square | 140384 | 1028 | 9480 | >> | akka-uct | 115136 | 541 | 3941 | >> | reactors | 43264 | 335 | 2503 | >> | scala-stm-bench7 | 42656 | 326 | 3310 | >> | philosophers | 36576 | 256 | 2902 | >> | scala-doku | 35008 | 231 | 2695 | >> | rx-scrabble | 32416 | 273 | 2789 | >> | future-genetic | 29408 | 260 | 2339 | >> | scrabble | 27968 | 225 | 2477 | >> | par-mnemonics | 19584 | 168 | 1689 | >> | fj-kmeans | 19296 | 156 | 1647 | >> | scala-kmeans | 18080 | 140 | 1629 | >> | mnemonics | 17408 | 143 | 1512 | >> 
+------------------+-------------+----------------------------+---------------------+ >> >> - X86_64 >> >> +------------------+-------------+----------------------------+---------------------+ >> | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | >> +------------------+-------------+----------------------------+---------------------+ >> | dotty | 337065 | 4403 | 19135 | >> | dec-tree | 183045 | 2559 | 22071 | >> | naive-bayes | 176460 | 2450 | 19782 | >> | log-regression | 162555 | 2410 | 20648 | >> | als | 121275 | 1980 | 17179 | >> | movie-lens | 111915 | 1842 | 13020 | >> | finagle-chirper | 106350 | 1947 | 12726 | >> | gauss-mix | 81975 | 1251 | 10474 | >> | finagle-http | 80895 | 1523 | 12294 | >> | page-rank | 68940 | 1146 | 10124 | >> | chi-square | 62130 | 974 | 9315 | >> | akka-uct | 50220 | 555 | 4263 | >> | reactors | 23385 | 371 | 2544 | >> | philosophers | 17625 | 259 | 2865 | >> | scala-stm-bench7 | 17235 | 295 | 3230 | >> | scala-doku | 15600 | 214 | 2698 | >> | rx-scrabble | 14190 | 262 | 2770 | >> | future-genetic | 13155 | 253 | 2318 | >> | scrabble | 12300 | 217 | 2352 | >> | fj-kmeans | 8985 | 157 | 1616 | >> | par-mnemonics | 8535 | 155 | 1684 | >> | scala-kmeans | 8250 | 138 | 1624 | >> | mnemonics | 7485 | 134 | 1522 | >> +------------------+-------------+----------------------------+---------------------+ >> >> >> **Testing: fastdebug and release builds for x86, x86_64 and aarch64** >> - `tier1`...`tier4`: Passed >> - `hotspot/jtreg/compiler/sharedstubs`: Passed > > Evgeny Astigeevich has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 20 additional commits since the last revision: > > - Merge branch 'master' into JDK-8280481C > - Use call offset instead of caller pc > - Simplify test > - Fix x86 build failures > - Remove UseSharedStubs and clarify shared stub use cases > - Make SharedStubToInterpRequest ResourceObj and set initial size of SharedStubToInterpRequests to 8 > - Update copyright year and add Unimplemented guards > - Set UseSharedStubs to true for X86 > - Set UseSharedStubs to true for AArch64 > - Fix x86 build failure > - ... and 10 more: https://git.openjdk.org/jdk/compare/48ec70a5...da3bfb5b Good. I submitted testing. Will let you know results. ------------- PR: https://git.openjdk.org/jdk/pull/8816 From kvn at openjdk.org Thu Jun 30 02:07:44 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 02:07:44 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 09:07:48 GMT, Jatin Bhateja wrote: > Hi All, > > [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added the support to handle masked loads on non-predicated targets by blending the loaded contents with zero vector iff unmasked portion of load does not span beyond array bounds. > > X86 AVX2 offers direct predicated vector loads/store instruction for non-sub word type. > > This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors. > > Please find below the JMH micro stats with and without patch. 
>
>
>
> System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server]
>
> Baseline:
> Benchmark                                          (inSize)  (outSize)   Mode  Cnt     Score   Error   Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE        1026       1152  thrpt    2   712.218          ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE      1026       1152  thrpt    2   156.912          ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE       1026       1152  thrpt    2   255.814          ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE         1026       1152  thrpt    2   267.688          ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE        1026       1152  thrpt    2   140.957          ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE       1026       1152  thrpt    2   474.009          ops/ms
>
>
> With Opt:
> Benchmark                                          (inSize)  (outSize)   Mode  Cnt     Score   Error   Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE        1026       1152  thrpt    2   742.781          ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE      1026       1152  thrpt    2  1241.021          ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE       1026       1152  thrpt    2  2333.311          ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE         1026       1152  thrpt    2  3258.754          ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE        1026       1152  thrpt    2  1757.192          ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE       1026       1152  thrpt    2   472.590          ops/ms
>
>
> Predicated memory operation over sub-word type will be handled in a subsequent patch.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin

src/hotspot/cpu/x86/x86.ad line 1762:

> 1760: break;
> 1761: case Op_LoadVectorMasked:
> 1762: if (!VM_Version::supports_avx512bw() && (is_subword_type(bt) || UseAVX < 1)) {

With `UseAVX=0` we clear `supports_avx512bw`. So the test should be

    if (!VM_Version::supports_avx512bw() && is_subword_type(bt) || UseAVX < 1)

And maybe a naive question: is VectorMaskGen used for `mask` node creation? If so, why have separate support checks for `LoadVectorMasked/StoreVectorMasked`?
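[Editorial sketch, not part of the original message.] The relationship between the two forms of the condition can be checked by brute force. The sketch below models the three inputs as plain values; the function names are hypothetical, and the only fact taken from the review is the invariant that `UseAVX=0` clears `supports_avx512bw`.

```cpp
#include <cassert>

// Condition as written in the patch under review.
bool rejects_as_written(bool avx512bw, bool subword, int use_avx) {
  return !avx512bw && (subword || use_avx < 1);
}

// Condition as suggested in the review; && binds tighter than ||.
bool rejects_as_suggested(bool avx512bw, bool subword, int use_avx) {
  return (!avx512bw && subword) || use_avx < 1;
}

// Returns true when both forms agree for every state that satisfies the
// invariant "use_avx == 0 implies avx512bw is false".
bool forms_agree_under_invariant() {
  for (int bw = 0; bw <= 1; ++bw) {
    for (int sub = 0; sub <= 1; ++sub) {
      for (int avx = 0; avx <= 3; ++avx) {
        if (avx < 1 && bw) continue;  // state excluded by the invariant
        if (rejects_as_written(bw, sub, avx) != rejects_as_suggested(bw, sub, avx)) {
          return false;
        }
      }
    }
  }
  return true;
}

void check_forms() {
  assert(forms_agree_under_invariant());
  // Without the invariant the two forms differ for (avx512bw, UseAVX=0).
  assert(rejects_as_written(true, false, 0) != rejects_as_suggested(true, false, 0));
}
```

Under the stated invariant the two expressions select the same cases, so in this model the suggested form differs only in how the dependence on `UseAVX` is spelled out.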
src/hotspot/share/opto/vectorIntrinsics.cpp line 313: > 311: return true; > 312: } > 313: Why it is placed here without `is_supported` check? Comment does not explain it. ------------- PR: https://git.openjdk.org/jdk/pull/9324 From kvn at openjdk.org Thu Jun 30 02:16:54 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 02:16:54 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v6] In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 06:13:36 GMT, Xiaohong Gong wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. >> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. 
The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. >> >> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. >> >> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: >> >> Benchmark size Gain >> Byte256Vector.ADDLanes 1024 0.999 >> Byte256Vector.ANDLanes 1024 1.065 >> Byte256Vector.MAXLanes 1024 1.064 >> Byte256Vector.MINLanes 1024 1.062 >> Byte256Vector.ORLanes 1024 1.072 >> Byte256Vector.XORLanes 1024 1.041 >> Short256Vector.ADDLanes 1024 1.017 >> Short256Vector.ANDLanes 1024 1.044 >> Short256Vector.MAXLanes 1024 1.049 >> Short256Vector.MINLanes 1024 1.049 >> Short256Vector.ORLanes 1024 1.089 >> Short256Vector.XORLanes 1024 1.047 >> Int256Vector.ADDLanes 1024 1.045 >> Int256Vector.ANDLanes 1024 1.078 >> Int256Vector.MAXLanes 1024 1.123 >> Int256Vector.MINLanes 1024 1.129 >> Int256Vector.ORLanes 1024 1.078 >> Int256Vector.XORLanes 1024 1.072 >> Long256Vector.ADDLanes 1024 1.059 >> Long256Vector.ANDLanes 1024 1.101 >> Long256Vector.MAXLanes 1024 1.079 >> Long256Vector.MINLanes 1024 1.099 >> Long256Vector.ORLanes 1024 1.098 >> Long256Vector.XORLanes 1024 1.110 >> Float256Vector.ADDLanes 1024 1.033 >> Float256Vector.MAXLanes 1024 1.156 >> Float256Vector.MINLanes 1024 1.151 >> Double256Vector.ADDLanes 1024 1.062 >> Double256Vector.MAXLanes 1024 1.145 >> Double256Vector.MINLanes 1024 1.140 >> >> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. 
So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: >> >> sxtw x14, w14 >> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments I submitted testing. Will let you know results before approval. ------------- PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.org Thu Jun 30 02:16:54 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 30 Jun 2022 02:16:54 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v6] In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 02:11:56 GMT, Vladimir Kozlov wrote: > I submitted testing. Will let you know results before approval. Thanks a lot for doing this! ------------- PR: https://git.openjdk.org/jdk/pull/9037 From kvn at openjdk.org Thu Jun 30 02:24:41 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 02:24:41 GMT Subject: RFR: 8287984: AArch64: [vector] Make all bits set vector sharable for match rules In-Reply-To: References: Message-ID: <-x0i-Km0CmLdV1Ucs6OmeY3rsjpmFeoCFVDVlyzUock=.0af6f00d-642b-4ae4-a4e2-16e87c25b6cc@github.com> On Mon, 27 Jun 2022 01:37:03 GMT, Xiaohong Gong wrote: > We have the optimized rules for vector not/and_not in NEON and SVE, like: > > > match(Set dst (XorV src (ReplicateB m1))) ; vector not > match(Set dst (AndV src1 (XorV src2 (ReplicateB m1)))) ; vector and_not > > > where "`m1`" is a ConI node with value -1. And we also have the similar rules for vector mask in SVE like: > > > match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1)))) ; mask and_not > > > These rules are not easy to be matched since the "`Replicate`" or "`MaskAll`" node is usually not single used for the `not/and_not` operation. 
To make these rules match as expected, this patch adds the vector (mask) "`not`" pattern to `Matcher::pd_clone_node()`, which makes the all bits set vector `(Replicate/MaskAll)` sharable during matching rules.

Changes look reasonable. Someone familiar with aarch64 code has to review it too. I will run testing before approval.

-------------

PR: https://git.openjdk.org/jdk/pull/9292

From kvn at openjdk.org Thu Jun 30 02:45:33 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 30 Jun 2022 02:45:33 GMT
Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations
In-Reply-To: References: Message-ID: <6JoVanlMpB_1LAukbkHpvzyqO-ZALtXbAQYcoAsH3Dw=.03400a5f-0ec7-4a3f-ad9f-379fe779bf8d@github.com>

On Mon, 20 Jun 2022 07:50:09 GMT, Xiaohong Gong wrote:

> This patch adds the following transformations for vector logic operations such as "`AndV, OrV, XorV`", including:
>
> (AndV v (Replicate m1)) => v
> (AndV v (Replicate zero)) => Replicate zero
> (AndV v v) => v
>
> (OrV v (Replicate m1)) => Replicate m1
> (OrV v (Replicate zero)) => v
> (OrV v v) => v
>
> (XorV v v) => Replicate zero
>
> where "`m1`" is the integer constant -1, together with the same optimizations for vector mask operations like "`AndVMask, OrVMask, XorVMask`".

src/hotspot/share/opto/vectornode.cpp line 1802:

> 1800: // (AndV (Replicate zero) src) => (Replicate zero)
> 1801: // (AndVMask (MaskAll zero) src) => (MaskAll zero)
> 1802: if (VectorNode::is_all_zeros_vector(in(1))) {

Why do you expect it to be `in(1)` instead of `in(2)` as in the previous case? Do we create inputs in such order based on mask value? At least add a comment explaining it.

src/hotspot/share/opto/vectornode.cpp line 1838:

> 1836: // (OrV src (Replicate zero)) => src
> 1837: // (OrVMask src (MaskAll zero)) => src
> 1838: if (VectorNode::is_all_zeros_vector(in(2))) {

The same question as for `AndVNode`.
src/hotspot/share/opto/vectornode.cpp line 1869: > 1867: // (XorV src src) => (Replicate zero) > 1868: // (XorVMask src src) => (MaskAll zero) > 1869: // Do we really need this? Is `Replicate` asm instruction faster than `XorV`? I understand it may help reduce registers pressure. ------------- PR: https://git.openjdk.org/jdk/pull/9211 From kvn at openjdk.org Thu Jun 30 02:50:39 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 02:50:39 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v6] In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 06:13:36 GMT, Xiaohong Gong wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. >> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. 
To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. >> >> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. >> >> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: >> >> Benchmark size Gain >> Byte256Vector.ADDLanes 1024 0.999 >> Byte256Vector.ANDLanes 1024 1.065 >> Byte256Vector.MAXLanes 1024 1.064 >> Byte256Vector.MINLanes 1024 1.062 >> Byte256Vector.ORLanes 1024 1.072 >> Byte256Vector.XORLanes 1024 1.041 >> Short256Vector.ADDLanes 1024 1.017 >> Short256Vector.ANDLanes 1024 1.044 >> Short256Vector.MAXLanes 1024 1.049 >> Short256Vector.MINLanes 1024 1.049 >> Short256Vector.ORLanes 1024 1.089 >> Short256Vector.XORLanes 1024 1.047 >> Int256Vector.ADDLanes 1024 1.045 >> Int256Vector.ANDLanes 1024 1.078 >> Int256Vector.MAXLanes 1024 1.123 >> Int256Vector.MINLanes 1024 1.129 >> Int256Vector.ORLanes 1024 1.078 >> Int256Vector.XORLanes 1024 1.072 >> Long256Vector.ADDLanes 1024 1.059 >> Long256Vector.ANDLanes 1024 1.101 >> Long256Vector.MAXLanes 1024 1.079 >> Long256Vector.MINLanes 1024 1.099 >> Long256Vector.ORLanes 1024 1.098 >> Long256Vector.XORLanes 1024 1.110 >> Float256Vector.ADDLanes 1024 1.033 >> Float256Vector.MAXLanes 1024 1.156 >> Float256Vector.MINLanes 1024 1.151 >> Double256Vector.ADDLanes 1024 1.062 >> Double256Vector.MAXLanes 1024 1.145 >> Double256Vector.MINLanes 1024 1.140 >> >> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. 
So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: >> >> sxtw x14, w14 >> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments I got failure in tier1: test/hotspot/gtest/aarch64/test_assembler_aarch64.cpp:48: Failure Expected equality of these values: insns[i] Which is: 624694305 insns1[i] Which is: 624690209 Ours: [MachCode] 0x0000fffca812c698: 2104 3c25 [/MachCode] Theirs: [MachCode] 0x0000fffc9599d188: 2114 3c25 [/MachCode] ------------- PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.org Thu Jun 30 02:53:33 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 30 Jun 2022 02:53:33 GMT Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations In-Reply-To: <6JoVanlMpB_1LAukbkHpvzyqO-ZALtXbAQYcoAsH3Dw=.03400a5f-0ec7-4a3f-ad9f-379fe779bf8d@github.com> References: <6JoVanlMpB_1LAukbkHpvzyqO-ZALtXbAQYcoAsH3Dw=.03400a5f-0ec7-4a3f-ad9f-379fe779bf8d@github.com> Message-ID: On Thu, 30 Jun 2022 02:41:43 GMT, Vladimir Kozlov wrote: >> This patch adds the following transformations for vector logic operations such as "`AndV, OrV, XorV`", incuding: >> >> (AndV v (Replicate m1)) => v >> (AndV v (Replicate zero)) => Replicate zero >> (AndV v v) => v >> >> (OrV v (Replicate m1)) => Replicate m1 >> (OrV v (Replicate zero)) => v >> (OrV v v) => v >> >> (XorV v v) => Replicate zero >> >> where "`m1`" is the integer constant -1, together with the same optimizations for vector mask operations like "`AndVMask, OrVMask, XorVMask`". > > src/hotspot/share/opto/vectornode.cpp line 1869: > >> 1867: // (XorV src src) => (Replicate zero) >> 1868: // (XorVMask src src) => (MaskAll zero) >> 1869: // > > Do we really need this? Is `Replicate` asm instruction faster than `XorV`? I understand it may help reduce registers pressure. 
Thanks for looking at this patch! I think the main benefit is that "(Replicate zero)" is loop invariant, so it can be hoisted out of the loop.

-------------

PR: https://git.openjdk.org/jdk/pull/9211

From xgong at openjdk.org Thu Jun 30 02:57:38 2022
From: xgong at openjdk.org (Xiaohong Gong)
Date: Thu, 30 Jun 2022 02:57:38 GMT
Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations
In-Reply-To: <6JoVanlMpB_1LAukbkHpvzyqO-ZALtXbAQYcoAsH3Dw=.03400a5f-0ec7-4a3f-ad9f-379fe779bf8d@github.com>
References: <6JoVanlMpB_1LAukbkHpvzyqO-ZALtXbAQYcoAsH3Dw=.03400a5f-0ec7-4a3f-ad9f-379fe779bf8d@github.com>
Message-ID:

On Thu, 30 Jun 2022 02:30:31 GMT, Vladimir Kozlov wrote:

>> This patch adds the following transformations for vector logic operations such as "`AndV, OrV, XorV`", including:
>>
>> (AndV v (Replicate m1)) => v
>> (AndV v (Replicate zero)) => Replicate zero
>> (AndV v v) => v
>>
>> (OrV v (Replicate m1)) => Replicate m1
>> (OrV v (Replicate zero)) => v
>> (OrV v v) => v
>>
>> (XorV v v) => Replicate zero
>>
>> where "`m1`" is the integer constant -1, together with the same optimizations for vector mask operations like "`AndVMask, OrVMask, XorVMask`".
>
> src/hotspot/share/opto/vectornode.cpp line 1802:
>
>> 1800: // (AndV (Replicate zero) src) => (Replicate zero)
>> 1801: // (AndVMask (MaskAll zero) src) => (MaskAll zero)
>> 1802: if (VectorNode::is_all_zeros_vector(in(1))) {
>
> Why do you expect it to be `in(1)` instead of `in(2)` as in the previous case? Do we create inputs in such order based on mask value?
> At least add a comment explaining it.

The case where the all-zeros vector is `in(2)` is also handled, in the code that follows. Please see line 1817. The two cases are handled differently mainly because of the predicated vector operations in the Vector API.
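[Editorial sketch, not part of the original message.] The identities listed in the PR description can be checked lane by lane with a small C++ model. `Vec`, `replicate`, `and_v`, `or_v` and `xor_v` are hypothetical helpers over eight 32-bit lanes; they only mirror the algebra and are not C2's `VectorNode` code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 8-lane int vector; NOT a HotSpot type.
using Vec = std::array<std::int32_t, 8>;

// Models (Replicate x): every lane holds the same scalar.
Vec replicate(std::int32_t x) {
  Vec v;
  v.fill(x);
  return v;
}

template <typename Op>
Vec lanewise(const Vec& a, const Vec& b, Op op) {
  Vec r{};
  for (std::size_t i = 0; i < r.size(); ++i) r[i] = op(a[i], b[i]);
  return r;
}

Vec and_v(const Vec& a, const Vec& b) {
  return lanewise(a, b, [](std::int32_t x, std::int32_t y) { return x & y; });
}

Vec or_v(const Vec& a, const Vec& b) {
  return lanewise(a, b, [](std::int32_t x, std::int32_t y) { return x | y; });
}

Vec xor_v(const Vec& a, const Vec& b) {
  return lanewise(a, b, [](std::int32_t x, std::int32_t y) { return x ^ y; });
}

// Asserts each identity from the PR description for one input vector.
void check_identities(const Vec& v) {
  const Vec m1 = replicate(-1);  // all bits set
  const Vec zero = replicate(0);
  assert(and_v(v, m1) == v);
  assert(and_v(v, zero) == zero);
  assert(and_v(v, v) == v);
  assert(or_v(v, m1) == m1);
  assert(or_v(v, zero) == v);
  assert(or_v(v, v) == v);
  assert(xor_v(v, v) == zero);
}
```

Calling `check_identities` on any input passes, and the results `zero` and `m1` do not depend on `v`, which matches the loop-invariance argument made above for "(Replicate zero)".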
> src/hotspot/share/opto/vectornode.cpp line 1838: > >> 1836: // (OrV src (Replicate zero)) => src >> 1837: // (OrVMask src (MaskAll zero)) => src >> 1838: if (VectorNode::is_all_zeros_vector(in(2))) { > > The same question as for `AndVNode`. The same as `AndVNode`. ------------- PR: https://git.openjdk.org/jdk/pull/9211 From kvn at openjdk.org Thu Jun 30 03:02:28 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 03:02:28 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v6] In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 06:13:36 GMT, Xiaohong Gong wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. >> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. 
To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. >> >> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. >> >> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: >> >> Benchmark size Gain >> Byte256Vector.ADDLanes 1024 0.999 >> Byte256Vector.ANDLanes 1024 1.065 >> Byte256Vector.MAXLanes 1024 1.064 >> Byte256Vector.MINLanes 1024 1.062 >> Byte256Vector.ORLanes 1024 1.072 >> Byte256Vector.XORLanes 1024 1.041 >> Short256Vector.ADDLanes 1024 1.017 >> Short256Vector.ANDLanes 1024 1.044 >> Short256Vector.MAXLanes 1024 1.049 >> Short256Vector.MINLanes 1024 1.049 >> Short256Vector.ORLanes 1024 1.089 >> Short256Vector.XORLanes 1024 1.047 >> Int256Vector.ADDLanes 1024 1.045 >> Int256Vector.ANDLanes 1024 1.078 >> Int256Vector.MAXLanes 1024 1.123 >> Int256Vector.MINLanes 1024 1.129 >> Int256Vector.ORLanes 1024 1.078 >> Int256Vector.XORLanes 1024 1.072 >> Long256Vector.ADDLanes 1024 1.059 >> Long256Vector.ANDLanes 1024 1.101 >> Long256Vector.MAXLanes 1024 1.079 >> Long256Vector.MINLanes 1024 1.099 >> Long256Vector.ORLanes 1024 1.098 >> Long256Vector.XORLanes 1024 1.110 >> Float256Vector.ADDLanes 1024 1.033 >> Float256Vector.MAXLanes 1024 1.156 >> Float256Vector.MINLanes 1024 1.151 >> Double256Vector.ADDLanes 1024 1.062 >> Double256Vector.MAXLanes 1024 1.145 >> Double256Vector.MINLanes 1024 1.140 >> >> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. 
So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: >> >> sxtw x14, w14 >> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments I put output into RFE comment. ------------- PR: https://git.openjdk.org/jdk/pull/9037 From kvn at openjdk.org Thu Jun 30 03:06:46 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 03:06:46 GMT Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls [v2] In-Reply-To: References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> Message-ID: On Wed, 29 Jun 2022 14:50:59 GMT, Evgeny Astigeevich wrote: >> ## Problem >> Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter. >> >> Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides the address of the stub and the address of its owner. `relocInfo` has an offset which is an offset from the previously known relocatable address. The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset. >> >> Each Java call has: >> - A relocation for a call site. >> - A relocation for a stub to the interpreter. >> - A stub to the interpreter. >> - If far jumps are used (arm64 case): >> - A trampoline relocation. >> - A trampoline. >> >> We cannot avoid creating relocations. They are needed to support patching call sites. >> With shared stubs there will be multiple relocations having the same stub address but different owners' addresses. 
>> If we try to generate relocations as we go there will be a case which requires negative offsets:
>>
>> reloc1 ---> 0x0: stub1
>> reloc2 ---> 0x4: stub2 (reloc2.addr = reloc1.addr + reloc2.offset = 0x0 + 4)
>> reloc3 ---> 0x0: stub1 (reloc3.addr = reloc2.addr + reloc3.offset = 0x4 - 4)
>>
>>
>> `CodeSection` does not support negative offsets. It [assumes](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L195) that the addresses relocations point at grow upward.
>> Negative offsets reduce the offset range by half. This can increase the number of filler records, the empty `relocInfo` records used to reduce offset values. Also, negative offsets are only needed for `static_stub_type`; the other 13 types don't need them.
>>
>> ## Solution
>> In this PR creation of stubs is done in two stages. First we collect requests for creating shared stubs: a callee `ciMethod*` and an offset of a call in `CodeBuffer` (see [src/hotspot/share/asm/codeBuffer.hpp](https://github.com/openjdk/jdk/pull/8816/files#diff-deb8ab083311ba60c0016dc34d6518579bbee4683c81e8d348982bac897fe8ae)). Then we have the finalisation phase (see [src/hotspot/share/ci/ciEnv.cpp](https://github.com/openjdk/jdk/pull/8816/files#diff-7c032de54e85754d39e080fd24d49b7469543b163f54229eb0631c6b1bf26450)), where `CodeBuffer::finalize_stubs()` creates shared stubs in `CodeBuffer`: a stub and multiple relocations sharing it. The first relocation will have a positive offset. The rest will have zero offsets. This approach does not need negative offsets. As creation of relocations and stubs is platform dependent, `CodeBuffer::finalize_stubs()` calls `CodeBuffer::pd_finalize_stubs()` where platforms should put their code.
>>
>> This PR provides implementations for x86, x86_64 and aarch64.
[src/hotspot/share/asm/codeBuffer.inline.hpp](https://github.com/openjdk/jdk/pull/8816/files#diff-c268e3719578f2980edaa27c0eacbe9f620124310108eb65d0f765212c7042eb) provides the `emit_shared_stubs_to_interp` template which x86, x86_64 and aarch64 platforms use. Other platforms can use it too. Platforms supporting shared stubs to the interpreter must have `CodeBuffer::supports_shared_stubs()` returning `true`. >> >> ## Results >> **Results from [Renaissance 0.14.0](https://github.com/renaissance-benchmarks/renaissance/releases/tag/v0.14.0)** >> Note: 'Nmethods with shared stubs' is the total number of nmethods counted during benchmark's run. 'Final # of nmethods' is a number of nmethods in CodeCache when JVM exited. >> - AArch64 >> >> +------------------+-------------+----------------------------+---------------------+ >> | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | >> +------------------+-------------+----------------------------+---------------------+ >> | dotty | 820544 | 4592 | 18872 | >> | dec-tree | 405280 | 2580 | 22335 | >> | naive-bayes | 392384 | 2586 | 21184 | >> | log-regression | 362208 | 2450 | 20325 | >> | als | 306048 | 2226 | 18161 | >> | finagle-chirper | 262304 | 2087 | 12675 | >> | movie-lens | 250112 | 1937 | 13617 | >> | gauss-mix | 173792 | 1262 | 10304 | >> | finagle-http | 164320 | 1392 | 11269 | >> | page-rank | 155424 | 1175 | 10330 | >> | chi-square | 140384 | 1028 | 9480 | >> | akka-uct | 115136 | 541 | 3941 | >> | reactors | 43264 | 335 | 2503 | >> | scala-stm-bench7 | 42656 | 326 | 3310 | >> | philosophers | 36576 | 256 | 2902 | >> | scala-doku | 35008 | 231 | 2695 | >> | rx-scrabble | 32416 | 273 | 2789 | >> | future-genetic | 29408 | 260 | 2339 | >> | scrabble | 27968 | 225 | 2477 | >> | par-mnemonics | 19584 | 168 | 1689 | >> | fj-kmeans | 19296 | 156 | 1647 | >> | scala-kmeans | 18080 | 140 | 1629 | >> | mnemonics | 17408 | 143 | 1512 | >> 
+------------------+-------------+----------------------------+---------------------+ >> >> - X86_64 >> >> +------------------+-------------+----------------------------+---------------------+ >> | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | >> +------------------+-------------+----------------------------+---------------------+ >> | dotty | 337065 | 4403 | 19135 | >> | dec-tree | 183045 | 2559 | 22071 | >> | naive-bayes | 176460 | 2450 | 19782 | >> | log-regression | 162555 | 2410 | 20648 | >> | als | 121275 | 1980 | 17179 | >> | movie-lens | 111915 | 1842 | 13020 | >> | finagle-chirper | 106350 | 1947 | 12726 | >> | gauss-mix | 81975 | 1251 | 10474 | >> | finagle-http | 80895 | 1523 | 12294 | >> | page-rank | 68940 | 1146 | 10124 | >> | chi-square | 62130 | 974 | 9315 | >> | akka-uct | 50220 | 555 | 4263 | >> | reactors | 23385 | 371 | 2544 | >> | philosophers | 17625 | 259 | 2865 | >> | scala-stm-bench7 | 17235 | 295 | 3230 | >> | scala-doku | 15600 | 214 | 2698 | >> | rx-scrabble | 14190 | 262 | 2770 | >> | future-genetic | 13155 | 253 | 2318 | >> | scrabble | 12300 | 217 | 2352 | >> | fj-kmeans | 8985 | 157 | 1616 | >> | par-mnemonics | 8535 | 155 | 1684 | >> | scala-kmeans | 8250 | 138 | 1624 | >> | mnemonics | 7485 | 134 | 1522 | >> +------------------+-------------+----------------------------+---------------------+ >> >> >> **Testing: fastdebug and release builds for x86, x86_64 and aarch64** >> - `tier1`...`tier4`: Passed >> - `hotspot/jtreg/compiler/sharedstubs`: Passed > > Evgeny Astigeevich has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 20 additional commits since the last revision: > > - Merge branch 'master' into JDK-8280481C > - Use call offset instead of caller pc > - Simplify test > - Fix x86 build failures > - Remove UseSharedStubs and clarify shared stub use cases > - Make SharedStubToInterpRequest ResourceObj and set initial size of SharedStubToInterpRequests to 8 > - Update copyright year and add Unimplemented guards > - Set UseSharedStubs to true for X86 > - Set UseSharedStubs to true for AArch64 > - Fix x86 build failure > - ... and 10 more: https://git.openjdk.org/jdk/compare/e09644d9...da3bfb5b Testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/8816 From xgong at openjdk.org Thu Jun 30 03:09:46 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 30 Jun 2022 03:09:46 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v6] In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 02:46:56 GMT, Vladimir Kozlov wrote: > I got failure in tier1: > > ``` > test/hotspot/gtest/aarch64/test_assembler_aarch64.cpp:48: Failure > Expected equality of these values: > insns[i] > Which is: 624694305 > insns1[i] > Which is: 624690209 > Ours: > [MachCode] > 0x0000fffca812c698: 2104 3c25 > [/MachCode] > Theirs: > [MachCode] > 0x0000fffc9599d188: 2114 3c25 > [/MachCode] > ``` Thanks for pointing out this! My fault. I called wrong assembler function in the new added tests. I will fix it soon! 
------------- PR: https://git.openjdk.org/jdk/pull/9037 From kvn at openjdk.org Thu Jun 30 03:11:38 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 03:11:38 GMT Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations In-Reply-To: References: <6JoVanlMpB_1LAukbkHpvzyqO-ZALtXbAQYcoAsH3Dw=.03400a5f-0ec7-4a3f-ad9f-379fe779bf8d@github.com> Message-ID: On Thu, 30 Jun 2022 02:48:50 GMT, Xiaohong Gong wrote: >> src/hotspot/share/opto/vectornode.cpp line 1869: >> >>> 1867: // (XorV src src) => (Replicate zero) >>> 1868: // (XorVMask src src) => (MaskAll zero) >>> 1869: // >> >> Do we really need this? Is `Replicate` asm instruction faster than `XorV`? I understand it may help reduce registers pressure. > > Thanks for looking at this patch! I think the main benefit is "(Replicate zero)" is loop invariant which could be hoist outside of the loop. okay ------------- PR: https://git.openjdk.org/jdk/pull/9211 From kvn at openjdk.org Thu Jun 30 03:19:40 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 03:19:40 GMT Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations In-Reply-To: References: <6JoVanlMpB_1LAukbkHpvzyqO-ZALtXbAQYcoAsH3Dw=.03400a5f-0ec7-4a3f-ad9f-379fe779bf8d@github.com> Message-ID: On Thu, 30 Jun 2022 02:54:02 GMT, Xiaohong Gong wrote: >> src/hotspot/share/opto/vectornode.cpp line 1802: >> >>> 1800: // (AndV (Replicate zero) src) => (Replicate zero) >>> 1801: // (AndVMask (MaskAll zero) src) => (MaskAll zero) >>> 1802: if (VectorNode::is_all_zeros_vector(in(1))) { >> >> Why you expect it to be `in(1)` instead of `in(2)` as in previous case? Do we create inputs in such order based on mask value? >> At least add comment explaining it. > > We also have the case that the all zeros vector is `in(2)` in the followed codes. Please see line 1817. 
The main reason to do the different handle is the consideration for the predicated vector operations in Vector API. That is what confuses me. The comment there says: `masked operation requires the unmasked lanes to save the same values in the first operand`. I'm interpreting it as mask should be `in(2)`: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L379 But here you check `in(1)`. ------------- PR: https://git.openjdk.org/jdk/pull/9211 From xgong at openjdk.org Thu Jun 30 03:33:44 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 30 Jun 2022 03:33:44 GMT Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations In-Reply-To: References: <6JoVanlMpB_1LAukbkHpvzyqO-ZALtXbAQYcoAsH3Dw=.03400a5f-0ec7-4a3f-ad9f-379fe779bf8d@github.com> Message-ID: On Thu, 30 Jun 2022 03:17:34 GMT, Vladimir Kozlov wrote: >> We also have the case that the all zeros vector is `in(2)` in the followed codes. Please see line 1817. The main reason to do the different handle is the consideration for the predicated vector operations in Vector API. > > That is what confuses me. The comment there says: `masked operation requires the unmasked lanes to save the same values in the first operand`. > > I'm interpreting it as mask should be `in(2)`: > https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L379 > > But here you check `in(1)`. The comment means: for masked operations, the result of non-masked lanes should be from `in(1)`, and the masked lanes are from the operation results. For `"AndV"` with zero, the results is zero. So if `in(1)` is all zeros vector which is the expected result, no matter whether the `AndV` is masked or not, the result is right (i.e. for masked `AndV`, the non-masked lanes should be from `in(1)`, and the masked lanes should be from the operation result which is also `in(1)`). 
But if the all zeros vector is `in(2)`, this transformation would make the result incorrect for masked `AndV`. That's why I added the `!is_predicated_vector()` restriction for the second case but not for the first one. ------------- PR: https://git.openjdk.org/jdk/pull/9211 From xliu at openjdk.org Thu Jun 30 04:03:43 2022 From: xliu at openjdk.org (Xin Liu) Date: Thu, 30 Jun 2022 04:03:43 GMT Subject: Integrated: 8286104: use aggressive liveness for unstable_if traps In-Reply-To: References: Message-ID: On Thu, 5 May 2022 05:30:06 GMT, Xin Liu wrote: > I found that some phi nodes are useful because their uses are uncommon_trap nodes. In the worst case, this would hinder boxing object elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows the c2 parser to collect liveness based on the next bci for unstable_if traps. In most cases, the next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap (JDK-8261137). The two patches are orthogonal. I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of the credit goes to the original author. I found that scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in the original test even if it can't be removed by the later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got a 9x speedup on x86_64. It's almost the same as JDK-8261137. Besides runtime, the codecache utilization reduces from 1648 bytes to 1192 bytes, or 27.6%. > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ±
0.075 us/op > > Compiled method (c2) 281 636 4 MyBenchmark::testMethod (50 bytes) > total in heap [0x00007fa1e49ab510,0x00007fa1e49abb80] = 1648 > relocation [0x00007fa1e49ab670,0x00007fa1e49ab6b0] = 64 > main code [0x00007fa1e49ab6c0,0x00007fa1e49ab940] = 640 > stub code [0x00007fa1e49ab940,0x00007fa1e49ab968] = 40 > oops [0x00007fa1e49ab968,0x00007fa1e49ab978] = 16 > metadata [0x00007fa1e49ab978,0x00007fa1e49ab990] = 24 > scopes data [0x00007fa1e49ab990,0x00007fa1e49aba60] = 208 > scopes pcs [0x00007fa1e49aba60,0x00007fa1e49abb30] = 208 > dependencies [0x00007fa1e49abb30,0x00007fa1e49abb38] = 8 > handler table [0x00007fa1e49abb38,0x00007fa1e49abb68] = 48 > nul chk table [0x00007fa1e49abb68,0x00007fa1e49abb80] = 24 > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op > > Compiled method (c2) 288 633 4 MyBenchmark::testMethod (50 bytes) > total in heap [0x00007f35189ab010,0x00007f35189ab4b8] = 1192 > relocation [0x00007f35189ab170,0x00007f35189ab1a0] = 48 > main code [0x00007f35189ab1a0,0x00007f35189ab360] = 448 > stub code [0x00007f35189ab360,0x00007f35189ab388] = 40 > oops [0x00007f35189ab388,0x00007f35189ab390] = 8 > metadata [0x00007f35189ab390,0x00007f35189ab398] = 8 > scopes data [0x00007f35189ab398,0x00007f35189ab408] = 112 > scopes pcs [0x00007f35189ab408,0x00007f35189ab488] = 128 > dependencies [0x00007f35189ab488,0x00007f35189ab490] = 8 > handler table [0x00007f35189ab490,0x00007f35189ab4a8] = 24 > nul chk table [0x00007f35189ab4a8,0x00007f35189ab4b8] = 16 > ``` > > Testing > I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. This pull request has now been integrated.
Changeset: 31e50f2c Author: Xin Liu URL: https://git.openjdk.org/jdk/commit/31e50f2c7642b046dc9ea1de8ec245dcbc4e1926 Stats: 340 lines in 14 files changed: 330 ins; 0 del; 10 mod 8286104: use aggressive liveness for unstable_if traps Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/8545 From kvn at openjdk.org Thu Jun 30 04:52:38 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 04:52:38 GMT Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations In-Reply-To: References: Message-ID: <0dJTinc9Yq7tC7FaGHGLK7TxJbMVHrBi2jibc5NMf3I=.1571739c-fd32-4ee8-b1fe-6287131320a0@github.com> On Mon, 20 Jun 2022 07:50:09 GMT, Xiaohong Gong wrote: > This patch adds the following transformations for vector logic operations such as "`AndV, OrV, XorV`", including: > > (AndV v (Replicate m1)) => v > (AndV v (Replicate zero)) => Replicate zero > (AndV v v) => v > > (OrV v (Replicate m1)) => Replicate m1 > (OrV v (Replicate zero)) => v > (OrV v v) => v > > (XorV v v) => Replicate zero > > where "`m1`" is the integer constant -1, together with the same optimizations for vector mask operations like "`AndVMask, OrVMask, XorVMask`". @jatin-bhateja can you take a look at this? It is shared code and x86 is also affected.
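The identities listed in the quoted description can be checked lane by lane. The following is only a sketch in plain Java, assuming a vector can be modelled as an `int[]` with one element per lane; `LaneIdentities` and its helpers are hypothetical names, not HotSpot or Vector API code, and the unpredicated case is the only one modelled.

```java
import java.util.Arrays;

// Hypothetical lane-level model: each int[] stands for one vector node's value.
final class LaneIdentities {
    // (Replicate c): every lane holds the same scalar.
    static int[] replicate(int c, int lanes) {
        int[] v = new int[lanes];
        Arrays.fill(v, c);
        return v;
    }

    static int[] andV(int[] a, int[] b) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) r[i] = a[i] & b[i];
        return r;
    }

    static int[] orV(int[] a, int[] b) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) r[i] = a[i] | b[i];
        return r;
    }

    static int[] xorV(int[] a, int[] b) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) r[i] = a[i] ^ b[i];
        return r;
    }

    public static void main(String[] args) {
        int[] v = {7, -3, 0, 42};
        int[] m1 = replicate(-1, v.length);
        int[] zero = replicate(0, v.length);

        System.out.println(Arrays.equals(andV(v, m1), v));      // (AndV v (Replicate m1)) => v
        System.out.println(Arrays.equals(andV(v, zero), zero)); // (AndV v (Replicate zero)) => Replicate zero
        System.out.println(Arrays.equals(andV(v, v), v));       // (AndV v v) => v
        System.out.println(Arrays.equals(orV(v, m1), m1));      // (OrV v (Replicate m1)) => Replicate m1
        System.out.println(Arrays.equals(orV(v, zero), v));     // (OrV v (Replicate zero)) => v
        System.out.println(Arrays.equals(orV(v, v), v));        // (OrV v v) => v
        System.out.println(Arrays.equals(xorV(v, v), zero));    // (XorV v v) => Replicate zero
    }
}
```

Every line prints `true`. Predicated (masked) operations need the extra care discussed earlier in the thread and are deliberately left out of this model.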
------------- PR: https://git.openjdk.org/jdk/pull/9211 From kvn at openjdk.org Thu Jun 30 04:52:39 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 04:52:39 GMT Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations In-Reply-To: References: <6JoVanlMpB_1LAukbkHpvzyqO-ZALtXbAQYcoAsH3Dw=.03400a5f-0ec7-4a3f-ad9f-379fe779bf8d@github.com> Message-ID: <7avSd4-x1annRZSfmq6KDF268xMJbThtdpj4gKSYkAk=.c923562d-60a0-4c15-a889-745178e439e1@github.com> On Thu, 30 Jun 2022 03:29:55 GMT, Xiaohong Gong wrote: > The comment means: for masked operations, the result of non-masked lanes should be from `in(1)`, and the masked lanes are from the operation results. Okay, I got it finally. Thank you for explanation. ------------- PR: https://git.openjdk.org/jdk/pull/9211 From kvn at openjdk.org Thu Jun 30 04:54:40 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 04:54:40 GMT Subject: RFR: 8287984: AArch64: [vector] Make all bits set vector sharable for match rules In-Reply-To: References: Message-ID: On Mon, 27 Jun 2022 01:37:03 GMT, Xiaohong Gong wrote: > We have the optimized rules for vector not/and_not in NEON and SVE, like: > > > match(Set dst (XorV src (ReplicateB m1))) ; vector not > match(Set dst (AndV src1 (XorV src2 (ReplicateB m1)))) ; vector and_not > > > where "`m1`" is a ConI node with value -1. And we also have the similar rules for vector mask in SVE like: > > > match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1)))) ; mask and_not > > > These rules are not easy to be matched since the "`Replicate`" or "`MaskAll`" node is usually not single used for the `not/and_not` operation. To make these rules be matched as expected, this patch adds the vector (mask) "`not`" pattern to `Matcher::pd_clone_node()` which makes the all bits set vector `(Replicate/MaskAll)` sharable during matching rules. My testing passed. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9292 From kvn at openjdk.org Thu Jun 30 04:58:39 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 04:58:39 GMT Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations In-Reply-To: References: Message-ID: On Mon, 20 Jun 2022 07:50:09 GMT, Xiaohong Gong wrote: > This patch adds the following transformations for vector logic operations such as "`AndV, OrV, XorV`", including: > > (AndV v (Replicate m1)) => v > (AndV v (Replicate zero)) => Replicate zero > (AndV v v) => v > > (OrV v (Replicate m1)) => Replicate m1 > (OrV v (Replicate zero)) => v > (OrV v v) => v > > (XorV v v) => Replicate zero > > where "`m1`" is the integer constant -1, together with the same optimizations for vector mask operations like "`AndVMask, OrVMask, XorVMask`". Tobias already ran testing and the results are good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9211 From thartmann at openjdk.org Thu Jun 30 05:07:27 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 30 Jun 2022 05:07:27 GMT Subject: [jdk19] RFR: 8284358: Unreachable loop is not removed from C2 IR, leading to a broken graph In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 18:50:29 GMT, Vladimir Kozlov wrote: >> Similar to https://github.com/openjdk/jdk/pull/425 and https://github.com/openjdk/jdk/pull/649, entry control to a loop `RegionNode` dies right after parsing (during first IGVN) but the dead loop is not detected/removed. This dead loop then keeps a subgraph alive, which leads to two different failures in later optimization phases that are described below. >> >> I assumed that such dead loops should always be detected, but to avoid a full reachability analysis (graph walk to root), C2 only detects and removes "unsafe" dead loops, i.e., dead loops that might cause issues for later optimization phases and should therefore be aggressively removed.
See `RegionNode::Ideal` -> `RegionNode::is_unreachable_region` -> `RegionNode::is_possible_unsafe_loop`: >> >> https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L541-L549 >> >> https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L327-L331 >> >> Here is a detailed description of the two failures and the corresponding fixes: >> >> 1) `No reachable node should have no use` assert at the end of optimizations (introduced by [JDK-8263577](https://bugs.openjdk.org/browse/JDK-8263577)): >> >> At the beginning of CCP, the types of all nodes are initialized to `top`. Since the following subgraph is not reachable from root due to a dead loop above in the CFG, the types of all unreachable nodes remain top: >> ![1_BeforeCCP](https://user-images.githubusercontent.com/5312595/176446327-e6fdee4d-49ea-4406-9b15-b29366cd9f55.png) >> >> The `Rethrow`, `Phis` and `Region` are removed during IGVN because they are `top` but the `292 CatchProj` remains: >> >> ![3_BarrierExpand](https://user-images.githubusercontent.com/5312595/176446385-0374b6ba-7c0b-447d-90f9-c73e3aee4918.png) >> >> We then hit the assert because the `CatchProj` has no user. Similar to how https://github.com/openjdk/jdk/pull/3012 was fixed, we need to make sure that when `RegionNode` inputs are cut off because their types are `top`, they are added to the IGVN worklist (see change in `cfgnode.cpp:504`). With that, the entire dead subgraph is removed. 
>> >> 2) `Unknown node on this path` assert while walking the memory graph during scalar replacement: >> >> After parsing, the `167 Region` that belongs to a loop loses entry control (marked in red): >> ![2_Diff_Parsing_IGVN](https://user-images.githubusercontent.com/5312595/176453465-95f48c16-6cb7-4373-baa8-edf5e4fbcde2.png) >> >> The dead loop is not detected/removed because it's not considered "unsafe" since the Phis of the dying Region only have a Call user which is considered safe: >> >> https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L352-L355 >> >> ![DyingRegion](https://user-images.githubusercontent.com/5312595/176469880-f81a7d7e-b769-444a-bf5b-14f8cca1f9af.png) >> >> The same can happen with other CFG users (for example, MemBars or Allocates). These scenarios are also covered by the regression test. Later during IGVN, `309 Region` which is part of the now dead subgraph is processed and found to be potentially "unsafe" and unreachable from root: >> >> ![1_AfterParsing](https://user-images.githubusercontent.com/5312595/176453110-8a4a587f-f1ef-45bf-8a68-e476f142aa7e.png) >> >> It's then removed together with its Phi users, leaving `505 MergeMem` with a top memory input: >> >> ![3_MacroExpansion](https://user-images.githubusercontent.com/5312595/176461343-ab446fe0-04a8-48a5-95c2-c8ead6c872cf.png) >> >> We then hit the assert when encountering a top memory input while walking the memory graph during scalar replacement. >> >> The root cause of the failure is an only partially removed dead subgraph. A similar issue has been fixed long ago by [JDK-8075922](https://bugs.openjdk.org/browse/JDK-8075922), but the fix is incomplete. I propose to aggressively remove such dead subgraphs by walking up the CFG when detecting an unreachable Region belonging to an "unsafe" loop and replacing all nodes by `top`. >> >> Special thanks to Christian Hagedorn for helping me with finding a regression test. 
>> >> Thanks, >> Tobias > > src/hotspot/share/opto/cfgnode.cpp line 504: > >> 502: } >> 503: if( phase->type(n) == Type::TOP ) { >> 504: set_req_X(i, NULL, phase); // Ignore TOP inputs > > This is not guarded by `can_reshape` (call from IGVN). It is not correct to use set_req_X() during parsing. Since [JDK-8263577](https://bugs.openjdk.org/browse/JDK-8263577), there are two versions of `Node::set_req_X`. The new one takes a `PhaseGVN` and falls back to a regular `set_req`: https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/phaseX.cpp#L2159-L2165 ------------- PR: https://git.openjdk.org/jdk19/pull/92 From jbhateja at openjdk.org Thu Jun 30 05:21:31 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 30 Jun 2022 05:21:31 GMT Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations In-Reply-To: References: Message-ID: On Mon, 20 Jun 2022 07:50:09 GMT, Xiaohong Gong wrote: > This patch adds the following transformations for vector logic operations such as "`AndV, OrV, XorV`", including: > > (AndV v (Replicate m1)) => v > (AndV v (Replicate zero)) => Replicate zero > (AndV v v) => v > > (OrV v (Replicate m1)) => Replicate m1 > (OrV v (Replicate zero)) => v > (OrV v v) => v > > (XorV v v) => Replicate zero > > where "`m1`" is the integer constant -1, together with the same optimizations for vector mask operations like "`AndVMask, OrVMask, XorVMask`". We can also handle the following vector constant folding cases; such patterns may be generated in the graph as a result of your newly added transforms: AndV (Replicate Const1) (Replicate Const2) => Replicate (Const1 And Const2) OrV (Replicate Const1) (Replicate Const2) => Replicate (Const1 Or Const2) Other than the above, the patch looks good to me.
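The constant folding cases suggested above rest on the fact that a lane-wise operation on two replicated constants equals the replication of the scalar result. A minimal sketch in plain Java, assuming vectors are modelled as `int[]`; `ReplicateFolding` and its helpers are hypothetical names used only for illustration:

```java
import java.util.Arrays;

// Hypothetical lane-level model of folding two Replicate inputs into one.
final class ReplicateFolding {
    // (Replicate c): every lane holds the same scalar.
    static int[] replicate(int c, int lanes) {
        int[] v = new int[lanes];
        Arrays.fill(v, c);
        return v;
    }

    static int[] andV(int[] a, int[] b) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) r[i] = a[i] & b[i];
        return r;
    }

    static int[] orV(int[] a, int[] b) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) r[i] = a[i] | b[i];
        return r;
    }

    public static void main(String[] args) {
        int lanes = 8;
        int c1 = 0b1100;
        int c2 = 0b1010;

        // AndV (Replicate c1) (Replicate c2) == Replicate (c1 & c2)
        System.out.println(Arrays.equals(
                andV(replicate(c1, lanes), replicate(c2, lanes)),
                replicate(c1 & c2, lanes)));

        // OrV (Replicate c1) (Replicate c2) == Replicate (c1 | c2)
        System.out.println(Arrays.equals(
                orV(replicate(c1, lanes), replicate(c2, lanes)),
                replicate(c1 | c2, lanes)));
    }
}
```

Both comparisons print `true`; the folded form needs one scalar operation instead of one per lane.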
------------- PR: https://git.openjdk.org/jdk/pull/9211 From kvn at openjdk.org Thu Jun 30 06:23:32 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 06:23:32 GMT Subject: [jdk19] RFR: 8284358: Unreachable loop is not removed from C2 IR, leading to a broken graph In-Reply-To: References: Message-ID: <20AYxov0YOWMYbnH0Q04xeq-8yHCD88Os_MCd1sMiaA=.47eb80ba-0f48-4e01-ba63-3dc477387fed@github.com> On Wed, 29 Jun 2022 15:16:03 GMT, Tobias Hartmann wrote: > Similar to https://github.com/openjdk/jdk/pull/425 and https://github.com/openjdk/jdk/pull/649, entry control to a loop `RegionNode` dies right after parsing (during first IGVN) but the dead loop is not detected/removed. > > [...] Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk19/pull/92 From kvn at openjdk.org Thu Jun 30 06:23:35 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 06:23:35 GMT Subject: [jdk19] RFR: 8284358: Unreachable loop is not removed from C2 IR, leading to a broken graph In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 05:03:57 GMT, Tobias Hartmann wrote: >> src/hotspot/share/opto/cfgnode.cpp line 504: >> >>> 502: } >>> 503: if( phase->type(n) == Type::TOP ) { >>> 504: set_req_X(i, NULL, phase); // Ignore TOP inputs >> >> This is not guarded by `can_reshape` (call from IGVN). It is not correct to use set_req_X() during parsing. > > Since [JDK-8263577](https://bugs.openjdk.org/browse/JDK-8263577), there are two versions of `Node::set_req_X`.
The new one takes a `PhaseGVN` and falls back to a regular `set_req`: > https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/phaseX.cpp#L2159-L2165 Got it. ------------- PR: https://git.openjdk.org/jdk19/pull/92 From xgong at openjdk.org Thu Jun 30 06:31:57 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 30 Jun 2022 06:31:57 GMT Subject: RFR: 8287984: AArch64: [vector] Make all bits set vector sharable for match rules In-Reply-To: References: Message-ID: On Mon, 27 Jun 2022 01:37:03 GMT, Xiaohong Gong wrote: > We have the optimized rules for vector not/and_not in NEON and SVE, like: > > > match(Set dst (XorV src (ReplicateB m1))) ; vector not > match(Set dst (AndV src1 (XorV src2 (ReplicateB m1)))) ; vector and_not > > > where "`m1`" is a ConI node with value -1. And we also have similar rules for vector mask in SVE like: > > > match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1)))) ; mask and_not > > > These rules are not easy to match, since the "`Replicate`" or "`MaskAll`" node is usually not used solely by the `not/and_not` operation. To make these rules match as expected, this patch adds the vector (mask) "`not`" pattern to `Matcher::pd_clone_node()`, which makes the all bits set vector `(Replicate/MaskAll)` sharable when matching rules. Thanks a lot for your tests and review! @nick-arm @nsjian, may I get a review from your side? Thanks a lot for your time!
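The shape of the `not` and `and_not` patterns quoted above can be illustrated at lane level. This is a sketch in plain Java under the assumption that a vector is an `int[]`; `AndNotPattern` and its helpers are hypothetical names and say nothing about how the matcher itself works. It only shows that one all-ones vector naturally feeds several operations, which is the multi-use situation the message describes.

```java
import java.util.Arrays;

// Hypothetical lane-level model of the not / and_not patterns.
final class AndNotPattern {
    static int[] replicate(int c, int lanes) {
        int[] v = new int[lanes];
        Arrays.fill(v, c);
        return v;
    }

    static int[] xorV(int[] a, int[] b) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) r[i] = a[i] ^ b[i];
        return r;
    }

    static int[] andV(int[] a, int[] b) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) r[i] = a[i] & b[i];
        return r;
    }

    // vector not: (XorV src (Replicate m1))
    static int[] not(int[] src, int[] allOnes) {
        return xorV(src, allOnes);
    }

    // vector and_not: (AndV src1 (XorV src2 (Replicate m1)))
    static int[] andNot(int[] src1, int[] src2, int[] allOnes) {
        return andV(src1, xorV(src2, allOnes));
    }

    public static void main(String[] args) {
        int[] a = {0b1111, 0b1010, -1, 0};
        int[] b = {0b0101, 0b1010, 7, 9};
        // One all-ones vector, used by two different operations.
        int[] allOnes = replicate(-1, a.length);

        int[] expectedNot = new int[a.length];
        int[] expectedAndNot = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            expectedNot[i] = ~a[i];
            expectedAndNot[i] = a[i] & ~b[i];
        }
        System.out.println(Arrays.equals(not(a, allOnes), expectedNot));
        System.out.println(Arrays.equals(andNot(a, b, allOnes), expectedAndNot));
    }
}
```

Both lines print `true`, since `x ^ -1` equals `~x` for Java `int` values.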
------------- PR: https://git.openjdk.org/jdk/pull/9292 From thartmann at openjdk.org Thu Jun 30 06:43:38 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 30 Jun 2022 06:43:38 GMT Subject: [jdk19] RFR: 8284358: Unreachable loop is not removed from C2 IR, leading to a broken graph In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 15:16:03 GMT, Tobias Hartmann wrote: > Similar to https://github.com/openjdk/jdk/pull/425 and https://github.com/openjdk/jdk/pull/649, entry control to a loop `RegionNode` dies right after parsing (during first IGVN) but the dead loop is not detected/removed. > > [...] Thanks for the review, Vladimir! ------------- PR: https://git.openjdk.org/jdk19/pull/92 From xgong at openjdk.org Thu Jun 30 07:06:40 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 30 Jun 2022 07:06:40 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v7] In-Reply-To: References: Message-ID: > VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct.
> > For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. > > Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. > > Here is an example for vector load and add reduction inside a loop: > > ptrue p0.s, vl8 ; mask generation > ld1w {z16.s}, p0/z, [x14] ; load vector > > ptrue p0.s, vl8 ; mask generation > uaddv d17, p0, z16.s ; add reduction > smov x14, v17.s[0] > > As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. > > Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. 
> > Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: > > Benchmark size Gain > Byte256Vector.ADDLanes 1024 0.999 > Byte256Vector.ANDLanes 1024 1.065 > Byte256Vector.MAXLanes 1024 1.064 > Byte256Vector.MINLanes 1024 1.062 > Byte256Vector.ORLanes 1024 1.072 > Byte256Vector.XORLanes 1024 1.041 > Short256Vector.ADDLanes 1024 1.017 > Short256Vector.ANDLanes 1024 1.044 > Short256Vector.MAXLanes 1024 1.049 > Short256Vector.MINLanes 1024 1.049 > Short256Vector.ORLanes 1024 1.089 > Short256Vector.XORLanes 1024 1.047 > Int256Vector.ADDLanes 1024 1.045 > Int256Vector.ANDLanes 1024 1.078 > Int256Vector.MAXLanes 1024 1.123 > Int256Vector.MINLanes 1024 1.129 > Int256Vector.ORLanes 1024 1.078 > Int256Vector.XORLanes 1024 1.072 > Long256Vector.ADDLanes 1024 1.059 > Long256Vector.ANDLanes 1024 1.101 > Long256Vector.MAXLanes 1024 1.079 > Long256Vector.MINLanes 1024 1.099 > Long256Vector.ORLanes 1024 1.098 > Long256Vector.XORLanes 1024 1.110 > Float256Vector.ADDLanes 1024 1.033 > Float256Vector.MAXLanes 1024 1.156 > Float256Vector.MINLanes 1024 1.151 > Double256Vector.ADDLanes 1024 1.062 > Double256Vector.MAXLanes 1024 1.145 > Double256Vector.MINLanes 1024 1.140 > > This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. 
So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: > > sxtw x14, w14 > whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Fix assembler test issue ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9037/files - new: https://git.openjdk.org/jdk/pull/9037/files/04a2d827..061a19fb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9037&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9037&range=05-06 Stats: 8 lines in 2 files changed: 0 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/9037.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9037/head:pull/9037 PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.org Thu Jun 30 07:09:13 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 30 Jun 2022 07:09:13 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v6] In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 02:59:26 GMT, Vladimir Kozlov wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Address review comments > > I put output into RFE comment. Hi @vnkozlov , the fix is pushed to this PR. Would you mind running the test again? Thanks so much! 
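Archive note: a minimal sketch of the partial-vector mask idea from the PR description quoted above, assuming a 512-bit register of 32-bit lanes (16 lanes) and a 256-bit species (8 active lanes). The class and method names are hypothetical and only model the behaviour on plain arrays; this is not HotSpot or Vector API code.

```java
import java.util.Arrays;

/** Hypothetical model of an SVE-style governing predicate; not HotSpot code. */
final class PartialVectorMask {

    /**
     * Models "whilelo p, zr, n": lane i is active iff i < n.
     * Assumption: n is non-negative, so a signed compare matches the unsigned one.
     */
    static boolean[] whileLo(int hardwareLanes, int activeLanes) {
        boolean[] mask = new boolean[hardwareLanes];
        for (int i = 0; i < hardwareLanes; i++) {
            mask[i] = i < activeLanes;
        }
        return mask;
    }

    /** Add reduction that only reads the lanes enabled by the mask. */
    static int maskedAddReduce(int[] lanes, boolean[] mask) {
        int sum = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                sum += lanes[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // 16 hardware lanes, of which a 256-bit int species uses the low 8.
        boolean[] mask = whileLo(16, 8);
        int[] register = new int[16];
        Arrays.fill(register, 1);
        // The mask depends only on the vector type, so one copy can serve
        // every partial operation of that type in the loop.
        System.out.println(maskedAddReduce(register, mask)); // prints 8
    }
}
```

The point of the model is that the mask is a function of the lane counts alone, which is why it can be computed once and shared instead of being regenerated per operation.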
-------------

PR: https://git.openjdk.org/jdk/pull/9037

From tholenstein at openjdk.org Thu Jun 30 07:17:59 2022
From: tholenstein at openjdk.org (Tobias Holenstein)
Date: Thu, 30 Jun 2022 07:17:59 GMT
Subject: RFR: JDK-8287094: IGV: show node input numbers in edge tooltips [v4]
In-Reply-To:
References:
Message-ID:

On Mon, 27 Jun 2022 11:21:26 GMT, Christian Hagedorn wrote:

>> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Update src/utils/IdealGraphVisualizer/Graph/src/main/java/com/sun/hotspot/igv/graph/FigureConnection.java
>>
>> String concatenation
>>
>> Co-authored-by: Christian Hagedorn
>
> Marked as reviewed by chagedorn (Reviewer).

@chhagedorn and @TobiHartmann thanks for the reviews!

-------------

PR: https://git.openjdk.org/jdk/pull/9273

From tholenstein at openjdk.org Thu Jun 30 07:18:01 2022
From: tholenstein at openjdk.org (Tobias Holenstein)
Date: Thu, 30 Jun 2022 07:18:01 GMT
Subject: Integrated: JDK-8287094: IGV: show node input numbers in edge tooltips
In-Reply-To:
References:
Message-ID:

On Fri, 24 Jun 2022 09:29:34 GMT, Tobias Holenstein wrote:

> For nodes with many inputs, such as safepoints, it is difficult and error-prone to figure out the exact input number of a given incoming edge.
>
> Extend the Edge Tooltips to include the input number of the destination node:
> **Before** `91 Addl -> 92 SafePoint`
> **Now** `91 Addl -> 92 SafePoint [NR]`
> ![edge](https://user-images.githubusercontent.com/71546117/175506945-6f5137d2-7647-4acb-a135-8fcb719df3e6.png)

This pull request has now been integrated.

Changeset: 28c5e483 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/28c5e483a80e0291bc784488ea15545dbecb257d Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod 8287094: IGV: show node input numbers in edge tooltips Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/9273 From xgong at openjdk.org Thu Jun 30 07:28:43 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 30 Jun 2022 07:28:43 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v7] In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 07:06:40 GMT, Xiaohong Gong wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. >> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. 
To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. >> >> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. >> >> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: >> >> Benchmark size Gain >> Byte256Vector.ADDLanes 1024 0.999 >> Byte256Vector.ANDLanes 1024 1.065 >> Byte256Vector.MAXLanes 1024 1.064 >> Byte256Vector.MINLanes 1024 1.062 >> Byte256Vector.ORLanes 1024 1.072 >> Byte256Vector.XORLanes 1024 1.041 >> Short256Vector.ADDLanes 1024 1.017 >> Short256Vector.ANDLanes 1024 1.044 >> Short256Vector.MAXLanes 1024 1.049 >> Short256Vector.MINLanes 1024 1.049 >> Short256Vector.ORLanes 1024 1.089 >> Short256Vector.XORLanes 1024 1.047 >> Int256Vector.ADDLanes 1024 1.045 >> Int256Vector.ANDLanes 1024 1.078 >> Int256Vector.MAXLanes 1024 1.123 >> Int256Vector.MINLanes 1024 1.129 >> Int256Vector.ORLanes 1024 1.078 >> Int256Vector.XORLanes 1024 1.072 >> Long256Vector.ADDLanes 1024 1.059 >> Long256Vector.ANDLanes 1024 1.101 >> Long256Vector.MAXLanes 1024 1.079 >> Long256Vector.MINLanes 1024 1.099 >> Long256Vector.ORLanes 1024 1.098 >> Long256Vector.XORLanes 1024 1.110 >> Float256Vector.ADDLanes 1024 1.033 >> Float256Vector.MAXLanes 1024 1.156 >> Float256Vector.MINLanes 1024 1.151 >> Double256Vector.ADDLanes 1024 1.062 >> Double256Vector.MAXLanes 1024 1.145 >> Double256Vector.MINLanes 1024 1.140 >> >> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. 
So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: >> >> sxtw x14, w14 >> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Fix assembler test issue Hi @nick-arm , could you please help to take a look at the aarch64 part changes? Thanks so much for your time! ------------- PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.org Thu Jun 30 07:38:41 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 30 Jun 2022 07:38:41 GMT Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations In-Reply-To: References: Message-ID: On Mon, 20 Jun 2022 07:50:09 GMT, Xiaohong Gong wrote: > This patch adds the following transformations for vector logic operations such as "`AndV, OrV, XorV`", incuding: > > (AndV v (Replicate m1)) => v > (AndV v (Replicate zero)) => Replicate zero > (AndV v v) => v > > (OrV v (Replicate m1)) => Replicate m1 > (OrV v (Replicate zero)) => v > (OrV v v) => v > > (XorV v v) => Replicate zero > > where "`m1`" is the integer constant -1, together with the same optimizations for vector mask operations like "`AndVMask, OrVMask, XorVMask`". > Thanks so much for the advice @jatin-bhateja ! I basically agree with this idea! It seems the similar optimization can also be applied to other binary arithmetic vector operations like `add, sub, mul, div, shift` ? So do you think it's better we create another patch special to handle the constant folding for such vector nodes? We'd better find a better way to handle this while not add the same transformation for each node. WDYT? 
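Archive note: the Identity/Ideal rules listed in the quoted description of PR 9211 can be sanity-checked lane by lane on scalars. The following is a minimal sketch under that assumption; the class and method names are hypothetical, and it models only the bitwise facts, not C2's `AndV`/`OrV`/`XorV` nodes.

```java
/** Hypothetical lane-wise model of the AndV/OrV/XorV identities; not C2 code. */
final class VectorLogicIdentities {

    /** The "m1" constant from the description: all bits set. */
    static final int M1 = -1;

    /** Checks every listed identity for a single int lane value. */
    static boolean holdsFor(int v) {
        return (v & M1) == v      // (AndV v (Replicate m1))   => v
            && (v & 0) == 0       // (AndV v (Replicate zero)) => Replicate zero
            && (v & v) == v       // (AndV v v)                => v
            && (v | M1) == M1     // (OrV v (Replicate m1))    => Replicate m1
            && (v | 0) == v       // (OrV v (Replicate zero))  => v
            && (v | v) == v       // (OrV v v)                 => v
            && (v ^ v) == 0;      // (XorV v v)                => Replicate zero
    }

    /** A vector satisfies the identities iff each of its lanes does. */
    static boolean holdsForAll(int[] lanes) {
        for (int lane : lanes) {
            if (!holdsFor(lane)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] lanes = {0, 1, -1, 42, Integer.MIN_VALUE, Integer.MAX_VALUE};
        System.out.println(holdsForAll(lanes)); // prints true
    }
}
```

Because each rule is a per-lane bitwise fact, the same check applies unchanged to the mask variants (`AndVMask`, `OrVMask`, `XorVMask`) if a lane is read as a single bit.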
------------- PR: https://git.openjdk.org/jdk/pull/9211 From xgong at openjdk.org Thu Jun 30 07:38:43 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 30 Jun 2022 07:38:43 GMT Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 04:54:55 GMT, Vladimir Kozlov wrote: > Tobias already ran testing and results are good. Thanks for the review and testing @vnkozlov ! ------------- PR: https://git.openjdk.org/jdk/pull/9211 From chagedorn at openjdk.org Thu Jun 30 07:45:49 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 30 Jun 2022 07:45:49 GMT Subject: [jdk19] RFR: 8284358: Unreachable loop is not removed from C2 IR, leading to a broken graph In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 15:16:03 GMT, Tobias Hartmann wrote: > Similar to https://github.com/openjdk/jdk/pull/425 and https://github.com/openjdk/jdk/pull/649, entry control to a loop `RegionNode` dies right after parsing (during first IGVN) but the dead loop is not detected/removed. This dead loop then keeps a subgraph alive, which leads to two different failures in later optimization phases that are described below. > > I assumed that such dead loops should always be detected, but to avoid a full reachability analysis (graph walk to root), C2 only detects and removes "unsafe" dead loops, i.e., dead loops that might cause issues for later optimization phases and should therefore be aggressively removed. 
See `RegionNode::Ideal` -> `RegionNode::is_unreachable_region` -> `RegionNode::is_possible_unsafe_loop`: > > https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L541-L549 > > https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L327-L331 > > Here is a detailed description of the two failures and the corresponding fixes: > > 1) `No reachable node should have no use` assert at the end of optimizations (introduced by [JDK-8263577](https://bugs.openjdk.org/browse/JDK-8263577)): > > At the beginning of CCP, the types of all nodes are initialized to `top`. Since the following subgraph is not reachable from root due to a dead loop above in the CFG, the types of all unreachable nodes remain top: > ![1_BeforeCCP](https://user-images.githubusercontent.com/5312595/176446327-e6fdee4d-49ea-4406-9b15-b29366cd9f55.png) > > The `Rethrow`, `Phis` and `Region` are removed during IGVN because they are `top` but the `292 CatchProj` remains: > > ![3_BarrierExpand](https://user-images.githubusercontent.com/5312595/176446385-0374b6ba-7c0b-447d-90f9-c73e3aee4918.png) > > We then hit the assert because the `CatchProj` has no user. Similar to how https://github.com/openjdk/jdk/pull/3012 was fixed, we need to make sure that when `RegionNode` inputs are cut off because their types are `top`, they are added to the IGVN worklist (see change in `cfgnode.cpp:504`). With that, the entire dead subgraph is removed. 
> > 2) `Unknown node on this path` assert while walking the memory graph during scalar replacement: > > After parsing, the `167 Region` that belongs to a loop loses entry control (marked in red): > ![2_Diff_Parsing_IGVN](https://user-images.githubusercontent.com/5312595/176453465-95f48c16-6cb7-4373-baa8-edf5e4fbcde2.png) > > The dead loop is not detected/removed because it's not considered "unsafe" since the Phis of the dying Region only have a Call user which is considered safe: > > https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L352-L355 > > ![DyingRegion](https://user-images.githubusercontent.com/5312595/176469880-f81a7d7e-b769-444a-bf5b-14f8cca1f9af.png) > > The same can happen with other CFG users (for example, MemBars or Allocates). These scenarios are also covered by the regression test. Later during IGVN, `309 Region` which is part of the now dead subgraph is processed and found to be potentially "unsafe" and unreachable from root: > > ![1_AfterParsing](https://user-images.githubusercontent.com/5312595/176453110-8a4a587f-f1ef-45bf-8a68-e476f142aa7e.png) > > It's then removed together with its Phi users, leaving `505 MergeMem` with a top memory input: > > ![3_MacroExpansion](https://user-images.githubusercontent.com/5312595/176461343-ab446fe0-04a8-48a5-95c2-c8ead6c872cf.png) > > We then hit the assert when encountering a top memory input while walking the memory graph during scalar replacement. > > The root cause of the failure is an only partially removed dead subgraph. A similar issue has been fixed long ago by [JDK-8075922](https://bugs.openjdk.org/browse/JDK-8075922), but the fix is incomplete. I propose to aggressively remove such dead subgraphs by walking up the CFG when detecting an unreachable Region belonging to an "unsafe" loop and replacing all nodes by `top`. > > Special thanks to Christian Hagedorn for helping me with finding a regression test. 
> > Thanks, > Tobias Nice analysis and finding more tests to cover all the cases! Looks good to me! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk19/pull/92 From ngasson at openjdk.org Thu Jun 30 08:32:52 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Thu, 30 Jun 2022 08:32:52 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v7] In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 07:06:40 GMT, Xiaohong Gong wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. >> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. 
The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. >> >> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. >> >> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: >> >> Benchmark size Gain >> Byte256Vector.ADDLanes 1024 0.999 >> Byte256Vector.ANDLanes 1024 1.065 >> Byte256Vector.MAXLanes 1024 1.064 >> Byte256Vector.MINLanes 1024 1.062 >> Byte256Vector.ORLanes 1024 1.072 >> Byte256Vector.XORLanes 1024 1.041 >> Short256Vector.ADDLanes 1024 1.017 >> Short256Vector.ANDLanes 1024 1.044 >> Short256Vector.MAXLanes 1024 1.049 >> Short256Vector.MINLanes 1024 1.049 >> Short256Vector.ORLanes 1024 1.089 >> Short256Vector.XORLanes 1024 1.047 >> Int256Vector.ADDLanes 1024 1.045 >> Int256Vector.ANDLanes 1024 1.078 >> Int256Vector.MAXLanes 1024 1.123 >> Int256Vector.MINLanes 1024 1.129 >> Int256Vector.ORLanes 1024 1.078 >> Int256Vector.XORLanes 1024 1.072 >> Long256Vector.ADDLanes 1024 1.059 >> Long256Vector.ANDLanes 1024 1.101 >> Long256Vector.MAXLanes 1024 1.079 >> Long256Vector.MINLanes 1024 1.099 >> Long256Vector.ORLanes 1024 1.098 >> Long256Vector.XORLanes 1024 1.110 >> Float256Vector.ADDLanes 1024 1.033 >> Float256Vector.MAXLanes 1024 1.156 >> Float256Vector.MINLanes 1024 1.151 >> Double256Vector.ADDLanes 1024 1.062 >> Double256Vector.MAXLanes 1024 1.145 >> Double256Vector.MINLanes 1024 1.140 >> >> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. 
So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: >> >> sxtw x14, w14 >> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Fix assembler test issue src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1409: > 1407: // Encode to "whilelow" for the remaining cases. > 1408: mov(rscratch1, lane_cnt); > 1409: sve_whilelow(dst, size, zr, rscratch1); Why not move these two lines into an `else {` block and then you can get rid of the early returns on lines 1398, 1401, and 1404? I think that would make the logic easier to follow. ------------- PR: https://git.openjdk.org/jdk/pull/9037 From ngasson at openjdk.org Thu Jun 30 08:35:46 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Thu, 30 Jun 2022 08:35:46 GMT Subject: RFR: 8287984: AArch64: [vector] Make all bits set vector sharable for match rules In-Reply-To: References: Message-ID: On Mon, 27 Jun 2022 01:37:03 GMT, Xiaohong Gong wrote: > We have the optimized rules for vector not/and_not in NEON and SVE, like: > > > match(Set dst (XorV src (ReplicateB m1))) ; vector not > match(Set dst (AndV src1 (XorV src2 (ReplicateB m1)))) ; vector and_not > > > where "`m1`" is a ConI node with value -1. And we also have the similar rules for vector mask in SVE like: > > > match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1)))) ; mask and_not > > > These rules are not easy to be matched since the "`Replicate`" or "`MaskAll`" node is usually not single used for the `not/and_not` operation. To make these rules be matched as expected, this patch adds the vector (mask) "`not`" pattern to `Matcher::pd_clone_node()` which makes the all bits set vector `(Replicate/MaskAll)` sharable during matching rules. The `cpu/aarch64` changes look OK. ------------- Marked as reviewed by ngasson (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9292 From xgong at openjdk.org Thu Jun 30 08:35:53 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 30 Jun 2022 08:35:53 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v7] In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 08:28:47 GMT, Nick Gasson wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix assembler test issue > > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1409: > >> 1407: // Encode to "whilelow" for the remaining cases. >> 1408: mov(rscratch1, lane_cnt); >> 1409: sve_whilelow(dst, size, zr, rscratch1); > > Why not move these two lines into an `else {` block and then you can get rid of the early returns on lines 1398, 1401, and 1404? I think that would make the logic easier to follow. Good idea! Thanks, I will change this later. ------------- PR: https://git.openjdk.org/jdk/pull/9037 From aph at openjdk.org Thu Jun 30 08:38:32 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 30 Jun 2022 08:38:32 GMT Subject: RFR: 8289060: Undefined Behaviour in class VMReg [v4] In-Reply-To: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> References: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> Message-ID: <-g2xnAbFWX2EJ_Pg728btZGewLTHvJ_ynehX_VC67qw=.1e85a6d5-2896-4d36-8b1c-6f9ae29ad2dd@github.com> > Like class `Register`, class `VMReg` exhibits undefined behaviour, in particular null pointer dereferences. > > The right way to fix this is simple: make instances of `VMReg` point to reified instances of `VMRegImpl`. We do this by creating a static array of `VMRegImpl`, and making all `VMReg` instances point into it, making the code well defined. 
> > However, while `VMReg` instances are no longer null, and so do not generate compile warnings or errors, there is still a problem in that higher-numbered `VMReg` instances point outside the static array of `VMRegImpl`. This is hard to avoid, given that (as far as I can tell) there is no upper limit on the number of stack slots that can be allocated as `VMReg` instances. While this is in theory UB, it's not likely to cause problems. We could fix this by creating a much larger static array of `VMRegImpl`, up to the largest plausible size of stack offsets. > > We could instead make `VMReg` instances objects with a single numeric field rather than pointers, but some C++ compilers pass all such objects by reference, so I don't think we should. Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: 8289060: Undefined Behaviour in class VMReg ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9276/files - new: https://git.openjdk.org/jdk/pull/9276/files/62c71eeb..1cf884f1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9276&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9276&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9276.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9276/head:pull/9276 PR: https://git.openjdk.org/jdk/pull/9276 From thartmann at openjdk.org Thu Jun 30 08:40:28 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 30 Jun 2022 08:40:28 GMT Subject: [jdk19] RFR: 8284358: Unreachable loop is not removed from C2 IR, leading to a broken graph In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 15:16:03 GMT, Tobias Hartmann wrote: > Similar to https://github.com/openjdk/jdk/pull/425 and https://github.com/openjdk/jdk/pull/649, entry control to a loop `RegionNode` dies right after parsing (during first IGVN) but the dead loop is not detected/removed. 
This dead loop then keeps a subgraph alive, which leads to two different failures in later optimization phases that are described below. > > I assumed that such dead loops should always be detected, but to avoid a full reachability analysis (graph walk to root), C2 only detects and removes "unsafe" dead loops, i.e., dead loops that might cause issues for later optimization phases and should therefore be aggressively removed. See `RegionNode::Ideal` -> `RegionNode::is_unreachable_region` -> `RegionNode::is_possible_unsafe_loop`: > > https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L541-L549 > > https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L327-L331 > > Here is a detailed description of the two failures and the corresponding fixes: > > 1) `No reachable node should have no use` assert at the end of optimizations (introduced by [JDK-8263577](https://bugs.openjdk.org/browse/JDK-8263577)): > > At the beginning of CCP, the types of all nodes are initialized to `top`. Since the following subgraph is not reachable from root due to a dead loop above in the CFG, the types of all unreachable nodes remain top: > ![1_BeforeCCP](https://user-images.githubusercontent.com/5312595/176446327-e6fdee4d-49ea-4406-9b15-b29366cd9f55.png) > > The `Rethrow`, `Phis` and `Region` are removed during IGVN because they are `top` but the `292 CatchProj` remains: > > ![3_BarrierExpand](https://user-images.githubusercontent.com/5312595/176446385-0374b6ba-7c0b-447d-90f9-c73e3aee4918.png) > > We then hit the assert because the `CatchProj` has no user. Similar to how https://github.com/openjdk/jdk/pull/3012 was fixed, we need to make sure that when `RegionNode` inputs are cut off because their types are `top`, they are added to the IGVN worklist (see change in `cfgnode.cpp:504`). With that, the entire dead subgraph is removed. 
> > 2) `Unknown node on this path` assert while walking the memory graph during scalar replacement: > > After parsing, the `167 Region` that belongs to a loop loses entry control (marked in red): > ![2_Diff_Parsing_IGVN](https://user-images.githubusercontent.com/5312595/176453465-95f48c16-6cb7-4373-baa8-edf5e4fbcde2.png) > > The dead loop is not detected/removed because it's not considered "unsafe" since the Phis of the dying Region only have a Call user which is considered safe: > > https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L352-L355 > > ![DyingRegion](https://user-images.githubusercontent.com/5312595/176469880-f81a7d7e-b769-444a-bf5b-14f8cca1f9af.png) > > The same can happen with other CFG users (for example, MemBars or Allocates). These scenarios are also covered by the regression test. Later during IGVN, `309 Region` which is part of the now dead subgraph is processed and found to be potentially "unsafe" and unreachable from root: > > ![1_AfterParsing](https://user-images.githubusercontent.com/5312595/176453110-8a4a587f-f1ef-45bf-8a68-e476f142aa7e.png) > > It's then removed together with its Phi users, leaving `505 MergeMem` with a top memory input: > > ![3_MacroExpansion](https://user-images.githubusercontent.com/5312595/176461343-ab446fe0-04a8-48a5-95c2-c8ead6c872cf.png) > > We then hit the assert when encountering a top memory input while walking the memory graph during scalar replacement. > > The root cause of the failure is an only partially removed dead subgraph. A similar issue has been fixed long ago by [JDK-8075922](https://bugs.openjdk.org/browse/JDK-8075922), but the fix is incomplete. I propose to aggressively remove such dead subgraphs by walking up the CFG when detecting an unreachable Region belonging to an "unsafe" loop and replacing all nodes by `top`. > > Special thanks to Christian Hagedorn for helping me with finding a regression test. 
> > Thanks, > Tobias Thanks for the review, Christian! ------------- PR: https://git.openjdk.org/jdk19/pull/92 From jbhateja at openjdk.org Thu Jun 30 08:48:25 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 30 Jun 2022 08:48:25 GMT Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls [v2] In-Reply-To: <1aiCitX9Awl030q7myghYyOwZNfqJMIdCMmGm9jfoOQ=.2a5a7cd7-9e57-4538-8e22-7a9cf7523343@github.com> References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> <1aiCitX9Awl030q7myghYyOwZNfqJMIdCMmGm9jfoOQ=.2a5a7cd7-9e57-4538-8e22-7a9cf7523343@github.com> Message-ID: On Wed, 29 Jun 2022 21:16:22 GMT, Evgeny Astigeevich wrote: >>> > GHA testing is not clean. >>> > I looked through changes and they seem logically correct. Need more testing. I will wait when GHA is clean. >>> >>> Vladimir(@vnkozlov), Have you got testing results? >> >> What I meant is that I will not submit my own testing until GitHub action testing is clean. Which is not which means something is wrong with changes: >> https://github.com/openjdk/jdk/pull/8816/checks?check_run_id=6998367114 >> >> Please, fix issues and update to latest JDK sources. > > @vnkozlov, with updating to the latest sources everything passed: https://github.com/eastig/jdk/actions/runs/2583924985 Hi @eastig , Are these memory saving shown on Renaissance with in-lining disabled ? Since static methods resolutions happen at compile time smaller methods may get inlined thus removing emission of stub. We are improving memory footprint of a method in code cache, does it also leads to some improvement in bench mark throughput by any means ? Once the code cache is full runtime attempts allocation extension if reserved code cache has space, followed by a [slow path of disabling the compilation.](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/codeCache.cpp#L595) which may impact the performance. 
------------- PR: https://git.openjdk.org/jdk/pull/8816 From jbhateja at openjdk.org Thu Jun 30 08:52:42 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 30 Jun 2022 08:52:42 GMT Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations In-Reply-To: References: Message-ID: <5qp3iooejJfUnaGIbVWAdyyA0wEvT3EBQbRCim3UwxY=.25b28a11-af97-49b2-ab05-427cea59841b@github.com> On Mon, 20 Jun 2022 07:50:09 GMT, Xiaohong Gong wrote: > This patch adds the following transformations for vector logic operations such as "`AndV, OrV, XorV`", incuding: > > (AndV v (Replicate m1)) => v > (AndV v (Replicate zero)) => Replicate zero > (AndV v v) => v > > (OrV v (Replicate m1)) => Replicate m1 > (OrV v (Replicate zero)) => v > (OrV v v) => v > > (XorV v v) => Replicate zero > > where "`m1`" is the integer constant -1, together with the same optimizations for vector mask operations like "`AndVMask, OrVMask, XorVMask`". Marked as reviewed by jbhateja (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/9211 From jbhateja at openjdk.org Thu Jun 30 08:52:43 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 30 Jun 2022 08:52:43 GMT Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 07:35:08 GMT, Xiaohong Gong wrote: > > > > Thanks so much for the advice @jatin-bhateja ! I basically agree with this idea! It seems the similar optimization can also be applied to other binary arithmetic vector operations like `add, sub, mul, div, shift` ? So do you think it's better we create another patch special to handle the constant folding for such vector nodes? We'd better find a better way to handle this while not add the same transformation for each node. WDYT? Agree, addressing it in subsequent PR along with other operations should be ok. Thanks. LGTM. 
-------------

PR: https://git.openjdk.org/jdk/pull/9211

From xgong at openjdk.org Thu Jun 30 08:52:44 2022
From: xgong at openjdk.org (Xiaohong Gong)
Date: Thu, 30 Jun 2022 08:52:44 GMT
Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations
In-Reply-To:
References:
Message-ID:

On Thu, 30 Jun 2022 08:47:55 GMT, Jatin Bhateja wrote:

>>>
>>
>> Thanks so much for the advice @jatin-bhateja ! I basically agree with this idea! It seems the similar optimization can also be applied to other binary arithmetic vector operations like `add, sub, mul, div, shift` ? So do you think it's better we create another patch special to handle the constant folding for such vector nodes? We'd better find a better way to handle this while not add the same transformation for each node. WDYT?
>
>> >
>>
>> Thanks so much for the advice @jatin-bhateja ! I basically agree with this idea! It seems the similar optimization can also be applied to other binary arithmetic vector operations like `add, sub, mul, div, shift` ? So do you think it's better we create another patch special to handle the constant folding for such vector nodes? We'd better find a better way to handle this while not add the same transformation for each node. WDYT?
>
> Agree, addressing it in subsequent PR along with other operations should be ok. Thanks.
>
> LGTM.

Thanks for the review @jatin-bhateja !
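
[Archive note] The identities listed in the patch description for 8288294 can be checked lane by lane on plain integers. The following is a minimal sketch in C++, not HotSpot code: the 8-lane `Vec` alias and the helpers `replicate`, `and_v`, `or_v`, `xor_v` and `identities_hold` are names assumed for this illustration only and do not correspond to C2 node classes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 8-lane int vector, used only for this illustration.
using Vec = std::array<int32_t, 8>;

// Broadcast one scalar to every lane (stands in for "Replicate").
static Vec replicate(int32_t x) {
  Vec r;
  r.fill(x);
  return r;
}

// Apply a binary scalar operation to each pair of lanes.
template <typename Op>
static Vec lanewise(const Vec& a, const Vec& b, Op op) {
  Vec r{};
  for (std::size_t i = 0; i < r.size(); i++) {
    r[i] = op(a[i], b[i]);
  }
  return r;
}

static Vec and_v(const Vec& a, const Vec& b) {
  return lanewise(a, b, [](int32_t x, int32_t y) { return x & y; });
}

static Vec or_v(const Vec& a, const Vec& b) {
  return lanewise(a, b, [](int32_t x, int32_t y) { return x | y; });
}

static Vec xor_v(const Vec& a, const Vec& b) {
  return lanewise(a, b, [](int32_t x, int32_t y) { return x ^ y; });
}

// Returns true if all seven identities from the description hold for v.
static bool identities_hold(const Vec& v) {
  const Vec m1 = replicate(-1);   // all bits set in every lane
  const Vec zero = replicate(0);
  return and_v(v, m1) == v
      && and_v(v, zero) == zero
      && and_v(v, v) == v
      && or_v(v, m1) == m1
      && or_v(v, zero) == v
      && or_v(v, v) == v
      && xor_v(v, v) == zero;
}
```

Calling `identities_hold` on any `Vec` should return true, because each identity holds bit by bit regardless of the lane values; that is why the compiler can apply the rewrite without knowing `v`.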
------------- PR: https://git.openjdk.org/jdk/pull/9211 From xgong at openjdk.org Thu Jun 30 08:56:59 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 30 Jun 2022 08:56:59 GMT Subject: RFR: 8287984: AArch64: [vector] Make all bits set vector sharable for match rules In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 04:50:53 GMT, Vladimir Kozlov wrote: >> We have the optimized rules for vector not/and_not in NEON and SVE, like: >> >> >> match(Set dst (XorV src (ReplicateB m1))) ; vector not >> match(Set dst (AndV src1 (XorV src2 (ReplicateB m1)))) ; vector and_not >> >> >> where "`m1`" is a ConI node with value -1. And we also have the similar rules for vector mask in SVE like: >> >> >> match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1)))) ; mask and_not >> >> >> These rules are not easy to be matched since the "`Replicate`" or "`MaskAll`" node is usually not single used for the `not/and_not` operation. To make these rules be matched as expected, this patch adds the vector (mask) "`not`" pattern to `Matcher::pd_clone_node()` which makes the all bits set vector `(Replicate/MaskAll)` sharable during matching rules. > > My testing passed. Thanks for looking at this PR @vnkozlov @nick-arm ! ------------- PR: https://git.openjdk.org/jdk/pull/9292 From xgong at openjdk.org Thu Jun 30 08:57:02 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 30 Jun 2022 08:57:02 GMT Subject: Integrated: 8287984: AArch64: [vector] Make all bits set vector sharable for match rules In-Reply-To: References: Message-ID: On Mon, 27 Jun 2022 01:37:03 GMT, Xiaohong Gong wrote: > We have the optimized rules for vector not/and_not in NEON and SVE, like: > > > match(Set dst (XorV src (ReplicateB m1))) ; vector not > match(Set dst (AndV src1 (XorV src2 (ReplicateB m1)))) ; vector and_not > > > where "`m1`" is a ConI node with value -1. 
And we also have the similar rules for vector mask in SVE like: > > > match(Set pd (AndVMask pn (XorVMask pm (MaskAll m1)))) ; mask and_not > > > These rules are not easy to be matched since the "`Replicate`" or "`MaskAll`" node is usually not single used for the `not/and_not` operation. To make these rules be matched as expected, this patch adds the vector (mask) "`not`" pattern to `Matcher::pd_clone_node()` which makes the all bits set vector `(Replicate/MaskAll)` sharable during matching rules. This pull request has now been integrated. Changeset: 1305fb5c Author: Xiaohong Gong URL: https://git.openjdk.org/jdk/commit/1305fb5ca8e4ca6aa082293e4444fb7de1b1652c Stats: 138 lines in 3 files changed: 127 ins; 1 del; 10 mod 8287984: AArch64: [vector] Make all bits set vector sharable for match rules Reviewed-by: kvn, ngasson ------------- PR: https://git.openjdk.org/jdk/pull/9292 From xgong at openjdk.org Thu Jun 30 10:56:38 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 30 Jun 2022 10:56:38 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v8] In-Reply-To: References: Message-ID: > VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. > > For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. > > Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. 
This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop.
>
> Here is an example for vector load and add reduction inside a loop:
>
>     ptrue   p0.s, vl8             ; mask generation
>     ld1w    {z16.s}, p0/z, [x14]  ; load vector
>
>     ptrue   p0.s, vl8             ; mask generation
>     uaddv   d17, p0, z16.s        ; add reduction
>     smov    x14, v17.s[0]
>
> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop.
>
> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out.
>
> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system:
>
>     Benchmark                  size   Gain
>     Byte256Vector.ADDLanes     1024   0.999
>     Byte256Vector.ANDLanes     1024   1.065
>     Byte256Vector.MAXLanes     1024   1.064
>     Byte256Vector.MINLanes     1024   1.062
>     Byte256Vector.ORLanes      1024   1.072
>     Byte256Vector.XORLanes     1024   1.041
>     Short256Vector.ADDLanes    1024   1.017
>     Short256Vector.ANDLanes    1024   1.044
>     Short256Vector.MAXLanes    1024   1.049
>     Short256Vector.MINLanes    1024   1.049
>     Short256Vector.ORLanes     1024   1.089
>     Short256Vector.XORLanes    1024   1.047
>     Int256Vector.ADDLanes      1024   1.045
>     Int256Vector.ANDLanes      1024   1.078
>     Int256Vector.MAXLanes      1024   1.123
>     Int256Vector.MINLanes      1024   1.129
>     Int256Vector.ORLanes       1024   1.078
>     Int256Vector.XORLanes      1024   1.072
>     Long256Vector.ADDLanes     1024   1.059
>     Long256Vector.ANDLanes     1024   1.101
>     Long256Vector.MAXLanes     1024   1.079
>     Long256Vector.MINLanes     1024   1.099
>     Long256Vector.ORLanes      1024   1.098
>     Long256Vector.XORLanes     1024   1.110
>     Float256Vector.ADDLanes    1024   1.033
>     Float256Vector.MAXLanes    1024   1.156
>     Float256Vector.MINLanes    1024   1.151
>     Double256Vector.ADDLanes   1024   1.062
>     Double256Vector.MAXLanes   1024   1.145
>     Double256Vector.MINLanes   1024   1.140
>
> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below:
>
>     sxtw    x14, w14
>     whilelo p0.s, xzr, x14  =>  whilelo p0.s, wzr, w14

Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:

  Address review comments

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/9037/files
  - new: https://git.openjdk.org/jdk/pull/9037/files/061a19fb..8bda7813

Webrevs:
  - full: https://webrevs.openjdk.org/?repo=jdk&pr=9037&range=07
  - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9037&range=06-07

  Stats: 10 lines in 1 file changed: 3 ins; 6 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/9037.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9037/head:pull/9037

PR: https://git.openjdk.org/jdk/pull/9037

From xgong at openjdk.org Thu Jun 30 11:07:09 2022
From: xgong at openjdk.org (Xiaohong Gong)
Date: Thu, 30 Jun 2022 11:07:09 GMT
Subject: RFR: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations [v2]
In-Reply-To:
References:
Message-ID:

> This patch adds the following transformations for vector logic operations such as "`AndV, OrV, XorV`", including:
>
>     (AndV v (Replicate m1)) => v
>     (AndV v (Replicate zero)) => Replicate zero
>     (AndV v v) => v
>
>     (OrV v (Replicate m1)) => Replicate m1
>     (OrV v (Replicate zero)) => v
>     (OrV v v) => v
>
>     (XorV v v) => Replicate zero
>
> where "`m1`" is the integer constant -1, together with the same optimizations for vector mask operations like "`AndVMask, OrVMask, XorVMask`".

Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains two commits: - Merge branch 'jdk:master' into JDK-8288294 - 8288294: [vector] Add Identity/Ideal transformations for vector logic operations ------------- Changes: https://git.openjdk.org/jdk/pull/9211/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9211&range=01 Stats: 639 lines in 4 files changed: 629 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/9211.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9211/head:pull/9211 PR: https://git.openjdk.org/jdk/pull/9211 From aph at openjdk.org Thu Jun 30 13:51:03 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 30 Jun 2022 13:51:03 GMT Subject: RFR: 8289060: Undefined Behaviour in class VMReg In-Reply-To: <_lP7-1R69GHQ1ETdUxb_motCZoWus5aiaCYFtvySDJg=.8158500b-ea2c-441c-b98b-48d671fdef76@github.com> References: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> <_lP7-1R69GHQ1ETdUxb_motCZoWus5aiaCYFtvySDJg=.8158500b-ea2c-441c-b98b-48d671fdef76@github.com> Message-ID: On Mon, 27 Jun 2022 15:01:42 GMT, Andrew Haley wrote: > > I think the patch looks good overall, but it looks like there are some failures in some of the SA tests. > > Right. I'll start digging. Fixed now. ------------- PR: https://git.openjdk.org/jdk/pull/9276 From jvernee at openjdk.org Thu Jun 30 13:59:32 2022 From: jvernee at openjdk.org (Jorn Vernee) Date: Thu, 30 Jun 2022 13:59:32 GMT Subject: RFR: 8289060: Undefined Behaviour in class VMReg [v4] In-Reply-To: <-g2xnAbFWX2EJ_Pg728btZGewLTHvJ_ynehX_VC67qw=.1e85a6d5-2896-4d36-8b1c-6f9ae29ad2dd@github.com> References: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> <-g2xnAbFWX2EJ_Pg728btZGewLTHvJ_ynehX_VC67qw=.1e85a6d5-2896-4d36-8b1c-6f9ae29ad2dd@github.com> Message-ID: On Thu, 30 Jun 2022 08:38:32 GMT, Andrew Haley wrote: >> Like class `Register`, class `VMReg` exhibits undefined behaviour, in particular null pointer dereferences. 
>> >> The right way to fix this is simple: make instances of `VMReg` point to reified instances of `VMRegImpl`. We do this by creating a static array of `VMRegImpl`, and making all `VMReg` instances point into it, making the code well defined. >> >> However, while `VMReg` instances are no longer null, and so do not generate compile warnings or errors, there is still a problem in that higher-numbered `VMReg` instances point outside the static array of `VMRegImpl`. This is hard to avoid, given that (as far as I can tell) there is no upper limit on the number of stack slots that can be allocated as `VMReg` instances. While this is in theory UB, it's not likely to cause problems. We could fix this by creating a much larger static array of `VMRegImpl`, up to the largest plausible size of stack offsets. >> >> We could instead make `VMReg` instances objects with a single numeric field rather than pointers, but some C++ compilers pass all such objects by reference, so I don't think we should. > > Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: > > 8289060: Undefined Behaviour in class VMReg LGTM ------------- Marked as reviewed by jvernee (Reviewer). PR: https://git.openjdk.org/jdk/pull/9276 From stuefe at openjdk.org Thu Jun 30 14:30:07 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 30 Jun 2022 14:30:07 GMT Subject: RFR: JDK-8289512: Fix GCC 12 warnings for adlc output_c.cpp Message-ID: This fixes three warnings in my gcc 12 build on Ubuntu 22.04. 
------------- Commit messages: - fix adlc gcc12 build Changes: https://git.openjdk.org/jdk/pull/9335/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9335&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289512 Stats: 13 lines in 1 file changed: 1 ins; 4 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/9335.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9335/head:pull/9335 PR: https://git.openjdk.org/jdk/pull/9335 From kvn at openjdk.org Thu Jun 30 14:35:20 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 14:35:20 GMT Subject: RFR: 8289060: Undefined Behaviour in class VMReg [v4] In-Reply-To: <-g2xnAbFWX2EJ_Pg728btZGewLTHvJ_ynehX_VC67qw=.1e85a6d5-2896-4d36-8b1c-6f9ae29ad2dd@github.com> References: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> <-g2xnAbFWX2EJ_Pg728btZGewLTHvJ_ynehX_VC67qw=.1e85a6d5-2896-4d36-8b1c-6f9ae29ad2dd@github.com> Message-ID: On Thu, 30 Jun 2022 08:38:32 GMT, Andrew Haley wrote: >> Like class `Register`, class `VMReg` exhibits undefined behaviour, in particular null pointer dereferences. >> >> The right way to fix this is simple: make instances of `VMReg` point to reified instances of `VMRegImpl`. We do this by creating a static array of `VMRegImpl`, and making all `VMReg` instances point into it, making the code well defined. >> >> However, while `VMReg` instances are no longer null, and so do not generate compile warnings or errors, there is still a problem in that higher-numbered `VMReg` instances point outside the static array of `VMRegImpl`. This is hard to avoid, given that (as far as I can tell) there is no upper limit on the number of stack slots that can be allocated as `VMReg` instances. While this is in theory UB, it's not likely to cause problems. We could fix this by creating a much larger static array of `VMRegImpl`, up to the largest plausible size of stack offsets. 
>> >> We could instead make `VMReg` instances objects with a single numeric field rather than pointers, but some C++ compilers pass all such objects by reference, so I don't think we should. > > Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: > > 8289060: Undefined Behaviour in class VMReg Looks reasonable. Let me test it. ------------- PR: https://git.openjdk.org/jdk/pull/9276 From kvn at openjdk.org Thu Jun 30 14:51:30 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 14:51:30 GMT Subject: RFR: JDK-8289512: Fix GCC 12 warnings for adlc output_c.cpp In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 13:51:00 GMT, Thomas Stuefe wrote: > This fixes three warnings in my gcc 12 build on Ubuntu 22.04. src/hotspot/share/adlc/output_c.cpp line 527: > 525: ndx+1, element_count, resource_mask); > 526: > 527: // "0x012345678, 0x012345678, 4294967295" please add assert like next: ```assert((9 + 2*masklen + maskdigit) <= 37, "invalid value: masklen=%d, maskdigit=%d", masklen, maskdigit);``` ------------- PR: https://git.openjdk.org/jdk/pull/9335 From stuefe at openjdk.org Thu Jun 30 15:21:36 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 30 Jun 2022 15:21:36 GMT Subject: RFR: JDK-8289512: Fix GCC 12 warnings for adlc output_c.cpp [v2] In-Reply-To: References: Message-ID: > This fixes three warnings in my gcc 12 build on Ubuntu 22.04. 
Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: assert sprint did not overflow ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9335/files - new: https://git.openjdk.org/jdk/pull/9335/files/eef0fe64..c3bdc59b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9335&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9335&range=00-01 Stats: 3 lines in 1 file changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9335.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9335/head:pull/9335 PR: https://git.openjdk.org/jdk/pull/9335 From stuefe at openjdk.org Thu Jun 30 15:21:38 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 30 Jun 2022 15:21:38 GMT Subject: RFR: JDK-8289512: Fix GCC 12 warnings for adlc output_c.cpp [v2] In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 14:48:10 GMT, Vladimir Kozlov wrote: >> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: >> >> assert sprint did not overflow > > src/hotspot/share/adlc/output_c.cpp line 527: > >> 525: ndx+1, element_count, resource_mask); >> 526: >> 527: // "0x012345678, 0x012345678, 4294967295" > > please add assert like next: > ```assert((9 + 2*masklen + maskdigit) <= 37, "invalid value: masklen=%d, maskdigit=%d", masklen, maskdigit);``` I'm not sure that makes sense, since neither masklen nor maskdigit are part of printing anymore. I added a check for overflow, does this test what you wanted? 
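
[Archive note] For readers unfamiliar with the pattern being discussed, one common way to confirm that a formatted write did not overflow its buffer is to inspect the return value of `snprintf`. This is a minimal sketch under stated assumptions and is not the code from the PR: the helper name `format_mask_line`, its parameters, and the format string are hypothetical, and the buffer size of 37 used in the usage note is borrowed from the assert quoted in the review comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: formats two hex masks and a count into buf.
// Returns true only if the whole string plus the terminating NUL fit.
static bool format_mask_line(char* buf, std::size_t buflen,
                             unsigned mask1, unsigned mask2, unsigned count) {
  // snprintf returns the number of characters that would have been
  // written (excluding the NUL), or a negative value on an encoding error.
  int n = std::snprintf(buf, buflen, "0x%08x, 0x%08x, %u", mask1, mask2, count);
  return n >= 0 && static_cast<std::size_t>(n) < buflen;
}
```

With a `char buf[37]`, the call `format_mask_line(buf, sizeof(buf), 0x12345678u, 0x12345678u, 4294967295u)` needs 34 characters plus the NUL and so reports success, while the same call on an 8-byte buffer reports truncation. Checking the return value this way does not depend on precomputing field widths, which may be why a width-based assert no longer applies once those variables are gone.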
------------- PR: https://git.openjdk.org/jdk/pull/9335 From kvn at openjdk.org Thu Jun 30 16:07:46 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 16:07:46 GMT Subject: RFR: JDK-8289512: Fix GCC 12 warnings for adlc output_c.cpp [v2] In-Reply-To: References: Message-ID: <8Ldc39Cx8nywg-ioyszJd6avqnpMne3GduVVW7rhsOA=.c3d6fa42-3a30-44cd-812f-732f4a34b479@github.com> On Thu, 30 Jun 2022 15:21:36 GMT, Thomas Stuefe wrote: >> This fixes three warnings in my gcc 12 build on Ubuntu 22.04. > > Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: > > assert sprint did not overflow Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9335 From kvn at openjdk.org Thu Jun 30 16:07:48 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 16:07:48 GMT Subject: RFR: JDK-8289512: Fix GCC 12 warnings for adlc output_c.cpp [v2] In-Reply-To: References: Message-ID: <6xV8tozLFAxchfts7dzt6tZCn-m1xkAE3tl5WbJJQ5I=.7a79d83a-d247-4a41-b29c-653f2924c360@github.com> On Thu, 30 Jun 2022 15:18:16 GMT, Thomas Stuefe wrote: >> src/hotspot/share/adlc/output_c.cpp line 527: >> >>> 525: ndx+1, element_count, resource_mask); >>> 526: >>> 527: // "0x012345678, 0x012345678, 4294967295" >> >> please add assert like next: >> ```assert((9 + 2*masklen + maskdigit) <= 37, "invalid value: masklen=%d, maskdigit=%d", masklen, maskdigit);``` > > I'm not sure that makes sense, since neither masklen nor maskdigit are part of printing anymore. > > I added a check for overflow, does this test what you wanted? Yes, that is what I wanted. 
------------- PR: https://git.openjdk.org/jdk/pull/9335 From stuefe at openjdk.org Thu Jun 30 17:48:40 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 30 Jun 2022 17:48:40 GMT Subject: RFR: JDK-8289512: Fix GCC 12 warnings for adlc output_c.cpp [v2] In-Reply-To: <8Ldc39Cx8nywg-ioyszJd6avqnpMne3GduVVW7rhsOA=.c3d6fa42-3a30-44cd-812f-732f4a34b479@github.com> References: <8Ldc39Cx8nywg-ioyszJd6avqnpMne3GduVVW7rhsOA=.c3d6fa42-3a30-44cd-812f-732f4a34b479@github.com> Message-ID: <-KFVEl4AU7Z0bjqblCLwjYlALBZZ9ma9LzmQL-e81zM=.e7f4ae1e-6cca-4bcf-afa5-293d4b86add8@github.com> On Thu, 30 Jun 2022 16:04:43 GMT, Vladimir Kozlov wrote: > Good. Thanks, Vladimir. ------------- PR: https://git.openjdk.org/jdk/pull/9335 From kvn at openjdk.org Thu Jun 30 17:53:52 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 30 Jun 2022 17:53:52 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v8] In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 10:56:38 GMT, Xiaohong Gong wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. 
Besides, the mask generation is loop invariant which could be hoisted outside of the loop.
>>
>> Here is an example for vector load and add reduction inside a loop:
>>
>>     ptrue   p0.s, vl8             ; mask generation
>>     ld1w    {z16.s}, p0/z, [x14]  ; load vector
>>
>>     ptrue   p0.s, vl8             ; mask generation
>>     uaddv   d17, p0, z16.s        ; add reduction
>>     smov    x14, v17.s[0]
>>
>> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop.
>>
>> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out.
>>
>> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system:
>>
>>     Benchmark                  size   Gain
>>     Byte256Vector.ADDLanes     1024   0.999
>>     Byte256Vector.ANDLanes     1024   1.065
>>     Byte256Vector.MAXLanes     1024   1.064
>>     Byte256Vector.MINLanes     1024   1.062
>>     Byte256Vector.ORLanes      1024   1.072
>>     Byte256Vector.XORLanes     1024   1.041
>>     Short256Vector.ADDLanes    1024   1.017
>>     Short256Vector.ANDLanes    1024   1.044
>>     Short256Vector.MAXLanes    1024   1.049
>>     Short256Vector.MINLanes    1024   1.049
>>     Short256Vector.ORLanes     1024   1.089
>>     Short256Vector.XORLanes    1024   1.047
>>     Int256Vector.ADDLanes      1024   1.045
>>     Int256Vector.ANDLanes      1024   1.078
>>     Int256Vector.MAXLanes      1024   1.123
>>     Int256Vector.MINLanes      1024   1.129
>>     Int256Vector.ORLanes       1024   1.078
>>     Int256Vector.XORLanes      1024   1.072
>>     Long256Vector.ADDLanes     1024   1.059
>>     Long256Vector.ANDLanes     1024   1.101
>>     Long256Vector.MAXLanes     1024   1.079
>>     Long256Vector.MINLanes     1024   1.099
>>     Long256Vector.ORLanes      1024   1.098
>>     Long256Vector.XORLanes     1024   1.110
>>     Float256Vector.ADDLanes    1024   1.033
>>     Float256Vector.MAXLanes    1024   1.156
>>     Float256Vector.MINLanes    1024   1.151
>>     Double256Vector.ADDLanes   1024   1.062
>>     Double256Vector.MAXLanes   1024   1.145
>>     Double256Vector.MINLanes   1024   1.140
>>
>> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below:
>>
>>     sxtw    x14, w14
>>     whilelo p0.s, xzr, x14  =>  whilelo p0.s, wzr, w14
>
> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
>
>   Address review comments

Another test failed in tier2: compiler/loopopts/superword/TestPickFirstMemoryState.java

Details are in RFE.

-------------

PR: https://git.openjdk.org/jdk/pull/9037

From kvn at openjdk.org Thu Jun 30 19:24:41 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 30 Jun 2022 19:24:41 GMT
Subject: RFR: 8289060: Undefined Behaviour in class VMReg [v4]
In-Reply-To: <-g2xnAbFWX2EJ_Pg728btZGewLTHvJ_ynehX_VC67qw=.1e85a6d5-2896-4d36-8b1c-6f9ae29ad2dd@github.com>
References: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> <-g2xnAbFWX2EJ_Pg728btZGewLTHvJ_ynehX_VC67qw=.1e85a6d5-2896-4d36-8b1c-6f9ae29ad2dd@github.com>
Message-ID:

On Thu, 30 Jun 2022 08:38:32 GMT, Andrew Haley wrote:

>> Like class `Register`, class `VMReg` exhibits undefined behaviour, in particular null pointer dereferences.
>>
>> The right way to fix this is simple: make instances of `VMReg` point to reified instances of `VMRegImpl`. We do this by creating a static array of `VMRegImpl`, and making all `VMReg` instances point into it, making the code well defined.
>>
>> However, while `VMReg` instances are no longer null, and so do not generate compile warnings or errors, there is still a problem in that higher-numbered `VMReg` instances point outside the static array of `VMRegImpl`.
This is hard to avoid, given that (as far as I can tell) there is no upper limit on the number of stack slots that can be allocated as `VMReg` instances. While this is in theory UB, it's not likely to cause problems. We could fix this by creating a much larger static array of `VMRegImpl`, up to the largest plausible size of stack offsets. >> >> We could instead make `VMReg` instances objects with a single numeric field rather than pointers, but some C++ compilers pass all such objects by reference, so I don't think we should. > > Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: > > 8289060: Undefined Behaviour in class VMReg My testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9276 From duke at openjdk.org Thu Jun 30 21:29:39 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Thu, 30 Jun 2022 21:29:39 GMT Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls [v2] In-Reply-To: References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> <1aiCitX9Awl030q7myghYyOwZNfqJMIdCMmGm9jfoOQ=.2a5a7cd7-9e57-4538-8e22-7a9cf7523343@github.com> Message-ID: On Thu, 30 Jun 2022 08:45:21 GMT, Jatin Bhateja wrote: >> @vnkozlov, with updating to the latest sources everything passed: https://github.com/eastig/jdk/actions/runs/2583924985 > > Hi @eastig , > Are these memory saving shown on Renaissance with in-lining disabled ? > Since static methods resolutions happen at compile time smaller methods may get inlined thus removing emission of stub. > > We are improving memory footprint of a method in code cache, does it also leads to some improvement in bench mark throughput by any means ? 
>
> Once the code cache is full runtime attempts allocation extension if reserved code cache has space, followed by a [slow path of disabling the compilation.](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/codeCache.cpp#L595) which may impact the performance.

Hi @jatin-bhateja,

> Are these memory saving shown on Renaissance with in-lining disabled ?

Except for the Java heap size tuning, the JVM was run in the default configuration. No changes to inlining were made.

> Since static methods resolutions happen at compile time smaller methods may get inlined thus removing emission of stub.

You are correct. You can see this in the data: the total number of nmethods with shared stubs. For arm64 a stub to the interpreter is 8 instructions. For x86 it is just 3 instructions: mov, jmp and nop. Or in terms of code size: 32 bytes on arm64 vs 16 bytes on x86. Arm64 benefits more from the patch than x86.

> We are improving memory footprint of a method in code cache, does it also leads to some improvement in bench mark throughput by any means ?

There are a few patches, including this one, which improve the memory footprint of a method. Each of them separately does not show much performance improvement. However, all together they demonstrate performance improvements, especially in benchmarks with tens of thousands of methods. For example, DaCapo eclipse shows ~4% improvement on arm64.

> Once the code cache is full runtime attempts allocation extension if reserved code cache has space, followed by a [slow path of disabling the compilation.](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/codeCache.cpp#L595) which may impact the performance.

It's even worse than this. Andrew mentioned CodeCache thrashing in the PR to change the default CodeCache size from 240M to 127 for arm64. With CodeCache thrashing, compilation does not stop. You constantly throw away compiled code, jump to the interpreter, recompile and jump to the newly compiled code. We've got evidence from a real service that it is not a good idea to reduce the default size. The service had CodeCache thrashing. Performance degradation caused by CodeCache thrashing is huge and really hard to detect.

-------------

PR: https://git.openjdk.org/jdk/pull/8816

From duke at openjdk.org Thu Jun 30 21:35:36 2022
From: duke at openjdk.org (Evgeny Astigeevich)
Date: Thu, 30 Jun 2022 21:35:36 GMT
Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls
In-Reply-To:
References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com>
Message-ID:

On Fri, 17 Jun 2022 09:25:18 GMT, Andrew Haley wrote:

> > If we never patch the branch to the interpreter, we can optimize it at link time either to a direct branch or an adrp based far jump. I also created https://bugs.openjdk.org/browse/JDK-8286142 to reduce metadata mov instructions.
>
> If we emit the address of the interpreter once, at the start of the stub section, we can replace the branch to the interpreter with `ldr rscratch1, adr; br rscratch1`.

@theRealAph, thank you for the idea. I created https://bugs.openjdk.org/browse/JDK-8289057.

-------------

PR: https://git.openjdk.org/jdk/pull/8816